ETC5521: Exploratory Data Analysis

Working with a single variable, making transformations, detecting outliers, using robust statistics

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 4 - Session 2

Bins and Bandwidths

Case study 3 Boston housing data Part 1/4

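library(tidyverse)  # provides read_tsv(), ggplot(), %>% and pivot_longer() used below
data(bostonc, package = "DAAG")
# bostonc holds the raw text lines; I() marks them as literal data rather than file paths (needed in readr >= 2.0)
df3 <- read_tsv(I(bostonc[10:length(bostonc)]))
skimr::skim(df3)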

## ── Data Summary ────────────────────────
##                            Values
## Name                       df3   
## Number of rows             506   
## Number of columns          21    
## _______________________          
## Column type frequency:           
##   character                2     
##   numeric                  19    
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 TOWN                  0             1     4    23     0       92          0
## 2 TRACT                 0             1     4     4     0      506          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##    skim_variable n_missing complete_rate     mean       sd        p0      p25     p50     p75    p100 hist 
##  1 OBS.                  0             1 254.     146.       1       127.     254.    380.    506     ▇▇▇▇▇
##  2 TOWN#                 0             1  47.5     27.6      0        26.2     42      78      91     ▅▆▅▃▇
##  3 LON                   0             1 -71.1      0.0754 -71.3     -71.1    -71.1   -71.0   -70.8   ▁▂▇▂▁
##  4 LAT                   0             1  42.2      0.0618  42.0      42.2     42.2    42.3    42.4   ▁▃▇▃▁
##  5 MEDV                  0             1  22.5      9.20     5        17.0     21.2    25      50     ▂▇▅▁▁
##  6 CMEDV                 0             1  22.5      9.18     5        17.0     21.2    25      50     ▂▇▅▁▁
##  7 CRIM                  0             1   3.61     8.60     0.00632   0.0820   0.257   3.68   89.0   ▇▁▁▁▁
##  8 ZN                    0             1  11.4     23.3      0         0        0      12.5   100     ▇▁▁▁▁
##  9 INDUS                 0             1  11.1      6.86     0.46      5.19     9.69   18.1    27.7   ▇▆▁▇▁
## 10 CHAS                  0             1   0.0692   0.254    0         0        0       0       1     ▇▁▁▁▁
## 11 NOX                   0             1   0.555    0.116    0.385     0.449    0.538   0.624   0.871 ▇▇▆▅▁
## 12 RM                    0             1   6.28     0.703    3.56      5.89     6.21    6.62    8.78  ▁▂▇▂▁
## 13 AGE                   0             1  68.6     28.1      2.9      45.0     77.5    94.1   100     ▂▂▂▃▇
## 14 DIS                   0             1   3.80     2.11     1.13      2.10     3.21    5.19   12.1   ▇▅▂▁▁
## 15 RAD                   0             1   9.55     8.71     1         4        5      24      24     ▇▂▁▁▃
## 16 TAX                   0             1 408.     169.     187       279      330     666     711     ▇▇▃▁▇
## 17 PTRATIO               0             1  18.5      2.16    12.6      17.4     19.0    20.2    22     ▁▃▅▅▇
## 18 B                     0             1 357.      91.3      0.32    375.     391.    396.    397.    ▁▁▁▁▇
## 19 LSTAT                 0             1  12.7      7.14     1.73      6.95    11.4    17.0    38.0   ▇▇▅▂▁

ggplot(df3, aes(MEDV)) + 
  geom_histogram(binwidth = 1, color = "black", fill = "#008A25") + 
  labs(x = "Median housing value (US$1000)", y = "Frequency")

Harrison, David, and Daniel L. Rubinfeld (1978) Hedonic Housing Prices and the Demand for Clean Air, Journal of Environmental Economics and Management 5 81-102. Original data.
Gilley, O.W. and R. Kelley Pace (1996) On the Harrison and Rubinfeld Data. Journal of Environmental Economics and Management 31 403-405. Provided corrections and examined censoring.
Maindonald, John H. and Braun, W. John (2020). DAAG: Data Analysis and Graphics Data and Functions. R package version 1.24

There is a large frequency in the final bin.
There is a decline in observations in the $40-49K range, as well as a dip in observations around $26K and $34K.
The histogram uses a bin width of 1 unit and is left-open (or right-closed): (4.5, 5.5], (5.5, 6.5] ... (49.5, 50.5].
Occasionally, whether the bins are left- or right-open can make a difference.
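
To see why the closed side can matter, here is a minimal sketch that bins a few values sitting exactly on bin edges, using base R's cut(). The values in x are made up for illustration and are not taken from the Boston data. Setting right = TRUE gives left-open (right-closed) bins, and right = FALSE gives right-open (left-closed) bins.

```r
# Hypothetical values, several of which fall exactly on a bin edge
x <- c(5, 5.5, 6.5, 6.5, 7)
breaks <- seq(4.5, 7.5, by = 1)

# Left-open (right-closed) bins: (4.5, 5.5], (5.5, 6.5], (6.5, 7.5]
right_closed <- table(cut(x, breaks = breaks, right = TRUE))

# Right-open (left-closed) bins: [4.5, 5.5), [5.5, 6.5), [6.5, 7.5)
left_closed <- table(cut(x, breaks = breaks, right = FALSE))

right_closed  # counts 2, 2, 1
left_closed   # counts 1, 1, 3
```

The same five values give differently shaped histograms purely because the edge values move to the neighbouring bin. This is most likely to happen with rounded data whose values coincide with the bin breaks.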

Case study 3 Boston housing data Part 2/4

Density plots depend on the bandwidth chosen and, more often than not, estimate poorly near the boundaries of the data.
There are various ways to present features of the data using a plot, and what works for one person may not be as straightforward for another.
Be prepared to do multiple plots!

ggplot(df3, aes(MEDV, y = "")) + 
  geom_boxplot(fill = "#008A25") + 
  labs(x = "Median housing value (US$1000)", y = "") + 
  theme(axis.line.y = element_blank())
ggplot(df3, aes(MEDV, y = "")) + 
  geom_jitter() + 
  labs(x = "Median housing value (US$1000)", y = "") + 
  theme(axis.line.y = element_blank())
ggplot(df3, aes(MEDV)) + 
  geom_density() + 
  geom_rug() + 
  labs(x = "Median housing value (US$1000)", y = "") + 
  theme(axis.line.y = element_blank())
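
The bandwidth and boundary issues noted above can be sketched with base R's density(). This is an illustration under stated assumptions only: the data are simulated exponential values (which cannot be negative), the two bandwidths are arbitrary choices, and mass_below_zero() is a hypothetical helper written for this example.

```r
set.seed(2021)
# Hypothetical data: exponential values, so the true density is zero below 0
x <- rexp(500, rate = 1)

# Same data, two fixed Gaussian-kernel bandwidths
d_narrow <- density(x, bw = 0.05)
d_wide   <- density(x, bw = 0.5)

# Approximate the estimated probability mass placed below the boundary at 0
mass_below_zero <- function(d) {
  step <- diff(d$x[1:2])  # density() evaluates on an equally spaced grid
  sum(d$y[d$x < 0]) * step
}

leak_narrow <- mass_below_zero(d_narrow)
leak_wide   <- mass_below_zero(d_wide)
c(narrow = leak_narrow, wide = leak_wide)
```

The wider bandwidth places more estimated mass in a region where no data can occur. This is one reason a rug, jitter plot or histogram is worth drawing alongside the density.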

Case study 3 Boston housing data Part 3/4

ggplot(df3, aes(PTRATIO)) + 
  geom_histogram(fill = "#9651A0",  color = "black", binwidth = 0.2) + 
  labs(x = "Pupil-teacher ratio by town", y = "",
       title = "Bin width = 0.2, Left-open") 
ggplot(df3, aes(PTRATIO)) + 
  geom_histogram(fill = "#9651A0",  color = "black", binwidth = 0.5) + 
  labs(x = "Pupil-teacher ratio by town", y = "",
       title = "Bin width = 0.5, Left-open") 
ggplot(df3, aes(PTRATIO)) + 
  geom_histogram(fill = "#9651A0",  color = "black", bins = 30) + 
  labs(x = "Pupil-teacher ratio by town", y = "",
       title = "Bin number = 30, Left-open") 
ggplot(df3, aes(PTRATIO)) + 
  geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.2, closed = "left") + 
  labs(x = "Pupil-teacher ratio by town", y = "",
       title = "Bin width = 0.2, Right-open") 
ggplot(df3, aes(PTRATIO)) + 
  geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.5, closed = "left") + 
  labs(x = "Pupil-teacher ratio by town", y = "",
       title = "Bin width = 0.5, Right-open") 
ggplot(df3, aes(PTRATIO)) + 
  geom_histogram(fill = "#9651A0", color = "black",
                 bins = 30, closed = "left") + 
  labs(x = "Pupil-teacher ratio by town", y = "",
       title = "Bin number = 30, Right-open")

Case study 3 Boston housing data Part 4/4

CRIM: per capita crime rate by town
INDUS: proportion of non-retail business acres per town
NOX: nitrogen oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted mean of distances to 5 Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property tax rate per $10K
PTRATIO: pupil-teacher ratio by town
LSTAT: lower status of the population (%)
MEDV: median value of owner-occupied homes in $1000s

df3long <- df3 %>% pivot_longer(MEDV:LSTAT,
                             names_to = "var",
                             values_to = "value") %>% 
  filter(!var %in% c("CHAS", "B", "ZN"))
skimr::skim(df3long)

## ── Data Summary ────────────────────────
##                            Values 
## Name                       df3long
## Number of rows             6072   
## Number of columns          8      
## _______________________           
## Column type frequency:            
##   character                3      
##   numeric                  5      
## ________________________          
## Group variables            None   
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 TOWN                  0             1     4    23     0       92          0
## 2 TRACT                 0             1     4     4     0      506          0
## 3 var                   0             1     2     7     0       12          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean       sd        p0   p25   p50   p75  p100 hist 
## 1 OBS.                  0             1 254.  146.       1       127   254.  380   506   ▇▇▇▇▇
## 2 TOWN#                 0             1  47.5  27.5      0        26    42    78    91   ▅▆▅▃▇
## 3 LON                   0             1 -71.1   0.0753 -71.3     -71.1 -71.1 -71.0 -70.8 ▁▂▇▂▁
## 4 LAT                   0             1  42.2   0.0617  42.0      42.2  42.2  42.3  42.4 ▁▃▇▃▁
## 5 value                 0             1  49.0 120.       0.00632   4    12.3  23.4 711   ▇▁▁▁▁

ggplot(df3long, aes(value)) +
  geom_histogram(color = "white") +
  facet_wrap(~var, scales = "free") + 
  labs(x = "", y = "") + 
  theme(axis.text = element_text(size = 12))

Case study 4 Hidalgo stamps thickness

A stamp collector, Walton von Winkle, bought several collections of Mexican stamps from 1872-1874 and measured the thickness of all of them.
Different bandwidths for the density plot suggest either two or seven modes.

load(here::here("data/Hidalgo1872.rda"))
skimr::skim(Hidalgo1872)

## ── Data Summary ────────────────────────
##                            Values     
## Name                       Hidalgo1872
## Number of rows             485        
## Number of columns          3          
## _______________________               
## Column type frequency:                
##   numeric                  3          
## ________________________              
## Group variables            None       
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   mean      sd    p0    p25   p50   p75  p100 hist 
## 1 thickness             0         1     0.0860 0.0150  0.06  0.075  0.08  0.098 0.131 ▅▇▃▂▁
## 2 thicknessA          195         0.598 0.0922 0.0162  0.068 0.0772 0.092 0.105 0.131 ▇▃▆▃▂
## 3 thicknessB          289         0.404 0.0768 0.00508 0.06  0.072  0.078 0.08  0.097 ▁▃▇▁▁

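ggplot(Hidalgo1872, aes(thickness)) +
  # after_stat() replaces the deprecated stat(); needs ggplot2 >= 3.3.0
  geom_histogram(binwidth = 0.001, aes(y = after_stat(density))) + 
  labs(x = "Thickness (0.001 mm)", y = "Density") + 
  # default bandwidth (bw = "nrd0"); linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_density(color = "#E16A86", linewidth = 2) + 
  # Sheather-Jones bandwidth
  geom_density(color = "#00AD9A", linewidth = 2, bw = "SJ")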
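
How the apparent number of modes changes with the bandwidth selector can be sketched as follows. This is a rough illustration, not the analysis behind the slide: the data are a simulated two-component mixture (not the Hidalgo measurements), count_modes() is a hypothetical helper, and its 1% height cut-off is an arbitrary assumption for ignoring numerical noise in the tails.

```r
set.seed(1872)
# Hypothetical two-component mixture standing in for thickness measurements;
# these are simulated values, not the Hidalgo stamp data
x <- c(rnorm(300, mean = 0.075, sd = 0.004),
       rnorm(185, mean = 0.100, sd = 0.008))

# Count local maxima of a density estimate, ignoring bumps lower than
# min_height times the highest point of the curve
count_modes <- function(d, min_height = 0.01) {
  y <- d$y
  is_peak <- diff(sign(diff(y))) == -2  # a rise followed directly by a fall
  peak_heights <- y[which(is_peak) + 1]
  sum(peak_heights > min_height * max(y))
}

d_default <- density(x, bw = "nrd0")  # rule-of-thumb bandwidth
d_sj      <- density(x, bw = "SJ")    # Sheather-Jones bandwidth

data.frame(
  selector  = c("nrd0", "SJ"),
  bandwidth = c(d_default$bw, d_sj$bw),
  modes     = c(count_modes(d_default), count_modes(d_sj))
)
```

Comparing the bandwidth and modes columns shows how much the impression of modality depends on the selector, so it is worth trying more than one before interpreting the bumps.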

Focus

Case study 5 Movie length

data(movies, package = "ggplot2movies")
skimr::skim(movies)

## ── Data Summary ────────────────────────
##                            Values
## Name                       movies
## Number of rows             58788 
## Number of columns          24    
## _______________________          
## Column type frequency:           
##   character                2     
##   numeric                  22    
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 title                 0             1     1   121     0    56007          0
## 2 mpaa                  0             1     0     5 53864        5          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##    skim_variable n_missing complete_rate          mean           sd    p0      p25       p50        p75        p100 hist 
##  1 year                  0        1          1976.           23.7    1893   1958      1983       1997        2005   ▁▁▃▃▇
##  2 length                0        1            82.3          44.3       1     74        90        100        5220   ▇▁▁▁▁
##  3 budget            53573        0.0887 13412513.     23350085.        0 250000   3000000   15000000   200000000   ▇▁▁▁▁
##  4 rating                0        1             5.93          1.55      1      5         6.1        7          10   ▁▃▇▆▁
##  5 votes                 0        1           632.         3830.        5     11        30        112      157608   ▇▁▁▁▁
##  6 r1                    0        1             7.01         10.9       0      0         4.5        4.5       100   ▇▁▁▁▁
##  7 r2                    0        1             4.02          5.96      0      0         4.5        4.5        84.5 ▇▁▁▁▁
##  8 r3                    0        1             4.72          6.45      0      0         4.5        4.5        84.5 ▇▁▁▁▁
##  9 r4                    0        1             6.37          7.59      0      0         4.5        4.5       100   ▇▁▁▁▁
## 10 r5                    0        1             9.80          9.73      0      4.5       4.5       14.5       100   ▇▁▁▁▁
## 11 r6                    0        1            13.0          11.0       0      4.5      14.5       14.5        84.5 ▇▂▁▁▁
## 12 r7                    0        1            15.5          11.6       0      4.5      14.5       24.5       100   ▇▃▁▁▁
## 13 r8                    0        1            13.9          11.3       0      4.5      14.5       24.5       100   ▇▃▁▁▁
## 14 r9                    0        1             8.95          9.44      0      4.5       4.5       14.5       100   ▇▁▁▁▁
## 15 r10                   0        1            16.9          15.7       0      4.5      14.5       24.5       100   ▇▃▁▁▁
## 16 Action                0        1             0.0797        0.271     0      0         0          0           1   ▇▁▁▁▁
## 17 Animation             0        1             0.0628        0.243     0      0         0          0           1   ▇▁▁▁▁
## 18 Comedy                0        1             0.294         0.455     0      0         0          1           1   ▇▁▁▁▃
## 19 Drama                 0        1             0.371         0.483     0      0         0          1           1   ▇▁▁▁▅
## 20 Documentary           0        1             0.0591        0.236     0      0         0          0           1   ▇▁▁▁▁
## 21 Romance               0        1             0.0807        0.272     0      0         0          0           1   ▇▁▁▁▁
## 22 Short                 0        1             0.161         0.367     0      0         0          0           1   ▇▁▁▁▂

ggplot(movies, aes(length)) +
  geom_histogram(color = "white") + 
  labs(x = "Length of movie (minutes)", y = "Frequency")
ggplot(movies, aes(length)) +
  geom_histogram(color = "white") + 
  labs(x = "Length of movie (minutes)", y = "Frequency") + 
  scale_x_log10()
movies %>% 
  filter(length < 180) %>% 
  ggplot(aes(length)) +
  geom_histogram(binwidth = 1, fill = "#795549", color = "black") + 
  labs(x = "Length of movie (minutes)", y = "Frequency")

Upon further exploration, you can find that the three movies well over 16 hours long are "Cure for Insomnia", "Four Stars", and "Longest Most Meaningless Movie in the World".
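
One way to carry out this kind of follow-up is sketched below in base R. The helper longer_than() is a hypothetical name introduced for this example. It assumes the data have a title column and a length column in minutes, as movies does. The lookup on the real data only runs if the ggplot2movies package is installed.

```r
# Return the rows of `data` whose `length` (in minutes) exceeds `hours`,
# sorted from longest to shortest
longer_than <- function(data, hours) {
  keep <- data[data$length > hours * 60, c("title", "length")]
  keep[order(-keep$length), ]
}

if (requireNamespace("ggplot2movies", quietly = TRUE)) {
  data(movies, package = "ggplot2movies")
  print(longer_than(movies, hours = 16))
}
```

Listing the extreme observations by name makes it easier to judge whether they are data errors or genuine but unusual cases.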

We can restrict our attention to films under 3 hours:

data(movies, package = "ggplot2movies")
skimr::skim(movies)

## ── Data Summary ────────────────────────
##                            Values
## Name                       movies
## Number of rows             58788 
## Number of columns          24    
## _______________________          
## Column type frequency:           
##   character                2     
##   numeric                  22    
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 title                 0             1     1   121     0    56007          0
## 2 mpaa                  0             1     0     5 53864        5          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##    skim_variable n_missing complete_rate          mean           sd    p0      p25       p50        p75        p100 hist 
##  1 year                  0        1          1976.           23.7    1893   1958      1983       1997        2005   ▁▁▃▃▇
##  2 length                0        1            82.3          44.3       1     74        90        100        5220   ▇▁▁▁▁
##  3 budget            53573        0.0887 13412513.     23350085.        0 250000   3000000   15000000   200000000   ▇▁▁▁▁
##  4 rating                0        1             5.93          1.55      1      5         6.1        7          10   ▁▃▇▆▁
##  5 votes                 0        1           632.         3830.        5     11        30        112      157608   ▇▁▁▁▁
##  6 r1                    0        1             7.01         10.9       0      0         4.5        4.5       100   ▇▁▁▁▁
##  7 r2                    0        1             4.02          5.96      0      0         4.5        4.5        84.5 ▇▁▁▁▁
##  8 r3                    0        1             4.72          6.45      0      0         4.5        4.5        84.5 ▇▁▁▁▁
##  9 r4                    0        1             6.37          7.59      0      0         4.5        4.5       100   ▇▁▁▁▁
## 10 r5                    0        1             9.80          9.73      0      4.5       4.5       14.5       100   ▇▁▁▁▁
## 11 r6                    0        1            13.0          11.0       0      4.5      14.5       14.5        84.5 ▇▂▁▁▁
## 12 r7                    0        1            15.5          11.6       0      4.5      14.5       24.5       100   ▇▃▁▁▁
## 13 r8                    0        1            13.9          11.3       0      4.5      14.5       24.5       100   ▇▃▁▁▁
## 14 r9                    0        1             8.95          9.44      0      4.5       4.5       14.5       100   ▇▁▁▁▁
## 15 r10                   0        1            16.9          15.7       0      4.5      14.5       24.5       100   ▇▃▁▁▁
## 16 Action                0        1             0.0797        0.271     0      0         0          0           1   ▇▁▁▁▁
## 17 Animation             0        1             0.0628        0.243     0      0         0          0           1   ▇▁▁▁▁
## 18 Comedy                0        1             0.294         0.455     0      0         0          1           1   ▇▁▁▁▃
## 19 Drama                 0        1             0.371         0.483     0      0         0          1           1   ▇▁▁▁▅
## 20 Documentary           0        1             0.0591        0.236     0      0         0          0           1   ▇▁▁▁▁
## 21 Romance               0        1             0.0807        0.272     0      0         0          0           1   ▇▁▁▁▁
## 22 Short                 0        1             0.161         0.367     0      0         0          0           1   ▇▁▁▁▂

ggplot(movies, aes(length)) +
  geom_histogram(color = "white") + 
  labs(x = "Length of movie (minutes)", y = "Frequency")
ggplot(movies, aes(length)) +
  geom_histogram(color = "white") + 
  labs(x = "Length of movie (minutes)", y = "Frequency") + 
  scale_x_log10()
movies %>% 
  filter(length < 180) %>% 
  ggplot(aes(length)) +
  geom_histogram(binwidth = 1, fill = "#795549", color = "black") + 
  labs(x = "Length of movie (minutes)", y = "Frequency")

9/25

Case study 5 Movie length

Upon further exploration, you can find that the three movies well over 16 hours long are "Cure for Insomnia", "Four Stars", and "Longest Most Meaningless Movie in the World".

We can restrict our attention to films under 3 hours:

Notice that there are peaks at particular lengths. Why do you think so?

library(tidyverse) # provides ggplot2, dplyr and the %>% pipe used below
data(movies, package = "ggplot2movies")
skimr::skim(movies)

## ── Data Summary ────────────────────────
##                            Values
## Name                       movies
## Number of rows             58788 
## Number of columns          24    
## _______________________          
## Column type frequency:           
##   character                2     
##   numeric                  22    
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 title                 0             1     1   121     0    56007          0
## 2 mpaa                  0             1     0     5 53864        5          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##    skim_variable n_missing complete_rate          mean           sd    p0      p25       p50        p75        p100 hist 
##  1 year                  0        1          1976.           23.7    1893   1958      1983       1997        2005   ▁▁▃▃▇
##  2 length                0        1            82.3          44.3       1     74        90        100        5220   ▇▁▁▁▁
##  3 budget            53573        0.0887 13412513.     23350085.        0 250000   3000000   15000000   200000000   ▇▁▁▁▁
##  4 rating                0        1             5.93          1.55      1      5         6.1        7          10   ▁▃▇▆▁
##  5 votes                 0        1           632.         3830.        5     11        30        112      157608   ▇▁▁▁▁
##  6 r1                    0        1             7.01         10.9       0      0         4.5        4.5       100   ▇▁▁▁▁
##  7 r2                    0        1             4.02          5.96      0      0         4.5        4.5        84.5 ▇▁▁▁▁
##  8 r3                    0        1             4.72          6.45      0      0         4.5        4.5        84.5 ▇▁▁▁▁
##  9 r4                    0        1             6.37          7.59      0      0         4.5        4.5       100   ▇▁▁▁▁
## 10 r5                    0        1             9.80          9.73      0      4.5       4.5       14.5       100   ▇▁▁▁▁
## 11 r6                    0        1            13.0          11.0       0      4.5      14.5       14.5        84.5 ▇▂▁▁▁
## 12 r7                    0        1            15.5          11.6       0      4.5      14.5       24.5       100   ▇▃▁▁▁
## 13 r8                    0        1            13.9          11.3       0      4.5      14.5       24.5       100   ▇▃▁▁▁
## 14 r9                    0        1             8.95          9.44      0      4.5       4.5       14.5       100   ▇▁▁▁▁
## 15 r10                   0        1            16.9          15.7       0      4.5      14.5       24.5       100   ▇▃▁▁▁
## 16 Action                0        1             0.0797        0.271     0      0         0          0           1   ▇▁▁▁▁
## 17 Animation             0        1             0.0628        0.243     0      0         0          0           1   ▇▁▁▁▁
## 18 Comedy                0        1             0.294         0.455     0      0         0          1           1   ▇▁▁▁▃
## 19 Drama                 0        1             0.371         0.483     0      0         0          1           1   ▇▁▁▁▅
## 20 Documentary           0        1             0.0591        0.236     0      0         0          0           1   ▇▁▁▁▁
## 21 Romance               0        1             0.0807        0.272     0      0         0          0           1   ▇▁▁▁▁
## 22 Short                 0        1             0.161         0.367     0      0         0          0           1   ▇▁▁▁▂

# Default binning (30 bins): the few extreme lengths squash almost every film into one bar
ggplot(movies, aes(length)) +
  geom_histogram(color = "white") + 
  labs(x = "Length of movie (minutes)", y = "Frequency")
# A log scale on the x-axis spreads out the long right tail
ggplot(movies, aes(length)) +
  geom_histogram(color = "white") + 
  labs(x = "Length of movie (minutes)", y = "Frequency") + 
  scale_x_log10()
# Films under 3 hours with 1-minute bins
movies %>% 
  filter(length < 180) %>% 
  ggplot(aes(length)) +
  geom_histogram(binwidth = 1, fill = "#795549", color = "black") + 
  labs(x = "Length of movie (minutes)", y = "Frequency")
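
One way to investigate the peaks is to tabulate the most common lengths. The sketch below is a minimal illustration and not part of the original slides: `top_values()` is a hypothetical helper, the toy vector is made up to show its behaviour, and the last step assumes the ggplot2movies package is installed.

```r
library(dplyr)

# Hypothetical helper (not from the slides): the k most frequent values of x
top_values <- function(x, k = 5) {
  tibble(value = x) %>% 
    count(value, sort = TRUE) %>% 
    head(k)
}

# Made-up toy data: 90 occurs three times and 7 occurs twice
toy <- c(90, 90, 90, 7, 7, 95, 100)
top_values(toy, k = 2)

# The same tabulation for the films under 3 hours
if (requireNamespace("ggplot2movies", quietly = TRUE)) {
  data(movies, package = "ggplot2movies")
  print(top_values(movies$length[movies$length < 180]))
}
```

The lengths that dominate this table are natural starting points for explaining the peaks in the histogram.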

9/25

Categorical variables

This lecture is based on Chapter 4 of Unwin (2015), Graphical Data Analysis with R.

10/25

There are two types of categorical variables

Nominal, where there is no intrinsic ordering to the categories.
E.g. blue, grey, black, white.

Ordinal, where there is a clear order to the categories.
E.g. strongly disagree, disagree, neutral, agree, strongly agree.
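
The distinction can be made explicit in R. The sketch below is an illustration rather than part of the original slides; the variable names `colour`, `agreement` and `response` are made up for the example.

```r
# Nominal: the levels have no intrinsic order
colour <- factor(c("blue", "grey", "black", "white"))
is.ordered(colour)

# Ordinal: declare the order with levels = and ordered = TRUE
agreement <- c("Strongly disagree", "Disagree", "Neutral", 
               "Agree", "Strongly agree")
response <- factor(c("Agree", "Neutral", "Strongly agree"), 
                   levels = agreement, ordered = TRUE)

# Comparisons and max() respect the declared order
response < "Agree"
max(response)
```

For an unordered factor, the comparison `<` is not meaningful and R returns `NA` with a warning.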

11/25

Categorical variables in R

In R, categorical variables may be encoded as factors.

data <- c(2, 2, 1, 1, 3, 3, 3, 1)
factor(data)

## [1] 2 2 1 1 3 3 3 1
## Levels: 1 2 3

You can easily change the labels of the levels:

factor(data, labels = c("I", "II", "III"))

## [1] II  II  I   I   III III III I  
## Levels: I II III

The order of the factor levels is determined by the input:

# numeric input: levels are sorted in increasing order
factor(c(1, 3, 10))

## [1] 1  3  10
## Levels: 1 3 10

# character input: levels are sorted alphabetically
factor(c("1", "3", "10"))

## [1] 1  3  10
## Levels: 1 10 3

# you can specify the order of the levels explicitly
factor(c("1", "3", "10"),  
       levels = c("1", "3", "10"))

## [1] 1  3  10
## Levels: 1 3 10

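
When there are many numeric-looking strings, typing the levels by hand is tedious. A minimal sketch of one alternative follows; it is not from the original slides and assumes every string parses cleanly as a number whose printed form matches the original string.

```r
x <- c("1", "3", "10", "3")

# Default: alphabetical levels, so "10" comes before "3"
levels(factor(x))

# Derive the levels from the numeric values instead of typing them by hand
numeric_order <- as.character(sort(unique(as.numeric(x))))
f <- factor(x, levels = numeric_order)
levels(f)
```
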
12/25

Numerical factors in R

x <- factor(c(10, 20, 30, 10, 20))
mean(x)

## Warning in mean.default(x): argument is not numeric or logical: returning NA

## [1] NA

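The as.numeric function returns the internal integer codes of the factor, not the original values:

mean(as.numeric(x))

## [1] 1.8

You probably want to use one of:

# look up each level label by its code, then convert to numeric
mean(as.numeric(levels(x)[x]))

## [1] 18

# or convert to character first, then to numeric
mean(as.numeric(as.character(x)))

## [1] 18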
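
To see why the two results differ, it can help to print the pieces separately. The sketch below is an illustration only; `factor_to_numeric()` is a hypothetical helper name, not a function from base R or the slides.

```r
x <- factor(c(10, 20, 30, 10, 20))

as.integer(x)   # internal codes: position of each value among the levels
levels(x)       # the labels, stored as character strings
levels(x)[x]    # labels looked up by code

# Hypothetical helper wrapping the conversion
factor_to_numeric <- function(f) as.numeric(levels(f))[f]
mean(factor_to_numeric(x))
```

The helper converts the (few) levels once and then indexes by code, which is the idiom suggested in the `?factor` help page.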

13/25

Revisiting Case study 1 2019 Australian Federal Election

df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"), 
                skip = 1,
                col_types = cols(
                      .default = col_character(),
                      OrdinaryVotes = col_double(),
                      AbsentVotes = col_double(),
                      ProvisionalVotes = col_double(),
                      PrePollVotes = col_double(),
                      PostalVotes = col_double(),
                      TotalVotes = col_double(),
                      Swing = col_double()))
# One row per division: the Greens' first preference votes as a percentage of all votes
tdf3 <- df1 %>% 
  group_by(DivisionID) %>% 
  summarise(DivisionNm = unique(DivisionNm),
            State = unique(StateAb),
            votes_GRN = TotalVotes[which(PartyAb=="GRN")],
            votes_total = sum(TotalVotes)) %>% 
  mutate(perc_GRN = votes_GRN / votes_total * 100)

# States in their default (alphabetical) order
tdf3 %>% 
  ggplot(aes(perc_GRN, State)) +
  ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
  labs(x = "Percentage of first preference votes per division",
       y = "State", 
       title = "First preference votes for the Greens party")
# States reordered by their median percentage (the fct_reorder default)
tdf3 %>% 
  mutate(State = fct_reorder(State, perc_GRN)) %>% 
  ggplot(aes(perc_GRN, State)) +
  ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
  labs(x = "Percentage of first preference votes per division",
       y = "State", 
       title = "First preference votes for the Greens party")

14/25

Order nominal variables meaningfully

Coding tip: use the functions below to easily change the order of factor levels:

stats::reorder(factor, value, mean)
forcats::fct_reorder(factor, value, median)
forcats::fct_reorder2(factor, value1, value2, func)
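
A small sketch of how these reordering functions behave, using made-up data (two hypothetical divisions per state; the percentages are invented for illustration and are not election results):

```r
library(forcats)

# Made-up data: two divisions per state
state <- factor(c("VIC", "NSW", "TAS", "VIC", "NSW", "TAS"))
perc  <- c(10, 12, 8, 11, 14, 9)

levels(state)                                   # default: alphabetical
levels(reorder(state, perc, FUN = mean))        # increasing mean
levels(fct_reorder(state, perc))                # increasing median (the default .fun)
levels(fct_reorder(state, perc, .desc = TRUE))  # decreasing median
```

Only the level order changes; the values stored in the factor stay the same, so plots pick up the new order without further work.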

15/25

Case study 6 Aspirin use after heart attack

Meta-analysis is a statistical analysis that combines the results of multiple scientific studies.
These data come from studies of aspirin use for preventing death after myocardial infarction or, in plain terms, a heart attack.
The ISIS-2 study has more patients than all the other studies combined.
You could consider lumping the categories with low frequencies together.

data("Fleiss93", package = "meta")
df6 <- Fleiss93 %>% 
  mutate(total = n.e + n.c)
skimr::skim(df6)

## ── Data Summary ────────────────────────
##                            Values
## Name                       df6   
## Number of rows             7     
## Number of columns          7     
## _______________________          
## Column type frequency:           
##   character                1     
##   numeric                  6     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 study                 0             1     3     6     0        7          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean      sd    p0    p25   p50   p75  p100 hist 
## 1 year                  0             1 1979.    4.39  1974 1978.   1979 1980   1988 ▇▇▇▁▃
## 2 event.e               0             1  304   563.      32   46.5    85  174   1570 ▇▁▁▁▁
## 3 n.e                   0             1 2027. 2959.     317  686.    810 1550.  8587 ▇▂▁▁▂
## 4 event.c               0             1  327.  618.      38   58      67  172.  1720 ▇▁▁▁▁
## 5 n.c                   0             1 1974. 2993.     309  515     771 1554.  8600 ▇▂▁▁▂
## 6 total                 0             1 4000. 5950.     626 1228.   1529 3103  17187 ▇▂▁▁▂

# Studies ordered from largest to smallest total sample size
df6 %>% 
  mutate(study = fct_reorder(study, desc(total))) %>% 
  ggplot(aes(study, total)) + 
  geom_col() + 
  labs(x = "", y = "Frequency") + 
  guides(x = guide_axis(n.dodge = 2))
# Studies with fewer than 2000 patients lumped into "Other" (geom_col stacks their totals)
df6 %>% 
  mutate(study = ifelse(total < 2000, "Other", study),
         study = fct_reorder(study, desc(total))) %>% 
  ggplot(aes(study, total)) + 
  geom_col() + 
  labs(x = "", y = "Frequency")

Fleiss JL (1993), The statistical basis of meta-analysis, Statistical Methods in Medical Research, 2, 121–145.
Balduzzi S, Rücker G, Schwarzer G (2019), How to perform a meta-analysis with R: a practical tutorial, Evidence-Based Mental Health.

16/25

Consider combining factor levels with low frequencies

Coding tip: the following family of functions helps to easily lump factor levels together:

forcats::fct_lump()
forcats::fct_lump_lowfreq()
forcats::fct_lump_min()
forcats::fct_lump_n()
forcats::fct_lump_prop()
# if conditioned on another variable
ifelse(cond, "Other", factor)
dplyr::case_when(cond1 ~ "level1",
                 cond2 ~ "level2",
                 TRUE ~ "Other")
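
A brief sketch of two of these functions on a made-up factor, plus the conditional version with `ifelse()`. The data, the vector names and the threshold of 2000 are illustrative assumptions, not taken from the aspirin data.

```r
library(forcats)

# Made-up factor: A appears 4 times, B 3 times, and C, D, E once each
study <- factor(c("A", "A", "A", "A", "B", "B", "B", "C", "D", "E"))

# Keep the 2 most frequent levels and lump the rest into "Other"
table(fct_lump_n(study, n = 2))

# Keep the levels that appear at least 3 times
table(fct_lump_min(study, min = 3))

# Lump conditioned on another variable (made-up totals)
size  <- c(A = 500, B = 3000, C = 17000)
label <- ifelse(size < 2000, "Other", names(size))
label
```

The `fct_lump_*()` functions decide from the level frequencies themselves, whereas `ifelse()` or `case_when()` let another column, such as a sample size, drive the decision.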

17/25

Case study 7 Anorexia

Treatment	Frequency
CBT	29
Cont	26
FT	17

Table or plot?

Use a table for accuracy and a plot for visual communication.

Why not a point or line?

This can be appropriate depending on what you want to communicate.
A barplot occupies more area than a point, and the area does a better job of communicating size.
A line is suggestive of a trend, which can mislead when the categories have no natural order.
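
One way to combine the accuracy of a table with the visual impact of a plot is to print each count on its bar. This is a sketch rather than part of the original slides, and it assumes ggplot2 version 3.3.0 or later for `after_stat()`.

```r
library(ggplot2)
data(anorexia, package = "MASS")

p <- ggplot(anorexia, aes(Treat)) + 
  geom_bar() + 
  # print each count just above its bar
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) + 
  labs(x = "", y = "Frequency")
p
```
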

data(anorexia, package = "MASS")
# Frequency table of the treatment groups (the table shown above)
df9tab <- table(anorexia$Treat) %>% 
  as.data.frame() %>% 
  rename(Treatment = Var1, Frequency = Freq)
skimr::skim(anorexia)

## ── Data Summary ────────────────────────
##                            Values  
## Name                       anorexia
## Number of rows             72      
## Number of columns          3       
## _______________________            
## Column type frequency:             
##   factor                   1       
##   numeric                  2       
## ________________________           
## Group variables            None    
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts              
## 1 Treat                 0             1 FALSE          3 CBT: 29, Con: 26, FT: 17
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 Prewt                 0             1  82.4  5.18  70    79.6  82.3  86    94.9 ▂▅▇▆▁
## 2 Postwt                0             1  85.2  8.04  71.3  79.3  84.1  91.6 104.  ▆▇▅▆▂

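# Barplot of the treatment frequencies
ggplot(anorexia, aes(Treat)) + 
  geom_bar() + 
  labs(x = "", y = "Frequency")

# The same counts drawn as points joined by a line
ggplot(anorexia, aes(Treat)) + 
  stat_count(geom = "point", size = 4) +
  stat_count(geom = "line", group = 1) +
  labs(y = "Frequency", x = "")

Hand, D. J., Daly, F., McConway, K., Lunn, D. and Ostrowski, E. eds (1993) A Handbook of Small Data Sets. Chapman & Hall, Data set 285 (p. 229)
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0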

18/25

Case study 8 Titanic

What do the graphs for each categorical variable tell us?

There were more crew members than passengers in any single class (1st, 2nd or 3rd).
There were far more males on the ship, possibly because the majority of crew members were male. You can explore this further by constructing two-way tables or graphs that consider both variables.
Most people on board were adults.
More than two-thirds of the people on board died.

df9 <- as_tibble(Titanic)
skimr::skim(df9)

## ── Data Summary ────────────────────────
##                            Values
## Name                       df9   
## Number of rows             32    
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   character                4     
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 Class                 0             1     3     4     0        4          0
## 2 Sex                   0             1     4     6     0        2          0
## 3 Age                   0             1     5     5     0        2          0
## 4 Survived              0             1     2     3     0        2          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 n                     0             1  68.8  136.     0  0.75  13.5    77   670 ▇▁▁▁▁

# One barplot per categorical variable: total the counts (n) over the other variables
df9 %>% 
  group_by(Class) %>% 
  summarise(total = sum(n)) %>% 
  ggplot(aes(Class, total)) + 
  geom_col(fill = "#ee64a4") + 
  labs(x = "", y = "Frequency") 
df9 %>% 
  group_by(Sex) %>% 
  summarise(total = sum(n)) %>% 
  ggplot(aes(Sex, total)) + 
  geom_col(fill = "#746FB2") + 
  labs(x = "", y = "Frequency") 
df9 %>% 
  group_by(Age) %>% 
  summarise(total = sum(n)) %>% 
  ggplot(aes(Age, total)) + 
  geom_col(fill = "#C8008F") + 
  labs(x = "", y = "Frequency") 
df9 %>% 
  group_by(Survived) %>% 
  summarise(total = sum(n)) %>% 
  ggplot(aes(Survived, total)) + 
  geom_col(fill = "#795549") + 
  labs(x = "Survived", y = "Frequency")
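
The two-way table suggested above can be sketched directly from the `Titanic` table that ships with base R. The name `class_by_sex` is made up for this example.

```r
# Titanic is a 4-way table: Class x Sex x Age x Survived
# Sum over Age and Survived to get Class by Sex
class_by_sex <- margin.table(Titanic, margin = c(1, 2))
class_by_sex

# Proportion of each sex within each class (rows sum to 1)
round(prop.table(class_by_sex, margin = 1), 2)
```

The crew row shows how heavily male the crew was, which supports the explanation given for the overall imbalance.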

British Board of Trade (1990), Report on the Loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing

19/25

Coloring bars

Colour here doesn't add information, as the x-axis already tells us about the categories, but colouring the bars can make the plot more visually appealing.
If you have too many categories, colour won't work well to differentiate them.
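
A minimal sketch of colouring bars by category, using the class totals from the base R `Titanic` table. It is an illustration rather than the code behind the original figure, and `class_totals` is a made-up name.

```r
library(ggplot2)

# Class totals from the Titanic table; the columns are Class and Freq
class_totals <- as.data.frame(margin.table(Titanic, margin = 1))

p <- ggplot(class_totals, aes(Class, Freq, fill = Class)) + 
  # the x-axis already names the classes, so the legend is dropped
  geom_col(show.legend = FALSE) + 
  labs(x = "", y = "Frequency")
p
```
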

20/25

Case study 9 Opinion poll in Ireland Aug 2013

Pie charts are popular in mainstream media but are not generally recommended, as people are generally poor at comparing angles.
3D pie charts should definitely be avoided!
Here you can see that many people are "Undecided" about which political party to support, and failing to account for them paints a different picture.

df9 <- tibble(party = c("Fine Gael", "Labour", "Fianna Fail",
                         "Sinn Fein", "Indeps", "Green", "Undecided"),
               nos = c(181, 51, 171, 119, 91, 4, 368)) 
df9v2 <- df9 %>% filter(party != "Undecided")
df9

## # A tibble: 7 x 2
##   party         nos
##   <chr>       <dbl>
## 1 Fine Gael     181
## 2 Labour         51
## 3 Fianna Fail   171
## 4 Sinn Fein     119
## 5 Indeps         91
## 6 Green           4
## 7 Undecided     368

library(colorspace) # for scale_fill_discrete_qualitative() and qualitative_hcl()
# Pie chart including the "Undecided" respondents
g9 <- df9 %>% 
  ggplot(aes("", nos, fill = party)) + 
  geom_col(color = "black") + 
  labs(y = "", x = "") + 
  coord_polar("y") +
  theme(axis.line = element_blank(),
        axis.line.y = element_blank(),
        axis.text = element_blank(),
        panel.grid.major = element_blank()) +
  scale_fill_discrete_qualitative(name = "Party")
g9
# The same pie chart with "Undecided" removed
g9 %+% df9v2 + 
  # below is needed to keep the same color scheme as before
  scale_fill_manual(values = qualitative_hcl(7)[1:6])
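
The effect of dropping "Undecided" can also be checked numerically. The sketch below recomputes each party's share with and without the undecided respondents; the names `poll`, `poll_perc`, `perc_all` and `perc_decided` are made up for this example.

```r
library(dplyr)

poll <- tibble(party = c("Fine Gael", "Labour", "Fianna Fail",
                         "Sinn Fein", "Indeps", "Green", "Undecided"),
               nos = c(181, 51, 171, 119, 91, 4, 368))

poll_perc <- poll %>% 
  mutate(decided = party != "Undecided",
         # share of all respondents
         perc_all = 100 * nos / sum(nos),
         # share of the respondents who named a party
         perc_decided = ifelse(decided, 100 * nos / sum(nos[decided]), NA))
poll_perc
```

Fine Gael, for example, has about 18% of all respondents but about 29% of those who named a party.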

21/25

A pie chart is just a stacked barplot with a transformed coordinate system

# a single stacked bar
df <- data.frame(var = c("A", "B", "C"), perc = c(40, 40, 20))
g <- ggplot(df, aes("", perc, fill = var)) + 
  geom_col()
g

# the same bar in polar coordinates, with y mapped to the angle
g + coord_polar("y")

22/25

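A rose plot is just a barplot with a transformed coordinate system

# simulated counts for 20 categories (random, so your plot will differ)
dummy <- data.frame(var = LETTERS[1:20], 
                    n = round(rexp(20, 1/100)))
g <- ggplot(dummy, aes(var, n)) + geom_col(fill = "pink", color = "black")
g

# the same bars in polar coordinates, with x mapped to the angle
g + coord_polar("x") + theme_void()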

23/25

Take away messages

Again, be prepared to do multiple plots.
Changing the bins or bandwidth in histograms, violin plots or density plots can paint a different picture.
Consider different representations of categorical variables (reordering meaningfully, lumping low frequencies together, plot or table, pie chart or barplot, missing categories).
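
As a reminder of the bandwidth point, here is a minimal base R sketch. It uses the `faithful` dataset that ships with R as a stand-in (it is not one of the case studies in these slides), and the `adjust` values 0.25 and 4 are arbitrary choices for illustration.

```r
# Old Faithful eruption durations, a dataset that ships with base R
x <- faithful$eruptions

# adjust multiplies the default bandwidth
d_narrow <- density(x, adjust = 0.25)
d_wide   <- density(x, adjust = 4)

# Overlay the two estimates of the same data
plot(d_narrow, main = "Same data, two bandwidths")
lines(d_wide, lty = 2)
```

The wide bandwidth tends to smooth over structure that the narrow one shows, so it is worth trying several values before settling on one.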

24/25

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
