ETC5521: Exploratory Data Analysis


Working with a single variable, making transformations, detecting outliers, using robust statistics

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 4 - Session 2



Bins and Bandwidths


Case study 3 Boston housing data Part 1/4

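library(tidyverse)  # provides read_tsv(), ggplot() and the pipe used below
data(bostonc, package = "DAAG")
# skip the first 9 elements of bostonc; the code reads the table from element 10 onwards
df3 <- read_tsv(bostonc[10:length(bostonc)])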
skimr::skim(df3)
## ── Data Summary ────────────────────────
## Values
## Name df3
## Number of rows 506
## Number of columns 21
## _______________________
## Column type frequency:
## character 2
## numeric 19
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 TOWN 0 1 4 23 0 92 0
## 2 TRACT 0 1 4 4 0 506 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 OBS. 0 1 254. 146. 1 127. 254. 380. 506 ▇▇▇▇▇
## 2 TOWN# 0 1 47.5 27.6 0 26.2 42 78 91 ▅▆▅▃▇
## 3 LON 0 1 -71.1 0.0754 -71.3 -71.1 -71.1 -71.0 -70.8 ▁▂▇▂▁
## 4 LAT 0 1 42.2 0.0618 42.0 42.2 42.2 42.3 42.4 ▁▃▇▃▁
## 5 MEDV 0 1 22.5 9.20 5 17.0 21.2 25 50 ▂▇▅▁▁
## 6 CMEDV 0 1 22.5 9.18 5 17.0 21.2 25 50 ▂▇▅▁▁
## 7 CRIM 0 1 3.61 8.60 0.00632 0.0820 0.257 3.68 89.0 ▇▁▁▁▁
## 8 ZN 0 1 11.4 23.3 0 0 0 12.5 100 ▇▁▁▁▁
## 9 INDUS 0 1 11.1 6.86 0.46 5.19 9.69 18.1 27.7 ▇▆▁▇▁
## 10 CHAS 0 1 0.0692 0.254 0 0 0 0 1 ▇▁▁▁▁
## 11 NOX 0 1 0.555 0.116 0.385 0.449 0.538 0.624 0.871 ▇▇▆▅▁
## 12 RM 0 1 6.28 0.703 3.56 5.89 6.21 6.62 8.78 ▁▂▇▂▁
## 13 AGE 0 1 68.6 28.1 2.9 45.0 77.5 94.1 100 ▂▂▂▃▇
## 14 DIS 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.1 ▇▅▂▁▁
## 15 RAD 0 1 9.55 8.71 1 4 5 24 24 ▇▂▁▁▃
## 16 TAX 0 1 408. 169. 187 279 330 666 711 ▇▇▃▁▇
## 17 PTRATIO 0 1 18.5 2.16 12.6 17.4 19.0 20.2 22 ▁▃▅▅▇
## 18 B 0 1 357. 91.3 0.32 375. 391. 396. 397. ▁▁▁▁▇
## 19 LSTAT 0 1 12.7 7.14 1.73 6.95 11.4 17.0 38.0 ▇▇▅▂▁
ggplot(df3, aes(MEDV)) +
geom_histogram(binwidth = 1, color = "black", fill = "#008A25") +
labs(x = "Median housing value (US$1000)", y = "Frequency")

Harrison, David, and Daniel L. Rubinfeld (1978) Hedonic Housing Prices and the Demand for Clean Air, Journal of Environmental Economics and Management 5 81-102. Original data.
Gilley, O.W. and R. Kelley Pace (1996) On the Harrison and Rubinfeld Data. Journal of Environmental Economics and Management 31 403-405. Provided corrections and examined censoring.
Maindonald, John H. and Braun, W. John (2020). DAAG: Data Analysis and Graphics Data and Functions. R package version 1.24


  • There is a large frequency in the final bin.
  • There is a decline in observations in the $40-49K range, as well as dips around $26K and $34K.
  • The histogram uses a bin width of 1 unit and is left-open (or right-closed): (4.5, 5.5], (5.5, 6.5] ... (49.5, 50.5].
  • Occasionally, whether the bins are left- or right-open can make a difference.
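
To see why the open side of a bin can matter, here is a minimal base R sketch. The values and breaks are made up for illustration (they are not the Boston data), and `cut()` stands in for the binning that `geom_histogram()` controls through its `closed` argument.

```r
# Hypothetical values, several of which sit exactly on a bin boundary
x <- c(5, 5.5, 5.5, 6.5, 7)
breaks <- c(4.5, 5.5, 6.5, 7.5)

# Left-open, right-closed bins: (4.5, 5.5], (5.5, 6.5], (6.5, 7.5]
right_closed <- table(cut(x, breaks, right = TRUE))

# Left-closed, right-open bins: [4.5, 5.5), [5.5, 6.5), [6.5, 7.5)
left_closed <- table(cut(x, breaks, right = FALSE))

right_closed  # counts 3, 1, 1
left_closed   # counts 1, 2, 2
```

The same five values give two different histograms because three of them fall exactly on a break.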

Case study 3 Boston housing data Part 2/4

  • Density plots depend on the bandwidth chosen and, more often than not, estimate the density poorly near boundaries.
  • There are various ways to present features of the data in a plot, and what works for one person may not be as straightforward for another.
  • Be prepared to do multiple plots!
ggplot(df3, aes(MEDV, y = "")) +
geom_boxplot(fill = "#008A25") +
labs(x = "Median housing value (US$1000)", y = "") +
theme(axis.line.y = element_blank())
ggplot(df3, aes(MEDV, y = "")) +
geom_jitter() +
labs(x = "Median housing value (US$1000)", y = "") +
theme(axis.line.y = element_blank())
ggplot(df3, aes(MEDV)) +
geom_density() +
geom_rug() +
labs(x = "Median housing value (US$1000)", y = "") +
theme(axis.line.y = element_blank())
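
The bandwidth and boundary issues can be made concrete with a small sketch. The data below are invented (values piled up at an upper cap of 50, loosely mimicking a censored variable), and `mass_above()` is a hypothetical helper, not part of any package. It approximates how much of the estimated density falls beyond the cap, where no data can occur.

```r
# Invented data: values from 20 to 49 plus ten observations stuck at a cap of 50
x <- c(seq(20, 49, by = 1), rep(50, 10))

# Kernel density estimates with a narrow and a wide Gaussian bandwidth
d_narrow <- density(x, bw = 0.5)
d_wide   <- density(x, bw = 5)

# Hypothetical helper: approximate area under the estimate to the right of `cap`
# (density() returns an equally spaced grid, so a Riemann sum is enough)
mass_above <- function(d, cap) {
  step <- diff(d$x)[1]
  sum(d$y[d$x > cap]) * step
}

mass_above(d_narrow, cap = 50)
mass_above(d_wide, cap = 50)
```

Both estimates place some probability above 50 even though no observation exceeds it, and the wider bandwidth leaks more.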

Case study 3 Boston housing data Part 3/4

ggplot(df3, aes(PTRATIO)) +
geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.2) +
labs(x = "Pupil-teacher ratio by town", y = "",
title = "Bin width = 0.2, Left-open")
ggplot(df3, aes(PTRATIO)) +
geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.5) +
labs(x = "Pupil-teacher ratio by town", y = "",
title = "Bin width = 0.5, Left-open")
ggplot(df3, aes(PTRATIO)) +
geom_histogram(fill = "#9651A0", color = "black", bins = 30) +
labs(x = "Pupil-teacher ratio by town", y = "",
title = "Bin number = 30, Left-open")
ggplot(df3, aes(PTRATIO)) +
geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.2, closed = "left") +
labs(x = "Pupil-teacher ratio by town", y = "",
title = "Bin width = 0.2, Right-open")
ggplot(df3, aes(PTRATIO)) +
geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.5, closed = "left") +
labs(x = "Pupil-teacher ratio by town", y = "",
title = "Bin width = 0.5, Right-open")
ggplot(df3, aes(PTRATIO)) +
geom_histogram(fill = "#9651A0", color = "black",
bins = 30, closed = "left") +
labs(x = "Pupil-teacher ratio by town", y = "",
title = "Bin number = 30, Right-open")

Case study 3 Boston housing data Part 4/4

  • CRIM: per capita crime rate by town
  • INDUS: proportion of non-retail business acres per town
  • NOX: nitrogen oxides concentration (parts per 10 million)
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built prior to 1940
  • DIS: weighted mean of distances to 5 Boston employment centres
  • RAD: index of accessibility to radial highways
  • TAX: full-value property tax rate per $10K
  • PTRATIO: pupil-teacher ratio by town
  • LSTAT: lower status of the population (%)
  • MEDV: median value of owner-occupied homes in $1000s
df3long <- df3 %>% pivot_longer(MEDV:LSTAT,
names_to = "var",
values_to = "value") %>%
filter(!var %in% c("CHAS", "B", "ZN"))
skimr::skim(df3long)
## ── Data Summary ────────────────────────
## Values
## Name df3long
## Number of rows 6072
## Number of columns 8
## _______________________
## Column type frequency:
## character 3
## numeric 5
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 TOWN 0 1 4 23 0 92 0
## 2 TRACT 0 1 4 4 0 506 0
## 3 var 0 1 2 7 0 12 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 OBS. 0 1 254. 146. 1 127 254. 380 506 ▇▇▇▇▇
## 2 TOWN# 0 1 47.5 27.5 0 26 42 78 91 ▅▆▅▃▇
## 3 LON 0 1 -71.1 0.0753 -71.3 -71.1 -71.1 -71.0 -70.8 ▁▂▇▂▁
## 4 LAT 0 1 42.2 0.0617 42.0 42.2 42.2 42.3 42.4 ▁▃▇▃▁
## 5 value 0 1 49.0 120. 0.00632 4 12.3 23.4 711 ▇▁▁▁▁
ggplot(df3long, aes(value)) +
geom_histogram(color = "white") +
facet_wrap(~var, scales = "free") +
labs(x = "", y = "") +
theme(axis.text = element_text(size = 12))

Case study 4 Hidalgo stamps thickness

  • A stamp collector, Walton von Winkle, bought several collections of Mexican stamps from 1872-1874 and measured the thickness of all of them.
  • Different bandwidths for the density plot suggest either two or seven modes.
load(here::here("data/Hidalgo1872.rda"))
skimr::skim(Hidalgo1872)
## ── Data Summary ────────────────────────
## Values
## Name Hidalgo1872
## Number of rows 485
## Number of columns 3
## _______________________
## Column type frequency:
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 thickness 0 1 0.0860 0.0150 0.06 0.075 0.08 0.098 0.131 ▅▇▃▂▁
## 2 thicknessA 195 0.598 0.0922 0.0162 0.068 0.0772 0.092 0.105 0.131 ▇▃▆▃▂
## 3 thicknessB 289 0.404 0.0768 0.00508 0.06 0.072 0.078 0.08 0.097 ▁▃▇▁▁
ggplot(Hidalgo1872, aes(thickness)) +
geom_histogram(binwidth = 0.001, aes(y = after_stat(density))) +
labs(x = "Thickness (0.001 mm)", y = "Density") +
# default bandwidth
geom_density(color = "#E16A86", linewidth = 2) +
# Sheather-Jones bandwidth
geom_density(color = "#00AD9A", linewidth = 2, bw = "SJ")
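
One way to make the "how many modes?" question less subjective is to count the local maxima of the density estimate under each bandwidth. The sketch below uses simulated data, not the stamp measurements, and `count_modes()` is a hypothetical helper written for this illustration.

```r
set.seed(2020)
# Simulated mixture of three groups (an assumption; not the Hidalgo data)
x <- c(rnorm(150, mean = 0.072, sd = 0.003),
       rnorm(150, mean = 0.080, sd = 0.003),
       rnorm(100, mean = 0.100, sd = 0.006))

# Hypothetical helper: number of local maxima of a kernel density estimate
count_modes <- function(x, bw) {
  y <- density(x, bw = bw)$y
  slope_sign <- sign(diff(y))
  slope_sign <- slope_sign[slope_sign != 0]  # ignore exactly flat steps
  sum(diff(slope_sign) < 0)                  # slope switches from rising to falling
}

count_modes(x, bw = "nrd0")  # default rule-of-thumb bandwidth
count_modes(x, bw = "SJ")    # Sheather-Jones bandwidth
```

If the two counts disagree, that is a cue to look at several bandwidths rather than trusting a single density plot.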

Focus


Case study 5 Movie length

data(movies, package = "ggplot2movies")
skimr::skim(movies)
## ── Data Summary ────────────────────────
## Values
## Name movies
## Number of rows 58788
## Number of columns 24
## _______________________
## Column type frequency:
## character 2
## numeric 22
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 title 0 1 1 121 0 56007 0
## 2 mpaa 0 1 0 5 53864 5 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 year 0 1 1976. 23.7 1893 1958 1983 1997 2005 ▁▁▃▃▇
## 2 length 0 1 82.3 44.3 1 74 90 100 5220 ▇▁▁▁▁
## 3 budget 53573 0.0887 13412513. 23350085. 0 250000 3000000 15000000 200000000 ▇▁▁▁▁
## 4 rating 0 1 5.93 1.55 1 5 6.1 7 10 ▁▃▇▆▁
## 5 votes 0 1 632. 3830. 5 11 30 112 157608 ▇▁▁▁▁
## 6 r1 0 1 7.01 10.9 0 0 4.5 4.5 100 ▇▁▁▁▁
## 7 r2 0 1 4.02 5.96 0 0 4.5 4.5 84.5 ▇▁▁▁▁
## 8 r3 0 1 4.72 6.45 0 0 4.5 4.5 84.5 ▇▁▁▁▁
## 9 r4 0 1 6.37 7.59 0 0 4.5 4.5 100 ▇▁▁▁▁
## 10 r5 0 1 9.80 9.73 0 4.5 4.5 14.5 100 ▇▁▁▁▁
## 11 r6 0 1 13.0 11.0 0 4.5 14.5 14.5 84.5 ▇▂▁▁▁
## 12 r7 0 1 15.5 11.6 0 4.5 14.5 24.5 100 ▇▃▁▁▁
## 13 r8 0 1 13.9 11.3 0 4.5 14.5 24.5 100 ▇▃▁▁▁
## 14 r9 0 1 8.95 9.44 0 4.5 4.5 14.5 100 ▇▁▁▁▁
## 15 r10 0 1 16.9 15.7 0 4.5 14.5 24.5 100 ▇▃▁▁▁
## 16 Action 0 1 0.0797 0.271 0 0 0 0 1 ▇▁▁▁▁
## 17 Animation 0 1 0.0628 0.243 0 0 0 0 1 ▇▁▁▁▁
## 18 Comedy 0 1 0.294 0.455 0 0 0 1 1 ▇▁▁▁▃
## 19 Drama 0 1 0.371 0.483 0 0 0 1 1 ▇▁▁▁▅
## 20 Documentary 0 1 0.0591 0.236 0 0 0 0 1 ▇▁▁▁▁
## 21 Romance 0 1 0.0807 0.272 0 0 0 0 1 ▇▁▁▁▁
## 22 Short 0 1 0.161 0.367 0 0 0 0 1 ▇▁▁▁▂
ggplot(movies, aes(length)) +
geom_histogram(color = "white") +
labs(x = "Length of movie (minutes)", y = "Frequency")
ggplot(movies, aes(length)) +
geom_histogram(color = "white") +
labs(x = "Length of movie (minutes)", y = "Frequency") +
scale_x_log10()
movies %>%
filter(length < 180) %>%
ggplot(aes(length)) +
geom_histogram(binwidth = 1, fill = "#795549", color = "black") +
labs(x = "Length of movie (minutes)", y = "Frequency")
  • Upon further exploration, you can find that the three movies well over 16 hours long are "Cure for Insomnia", "Four Stars", and "Longest Most Meaningless Movie in the World".
  • We can restrict our attention to films under 3 hours:

  • Notice that there are peaks at particular lengths. Why do you think that is?
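
A quick way to check which lengths cause the peaks is to tabulate the values and look at the most frequent ones. The vector below is a made-up stand-in for `movies$length`, so the counts are illustrative only.

```r
# Hypothetical lengths in minutes (a stand-in for movies$length)
length_min <- c(7, 7, 7, 85, 88, 90, 90, 90, 90, 92, 95, 100, 100, 120)

# Frequency of each distinct length, most common first
counts <- sort(table(length_min), decreasing = TRUE)
head(counts, 3)

# With the real data, the same idea would be:
# head(sort(table(movies$length), decreasing = TRUE))
```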

Categorical variables



This lecture is based on Chapter 4 of

Unwin (2015) Graphical Data Analysis with R


There are two types of categorical variables



Nominal where there is no intrinsic ordering to the categories
E.g. blue, grey, black, white.


Ordinal where there is a clear order to the categories.
E.g. Strongly disagree, disagree, neutral, agree, strongly agree.
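
In R, an ordinal variable can be stored as an ordered factor, which keeps the category order and allows comparisons. A minimal sketch with made-up survey responses:

```r
# Made-up responses on a five-point agreement scale
responses <- c("agree", "neutral", "strongly agree", "disagree", "agree")
likert_levels <- c("strongly disagree", "disagree", "neutral",
                   "agree", "strongly agree")

# ordered = TRUE records that the levels have a meaningful order
likert <- factor(responses, levels = likert_levels, ordered = TRUE)

likert > "neutral"          # TRUE FALSE TRUE FALSE TRUE
as.character(max(likert))   # "strongly agree"
table(likert)               # counts follow the level order, including empty levels
```

A nominal variable would use a plain `factor()` instead, for which `>` and `max()` are not meaningful.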


Categorical variables in R

  • In R, categorical variables may be encoded as factors.
    data <- c(2, 2, 1, 1, 3, 3, 3, 1)
    factor(data)
    ## [1] 2 2 1 1 3 3 3 1
    ## Levels: 1 2 3
  • You can easily change the labels of the levels:
    factor(data, labels = c("I", "II", "III"))
    ## [1] II II I I III III III I
    ## Levels: I II III
  • The order of the factor levels is determined by the input:
# numeric input: levels are sorted in increasing order
factor(c(1, 3, 10))
## [1] 1 3 10
## Levels: 1 3 10
# character input: levels are sorted alphabetically
factor(c("1", "3", "10"))
## [1] 1 3 10
## Levels: 1 10 3
# you can specify order of levels explicitly
factor(c("1", "3", "10"),
levels = c("1", "3", "10"))
## [1] 1 3 10
## Levels: 1 3 10

Numerical factors in R

x <- factor(c(10, 20, 30, 10, 20))
mean(x)
## Warning in mean.default(x): argument is not numeric or logical: returning NA
## [1] NA

The as.numeric function returns the internal integer codes of the factor, not the original values:

mean(as.numeric(x))
## [1] 1.8

You probably want to use:

mean(as.numeric(levels(x)[x]))
## [1] 18
mean(as.numeric(as.character(x)))
## [1] 18

Revisiting Case study 1 2019 Australian Federal Election

df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"),
skip = 1,
col_types = cols(
.default = col_character(),
OrdinaryVotes = col_double(),
AbsentVotes = col_double(),
ProvisionalVotes = col_double(),
PrePollVotes = col_double(),
PostalVotes = col_double(),
TotalVotes = col_double(),
Swing = col_double()))
tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
tdf3 %>%
ggplot(aes(perc_GRN, State)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
labs(x = "Percentage of first preference votes per division",
y = "State",
title = "First preference votes for the Greens party")
tdf3 %>%
mutate(State = fct_reorder(State, perc_GRN)) %>%
ggplot(aes(perc_GRN, State)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
labs(x = "Percentage of first preference votes per division",
y = "State",
title = "First preference votes for the Greens party")

Order nominal variables meaningfully

Coding tip: use the functions below to easily change the order of factor levels

stats::reorder(factor, value, mean)
forcats::fct_reorder(factor, value, median)
forcats::fct_reorder2(factor, value1, value2, func)
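
As a minimal sketch of the first function, the toy data frame below (invented groups and values) is reordered by the median of `value` within each group using base R only.

```r
# Invented data: three groups with two values each
df <- data.frame(state = factor(c("A", "A", "B", "B", "C", "C")),
                 value = c(5, 7, 1, 2, 9, 10))

levels(df$state)  # default order is alphabetical: "A" "B" "C"

# Reorder the levels by the group median of value (smallest first)
df$state_ordered <- reorder(df$state, df$value, FUN = median)
levels(df$state_ordered)  # "B" "A" "C"

# The forcats analogue would be:
# forcats::fct_reorder(df$state, df$value, .fun = median)
```

Plots that map `state_ordered` to an axis will then show the groups from lowest to highest median.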

Case study 6 Aspirin use after heart attack

  • Meta-analysis is a statistical analysis that combines the results of multiple scientific studies.
  • These data come from studies of aspirin use for preventing death after myocardial infarction, or in plain terms, a heart attack.
  • The ISIS-2 study has more patients than all other studies combined.
  • You could consider lumping the categories with low frequencies together.
data("Fleiss93", package = "meta")
df6 <- Fleiss93 %>%
mutate(total = n.e + n.c)
skimr::skim(df6)
## ── Data Summary ────────────────────────
## Values
## Name df6
## Number of rows 7
## Number of columns 7
## _______________________
## Column type frequency:
## character 1
## numeric 6
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 study 0 1 3 6 0 7 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 year 0 1 1979. 4.39 1974 1978. 1979 1980 1988 ▇▇▇▁▃
## 2 event.e 0 1 304 563. 32 46.5 85 174 1570 ▇▁▁▁▁
## 3 n.e 0 1 2027. 2959. 317 686. 810 1550. 8587 ▇▂▁▁▂
## 4 event.c 0 1 327. 618. 38 58 67 172. 1720 ▇▁▁▁▁
## 5 n.c 0 1 1974. 2993. 309 515 771 1554. 8600 ▇▂▁▁▂
## 6 total 0 1 4000. 5950. 626 1228. 1529 3103 17187 ▇▂▁▁▂
df6 %>%
  mutate(study = fct_reorder(study, desc(total))) %>%
  ggplot(aes(study, total)) +
  geom_col() +
  labs(x = "", y = "Frequency") +
  guides(x = guide_axis(n.dodge = 2))
df6 %>%
  mutate(study = ifelse(total < 2000, "Other", study),
         study = fct_reorder(study, desc(total))) %>%
  ggplot(aes(study, total)) +
  geom_col() +
  labs(x = "", y = "Frequency")
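The ifelse() lumping above can also be expressed with forcats. The following is a minimal sketch, assuming a made-up data frame toy whose totals are hypothetical (they are not the Fleiss93 values). It uses the w argument of fct_lump_min() so that the threshold applies to patient totals rather than to row counts.

```r
library(forcats)

# Hypothetical study sizes, made up for illustration
toy <- data.frame(
  study = c("A", "B", "C", "D", "E"),
  total = c(17000, 2500, 1500, 1200, 600)
)

# Lump every study whose weighted count (total) is below 2000 into "Other"
toy$study_lumped <- fct_lump_min(toy$study, min = 2000, w = toy$total)

# One total per (possibly lumped) level
lumped_totals <- tapply(toy$total, toy$study_lumped, sum)
lumped_totals
```

The "Other" level is placed last, which is usually where a catch-all category belongs in a barplot.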

Fleiss JL (1993): The statistical basis of meta-analysis. Statistical Methods in Medical Research 2 121–145
Balduzzi S, Rücker G, Schwarzer G (2019), How to perform a meta-analysis with R: a practical tutorial, Evidence-Based Mental Health.

16/25

Consider combining factor levels with low frequencies

Coding tip: the following family of functions helps to lump factor levels together easily:

forcats::fct_lump()
forcats::fct_lump_lowfreq()
forcats::fct_lump_min()
forcats::fct_lump_n()
forcats::fct_lump_prop()
# if conditioned on another variable
# (as.character() stops ifelse() from returning a factor's integer codes)
ifelse(cond, "Other", as.character(factor))
dplyr::case_when(cond1 ~ "level1",
                 cond2 ~ "level2",
                 TRUE ~ "Other")
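A short sketch of two members of this family, using a hypothetical factor x made up for illustration. fct_lump_n() keeps a fixed number of the most frequent levels, while fct_lump_prop() keeps levels whose share of observations exceeds a threshold.

```r
library(forcats)

# Hypothetical factor with two frequent and three rare levels
x <- factor(c("a", "a", "a", "a", "b", "b", "b", "c", "d", "e"))

# Keep the 2 most frequent levels; the rest become "Other"
top2 <- fct_lump_n(x, n = 2)
table(top2)

# Keep levels whose share of observations exceeds 20%
by_prop <- fct_lump_prop(x, prop = 0.2)
table(by_prop)
```

Both calls give the same result here because the two frequent levels are also the only ones above the 20% threshold; on real data the two rules can disagree.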
17/25

Case study 7 Anorexia

Treatment  Frequency
CBT               29
Cont              26
FT                17

Table or Plot?

  • Table for accuracy, plot for visual communication

Why not a point or line?

  • A point or line can be appropriate, depending on what you want to communicate.
  • A bar occupies more area than a point, and area does a better job of communicating size.
  • A line is suggestive of a trend, which can mislead when the categories have no natural order.
data(anorexia, package = "MASS")
df9tab <- table(anorexia$Treat) %>%
  as.data.frame() %>%
  rename(Treatment = Var1, Frequency = Freq)
skimr::skim(anorexia)
## ── Data Summary ────────────────────────
## Values
## Name anorexia
## Number of rows 72
## Number of columns 3
## _______________________
## Column type frequency:
## factor 1
## numeric 2
## ________________________
## Group variables None
##
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 Treat 0 1 FALSE 3 CBT: 29, Con: 26, FT: 17
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Prewt 0 1 82.4 5.18 70 79.6 82.3 86 94.9 ▂▅▇▆▁
## 2 Postwt 0 1 85.2 8.04 71.3 79.3 84.1 91.6 104. ▆▇▅▆▂
ggplot(anorexia, aes(Treat)) +
  geom_bar() +
  labs(x = "", y = "Frequency")
ggplot(anorexia, aes(Treat)) +
  stat_count(geom = "point", size = 4) +
  stat_count(geom = "line", group = 1) +
  labs(y = "Frequency", x = "")
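The "table for accuracy" point can be sketched in base R. This is only an illustration: it tabulates the exact counts per treatment and then the same information as percentages, which are hard to read precisely off a bar.

```r
# anorexia ships with the MASS package
data(anorexia, package = "MASS")

# Exact counts per treatment group
counts <- table(anorexia$Treat)
counts

# The same information as percentages, rounded to one decimal place
round(100 * prop.table(counts), 1)
```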

Hand, D. J., Daly, F., McConway, K., Lunn, D. and Ostrowski, E. eds (1993) A Handbook of Small Data Sets. Chapman & Hall, Data set 285 (p. 229)
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

18/25

Case study 8 Titanic

What do the graphs for each categorical variable tell us?

  • There were more crew members than passengers in any single class (1st, 2nd or 3rd).
  • There were far more males on the ship, possibly because the majority of crew members were male. You can explore this further by constructing two-way tables or graphs that consider both variables.
  • Most people on board were adults.
  • More than two-thirds of the people on board died.
df9 <- as_tibble(Titanic)
skimr::skim(df9)
## ── Data Summary ────────────────────────
## Values
## Name df9
## Number of rows 32
## Number of columns 5
## _______________________
## Column type frequency:
## character 4
## numeric 1
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 Class 0 1 3 4 0 4 0
## 2 Sex 0 1 4 6 0 2 0
## 3 Age 0 1 5 5 0 2 0
## 4 Survived 0 1 2 3 0 2 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 n 0 1 68.8 136. 0 0.75 13.5 77 670 ▇▁▁▁▁
df9 %>%
  group_by(Class) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(Class, total)) +
  geom_col(fill = "#ee64a4") +
  labs(x = "", y = "Frequency")
df9 %>%
  group_by(Sex) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(Sex, total)) +
  geom_col(fill = "#746FB2") +
  labs(x = "", y = "Frequency")
df9 %>%
  group_by(Age) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(Age, total)) +
  geom_col(fill = "#C8008F") +
  labs(x = "", y = "Frequency")
df9 %>%
  group_by(Survived) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(Survived, total)) +
  geom_col(fill = "#795549") +
  labs(x = "Survived", y = "Frequency")
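One way to follow up on the suggestion about two-way tables is sketched below. It collapses the built-in Titanic array over Age and Survived to get a Class by Sex table, then shows the share of each sex within every class.

```r
# Titanic is a 4-way contingency table in base R (Class, Sex, Age, Survived)
class_by_sex <- margin.table(Titanic, margin = c(1, 2))
class_by_sex

# Row-wise proportions: share of males and females within each class
round(prop.table(class_by_sex, margin = 1), 2)
```

The crew row makes it easy to check whether the overall surplus of males is driven by the crew.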

British Board of Trade (1990), Report on the Loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing

19/25

Coloring bars

  • Colour here doesn't add information as the x-axis already tells us about the categories, but colouring bars can make it more visually appealing.
  • If you have too many categories, colour won't work well to differentiate them.
20/25

Case study 9 Opinion poll in Ireland Aug 2013

  • Pie charts are popular in mainstream media but are not generally recommended, as people are poor at comparing angles.
  • 3D pie charts should definitely be avoided!
  • Here you can see that many people are "Undecided" about which political party to support, and failing to account for this paints a different picture.
df9 <- tibble(party = c("Fine Gael", "Labour", "Fianna Fail",
                        "Sinn Fein", "Indeps", "Green", "Undecided"),
              nos = c(181, 51, 171, 119, 91, 4, 368))
df9v2 <- df9 %>% filter(party != "Undecided")
df9
## # A tibble: 7 x 2
## party nos
## <chr> <dbl>
## 1 Fine Gael 181
## 2 Labour 51
## 3 Fianna Fail 171
## 4 Sinn Fein 119
## 5 Indeps 91
## 6 Green 4
## 7 Undecided 368
g9 <- df9 %>%
  ggplot(aes("", nos, fill = party)) +
  geom_col(color = "black") +
  labs(y = "", x = "") +
  coord_polar("y") +
  theme(axis.line = element_blank(),
        axis.line.y = element_blank(),
        axis.text = element_blank(),
        panel.grid.major = element_blank()) +
  scale_fill_discrete_qualitative(name = "Party")
g9
g9 %+% df9v2 +
  # below is needed to keep the same color scheme as before
  scale_fill_manual(values = qualitative_hcl(7)[1:6])
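To see how much the "Undecided" group changes the picture, the shares can be computed directly. This is a small base R sketch that reuses the counts from the poll data; the variable names pct_all and pct_decided are introduced here for illustration.

```r
# Counts copied from the poll data shown in this case study
party <- c("Fine Gael", "Labour", "Fianna Fail",
           "Sinn Fein", "Indeps", "Green", "Undecided")
nos <- c(181, 51, 171, 119, 91, 4, 368)

# Share of all respondents, including "Undecided"
pct_all <- round(100 * nos / sum(nos), 1)

# Share among respondents who named a party
decided <- party != "Undecided"
pct_decided <- round(100 * nos[decided] / sum(nos[decided]), 1)

data.frame(party = party[decided],
           pct_all = pct_all[decided],
           pct_decided = pct_decided)
```

Every party's share is noticeably larger once the undecided respondents are dropped, which is the effect the second pie chart shows.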
21/25

A pie chart is just a stacked barplot with a transformed coordinate system

df <- data.frame(var = c("A", "B", "C"), perc = c(40, 40, 20))
g <- ggplot(df, aes("", perc, fill = var)) +
  geom_col()
g

g + coord_polar("y")

22/25

A rose plot is just a barplot with a transformed coordinate system

# seed added so the random counts are reproducible; the value is arbitrary
set.seed(1)
dummy <- data.frame(var = LETTERS[1:20],
                    n = round(rexp(20, 1/100)))
g <- ggplot(dummy, aes(var, n)) +
  geom_col(fill = "pink", color = "black")
g

g + coord_polar("x") + theme_void()

23/25

Take away messages

  • Again, be prepared to do multiple plots
  • Changing bins or bandwidth in histogram, violin or density plots can paint a different picture
  • Consider different representations of categorical variables (reordering meaningfully, lumping low frequencies together, plot or table, pie or barplot, missing categories)
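The point about bins and bandwidth can be tried out with a few lines. The sketch below is only an illustration: it reuses the Postwt variable from the anorexia data and the specific bin counts and adjust values are arbitrary choices.

```r
library(ggplot2)

# anorexia ships with the MASS package; Postwt is weight after treatment
data(anorexia, package = "MASS")
p <- ggplot(anorexia, aes(Postwt))

# Same data, two different numbers of bins
p_coarse <- p + geom_histogram(bins = 5)
p_fine <- p + geom_histogram(bins = 30)

# Same data, two bandwidths: adjust multiplies the default bandwidth
p_smooth <- p + geom_density(adjust = 2)
p_wiggly <- p + geom_density(adjust = 0.25)
```

Printing each object in turn (for example p_coarse, then p_fine) shows how the apparent shape of the distribution depends on these settings.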
24/25

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 4 - Session 2


25/25
