class: middle center hide-slide-number monash-bg-gray80 .info-box.w-50.bg-white[ These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-04B.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>. ] <br> .white[Press the **right arrow** to progress to the next slide!] --- class: title-slide count: false background-image: url("images/bg-01.png") # .monash-blue[ETC5521: Exploratory Data Analysis] <h1 class="monash-blue" style="font-size: 30pt!important;"></h1> <br> <h2 style="font-weight:900!important;">Working with a single variable, making transformations, detecting outliers, using robust statistics</h2> .bottom_abs.width100[ Lecturer: *Emi Tanaka* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 4 - Session 2 <br> ] --- class: transition # Bins and Bandwidths --- # .orange[Case study] .circle.bg-orange.white[3] Boston housing data .f4[Part 1/4] .panelset[ .panel[.panel-name[π] .grid[ .item[ <img src="images/week4B/boston-plot1-1.png" width="460.8" style="display: block; margin: auto;" /> ] .item[ {{content}} ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r data(bostonc, package = "DAAG") df3 <- read_tsv(bostonc[10:length(bostonc)]) skimr::skim(df3) ``` ``` ## ββ Data Summary ββββββββββββββββββββββββ ## Values ## Name df3 ## Number of rows 506 ## Number of columns 21 ## _______________________ ## Column type frequency: ## character 2 ## numeric 19 ## ________________________ ## Group variables None ## ## ββ Variable type: character ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 TOWN 0 1 4 23 0 92 0 ## 2 TRACT 0 1 4 4 0 506 0 ## ## ββ Variable type: numeric 
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 OBS. 0 1 254. 146. 1 127. 254. 380. 506 βββββ ## 2 TOWN# 0 1 47.5 27.6 0 26.2 42 78 91 β ββ ββ ## 3 LON 0 1 -71.1 0.0754 -71.3 -71.1 -71.1 -71.0 -70.8 βββββ ## 4 LAT 0 1 42.2 0.0618 42.0 42.2 42.2 42.3 42.4 βββββ ## 5 MEDV 0 1 22.5 9.20 5 17.0 21.2 25 50 βββ ββ ## 6 CMEDV 0 1 22.5 9.18 5 17.0 21.2 25 50 βββ ββ ## 7 CRIM 0 1 3.61 8.60 0.00632 0.0820 0.257 3.68 89.0 βββββ ## 8 ZN 0 1 11.4 23.3 0 0 0 12.5 100 βββββ ## 9 INDUS 0 1 11.1 6.86 0.46 5.19 9.69 18.1 27.7 βββββ ## 10 CHAS 0 1 0.0692 0.254 0 0 0 0 1 βββββ ## 11 NOX 0 1 0.555 0.116 0.385 0.449 0.538 0.624 0.871 ββββ β ## 12 RM 0 1 6.28 0.703 3.56 5.89 6.21 6.62 8.78 βββββ ## 13 AGE 0 1 68.6 28.1 2.9 45.0 77.5 94.1 100 βββββ ## 14 DIS 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.1 ββ βββ ## 15 RAD 0 1 9.55 8.71 1 4 5 24 24 βββββ ## 16 TAX 0 1 408. 169. 187 279 330 666 711 βββββ ## 17 PTRATIO 0 1 18.5 2.16 12.6 17.4 19.0 20.2 22 βββ β β ## 18 B 0 1 357. 91.3 0.32 375. 391. 396. 397. βββββ ## 19 LSTAT 0 1 12.7 7.14 1.73 6.95 11.4 17.0 38.0 βββ ββ ``` ]] .panel[.panel-name[R] .f5[ ```r ggplot(df3, aes(MEDV)) + geom_histogram(binwidth = 1, color = "black", fill = "#008A25") + labs(x = "Median housing value (US$1000)", y = "Frequency") ``` ]] ] .footnote.f6[ Harrison, David, and Daniel L. Rubinfeld (1978) Hedonic Housing Prices and the Demand for Clean Air, *Journal of Environmental Economics and Management* **5** 81-102. Original data.<br> Gilley, O.W. and R. Kelley Pace (1996) On the Harrison and Rubinfeld Data. *Journal of Environmental Economics and Management* **31** 403-405. Provided corrections and examined censoring.<br> Maindonald, John H. and Braun, W. John (2020). DAAG: Data Analysis and Graphics Data and Functions. 
R package version 1.24
]

--

* There is a large frequency in the final bin.
* There is a decline in observations in the $40-49K range, as well as dips in observations around $26K and $34K.
* The histogram uses a bin width of 1 unit and is **left-open** (or **right-closed**): (4.5, 5.5], (5.5, 6.5] ... (49.5, 50.5].
* Occasionally, whether it is left- or right-open can make a difference.

---

# .orange[Case study] .circle.bg-orange.white[3] Boston housing data .f4[Part 2/4]

.panelset[
.panel[.panel-name[π]
.grid[
.item[

<img src="images/week4B/boston-plot2-1.png" width="432" style="display: block; margin: auto;" />

<img src="images/week4B/boston-plot3-1.png" width="432" style="display: block; margin: auto;" />

<img src="images/week4B/boston-plot4-1.png" width="432" style="display: block; margin: auto;" />

]
.item[

* Density plots depend on the bandwidth chosen and often do not estimate well near the boundaries of the data
* There are various ways to present features of the data using a plot, and what works for one person may not be as straightforward for another
* Be prepared to do multiple plots!
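
As a rough illustration of the bandwidth point, the sketch below uses simulated values (an assumption, not the Boston data) and a hypothetical helper, `count_modes()`, to count the peaks of a kernel density estimate at two bandwidths:

```r
# Simulated values (NOT the Boston data): two well-separated groups
set.seed(1)
x <- c(rnorm(200, mean = 20, sd = 2), rnorm(100, mean = 35, sd = 2))

# Hypothetical helper: count the local maxima of a kernel density estimate
count_modes <- function(x, bw) {
  d <- density(x, bw = bw)
  # a peak is where the slope changes from positive to negative
  sum(diff(sign(diff(d$y))) == -2)
}

modes_narrow <- count_modes(x, bw = 1)   # small bandwidth keeps the groups apart
modes_wide   <- count_modes(x, bw = 10)  # large bandwidth smooths them together
c(narrow = modes_narrow, wide = modes_wide)
```

The same sample can look bimodal or unimodal depending only on `bw`, which is one reason to draw more than one plot.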
] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r data(bostonc, package = "DAAG") df3 <- read_tsv(bostonc[10:length(bostonc)]) skimr::skim(df3) ``` ``` ## ββ Data Summary ββββββββββββββββββββββββ ## Values ## Name df3 ## Number of rows 506 ## Number of columns 21 ## _______________________ ## Column type frequency: ## character 2 ## numeric 19 ## ________________________ ## Group variables None ## ## ββ Variable type: character ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 TOWN 0 1 4 23 0 92 0 ## 2 TRACT 0 1 4 4 0 506 0 ## ## ββ Variable type: numeric ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 OBS. 0 1 254. 146. 1 127. 254. 380. 506 βββββ ## 2 TOWN# 0 1 47.5 27.6 0 26.2 42 78 91 β ββ ββ ## 3 LON 0 1 -71.1 0.0754 -71.3 -71.1 -71.1 -71.0 -70.8 βββββ ## 4 LAT 0 1 42.2 0.0618 42.0 42.2 42.2 42.3 42.4 βββββ ## 5 MEDV 0 1 22.5 9.20 5 17.0 21.2 25 50 βββ ββ ## 6 CMEDV 0 1 22.5 9.18 5 17.0 21.2 25 50 βββ ββ ## 7 CRIM 0 1 3.61 8.60 0.00632 0.0820 0.257 3.68 89.0 βββββ ## 8 ZN 0 1 11.4 23.3 0 0 0 12.5 100 βββββ ## 9 INDUS 0 1 11.1 6.86 0.46 5.19 9.69 18.1 27.7 βββββ ## 10 CHAS 0 1 0.0692 0.254 0 0 0 0 1 βββββ ## 11 NOX 0 1 0.555 0.116 0.385 0.449 0.538 0.624 0.871 ββββ β ## 12 RM 0 1 6.28 0.703 3.56 5.89 6.21 6.62 8.78 βββββ ## 13 AGE 0 1 68.6 28.1 2.9 45.0 77.5 94.1 100 βββββ ## 14 DIS 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.1 ββ βββ ## 15 RAD 0 1 9.55 8.71 1 4 5 24 24 βββββ ## 16 TAX 0 1 408. 169. 187 279 330 666 711 βββββ ## 17 PTRATIO 0 1 18.5 2.16 12.6 17.4 19.0 20.2 22 βββ β β ## 18 B 0 1 357. 91.3 0.32 375. 391. 396. 397. 
βββββ ## 19 LSTAT 0 1 12.7 7.14 1.73 6.95 11.4 17.0 38.0 βββ ββ ``` ]] .panel[.panel-name[R] .f4[ ```r ggplot(df3, aes(MEDV, y = "")) + geom_boxplot(fill = "#008A25") + labs(x = "Median housing value (US$1000)", y = "") + theme(axis.line.y = element_blank()) ggplot(df3, aes(MEDV, y = "")) + geom_jitter() + labs(x = "Median housing value (US$1000)", y = "") + theme(axis.line.y = element_blank()) ggplot(df3, aes(MEDV)) + geom_density() + geom_rug() + labs(x = "Median housing value (US$1000)", y = "") + theme(axis.line.y = element_blank()) ``` ] ] ] --- # .orange[Case study] .circle.bg-orange.white[3] Boston housing data .f4[Part 3/4] .panelset[ .panel[.panel-name[π] .grid[ .item[ <img src="images/week4B/boston-plot5-1.png" width="432" style="display: block; margin: auto;" /> <img src="images/week4B/boston-plot6-1.png" width="432" style="display: block; margin: auto;" /> <img src="images/week4B/boston-plot7-1.png" width="432" style="display: block; margin: auto;" /> ] .item[ <img src="images/week4B/boston-plot8-1.png" width="432" style="display: block; margin: auto;" /> <img src="images/week4B/boston-plot9-1.png" width="432" style="display: block; margin: auto;" /> <img src="images/week4B/boston-plot10-1.png" width="432" style="display: block; margin: auto;" /> ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r data(bostonc, package = "DAAG") df3 <- read_tsv(bostonc[10:length(bostonc)]) skimr::skim(df3) ``` ``` ## ββ Data Summary ββββββββββββββββββββββββ ## Values ## Name df3 ## Number of rows 506 ## Number of columns 21 ## _______________________ ## Column type frequency: ## character 2 ## numeric 19 ## ________________________ ## Group variables None ## ## ββ Variable type: character ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 TOWN 0 1 4 23 0 92 0 ## 2 
TRACT 0 1 4 4 0 506 0 ## ## ββ Variable type: numeric ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 OBS. 0 1 254. 146. 1 127. 254. 380. 506 βββββ ## 2 TOWN# 0 1 47.5 27.6 0 26.2 42 78 91 β ββ ββ ## 3 LON 0 1 -71.1 0.0754 -71.3 -71.1 -71.1 -71.0 -70.8 βββββ ## 4 LAT 0 1 42.2 0.0618 42.0 42.2 42.2 42.3 42.4 βββββ ## 5 MEDV 0 1 22.5 9.20 5 17.0 21.2 25 50 βββ ββ ## 6 CMEDV 0 1 22.5 9.18 5 17.0 21.2 25 50 βββ ββ ## 7 CRIM 0 1 3.61 8.60 0.00632 0.0820 0.257 3.68 89.0 βββββ ## 8 ZN 0 1 11.4 23.3 0 0 0 12.5 100 βββββ ## 9 INDUS 0 1 11.1 6.86 0.46 5.19 9.69 18.1 27.7 βββββ ## 10 CHAS 0 1 0.0692 0.254 0 0 0 0 1 βββββ ## 11 NOX 0 1 0.555 0.116 0.385 0.449 0.538 0.624 0.871 ββββ β ## 12 RM 0 1 6.28 0.703 3.56 5.89 6.21 6.62 8.78 βββββ ## 13 AGE 0 1 68.6 28.1 2.9 45.0 77.5 94.1 100 βββββ ## 14 DIS 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.1 ββ βββ ## 15 RAD 0 1 9.55 8.71 1 4 5 24 24 βββββ ## 16 TAX 0 1 408. 169. 187 279 330 666 711 βββββ ## 17 PTRATIO 0 1 18.5 2.16 12.6 17.4 19.0 20.2 22 βββ β β ## 18 B 0 1 357. 91.3 0.32 375. 391. 396. 397. 
βββββ
## 19 LSTAT 0 1 12.7 7.14 1.73 6.95 11.4 17.0 38.0 βββ ββ
```
]]
.panel[.panel-name[R]
.f4.scroll-sign[.s500[
```r
ggplot(df3, aes(PTRATIO)) +
  geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.2) +
  labs(x = "Pupil-teacher ratio by town", y = "", title = "Bin width = 0.2, Left-open")
ggplot(df3, aes(PTRATIO)) +
  geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.5) +
  labs(x = "Pupil-teacher ratio by town", y = "", title = "Bin width = 0.5, Left-open")
ggplot(df3, aes(PTRATIO)) +
  geom_histogram(fill = "#9651A0", color = "black", bins = 30) +
  labs(x = "Pupil-teacher ratio by town", y = "", title = "Bin number = 30, Left-open")
ggplot(df3, aes(PTRATIO)) +
  geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.2, closed = "left") +
  labs(x = "Pupil-teacher ratio by town", y = "", title = "Bin width = 0.2, Right-open")
ggplot(df3, aes(PTRATIO)) +
  geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.5, closed = "left") +
  labs(x = "Pupil-teacher ratio by town", y = "", title = "Bin width = 0.5, Right-open")
ggplot(df3, aes(PTRATIO)) +
  geom_histogram(fill = "#9651A0", color = "black", bins = 30, closed = "left") +
  labs(x = "Pupil-teacher ratio by town", y = "", title = "Bin number = 30, Right-open")
```
]]]
]

---

# .orange[Case study] .circle.bg-orange.white[3] Boston housing data .f4[Part 4/4]

.panelset[
.panel[.panel-name[π]
.grid[
.item[

<img src="images/week4B/boston-plotx-1.png" width="576" style="display: block; margin: auto;" />

]
.item.f4[

* CRIM: per capita crime rate by town
* INDUS: proportion of non-retail business acres per town
* NOX: nitrogen oxides concentration (parts per 10 million)
* RM: average number of rooms per dwelling
* AGE: proportion of owner-occupied units built prior to 1940
* DIS: weighted mean of distances to 5 Boston employment centres
* RAD: index of accessibility to radial highways
* TAX: full-value property tax rate per $10K
* PTRATIO: pupil-teacher ratio by town
* LSTAT: lower status of the
population (%) * MEDV: median value of owner-occupied homes in $1000s ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r df3long <- df3 %>% pivot_longer(MEDV:LSTAT, names_to = "var", values_to = "value") %>% filter(!var %in% c("CHAS", "B", "ZN")) skimr::skim(df3long) ``` ``` ## ββ Data Summary ββββββββββββββββββββββββ ## Values ## Name df3long ## Number of rows 6072 ## Number of columns 8 ## _______________________ ## Column type frequency: ## character 3 ## numeric 5 ## ________________________ ## Group variables None ## ## ββ Variable type: character ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 TOWN 0 1 4 23 0 92 0 ## 2 TRACT 0 1 4 4 0 506 0 ## 3 var 0 1 2 7 0 12 0 ## ## ββ Variable type: numeric ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 OBS. 0 1 254. 146. 1 127 254. 380 506 βββββ ## 2 TOWN# 0 1 47.5 27.5 0 26 42 78 91 β ββ ββ ## 3 LON 0 1 -71.1 0.0753 -71.3 -71.1 -71.1 -71.0 -70.8 βββββ ## 4 LAT 0 1 42.2 0.0617 42.0 42.2 42.2 42.3 42.4 βββββ ## 5 value 0 1 49.0 120. 
0.00632 4 12.3 23.4 711 βββββ
```
]]
.panel[.panel-name[R]
.f4[
```r
ggplot(df3long, aes(value)) +
  geom_histogram(color = "white") +
  facet_wrap(~var, scales = "free") +
  labs(x = "", y = "") +
  theme(axis.text = element_text(size = 12))
```
]
]
]

---

# .orange[Case study] .circle.bg-orange.white[4] Hidalgo stamps thickness

.panelset[
.panel[.panel-name[π]

<img src="images/week4B/hidalgo-plot-1.png" width="576" style="display: block; margin: auto;" />

* A stamp collector, Walton von Winkle, bought several collections of Mexican stamps from 1872-1874 and measured the thickness of all of them.
* The two bandwidths used for the density plots suggest that there are either two or seven modes.

]
.panel[.panel-name[data]
.h300.f4.scroll-sign[
```r
load(here::here("data/Hidalgo1872.rda"))
skimr::skim(Hidalgo1872)
```

```
## ── Data Summary ────────────────────────
## Values
## Name Hidalgo1872
## Number of rows 485
## Number of columns 3
## _______________________
## Column type frequency:
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 thickness 0 1 0.0860 0.0150 0.06 0.075 0.08 0.098 0.131 β ββββ
## 2 thicknessA 195 0.598 0.0922 0.0162 0.068 0.0772 0.092 0.105 0.131 βββββ
## 3 thicknessB 289 0.404 0.0768 0.00508 0.06 0.072 0.078 0.08 0.097 βββββ
```
]]
.panel[.panel-name[R]
.f4[
```r
ggplot(Hidalgo1872, aes(thickness)) +
  geom_histogram(binwidth = 0.001, aes(y = after_stat(density))) +
  labs(x = "Thickness (0.001 mm)", y = "Density") +
  geom_density(color = "#E16A86", size = 2) +
  geom_density(color = "#00AD9A", size = 2, bw = "SJ")
```
]
]
]

---

class: transition

# Focus

---

# .orange[Case study] .circle.bg-orange.white[5] Movie length

.panelset[
.panel[.panel-name[π]
.grid[
.item[

<img
src="images/week4B/movies-plot1-1.png" width="432" style="display: block; margin: auto;" /> <img src="images/week4B/movies-plot2-1.png" width="381.6" style="display: block; margin: auto;" /> ] .item[ {{content}} ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r data(movies, package = "ggplot2movies") skimr::skim(movies) ``` ``` ## ββ Data Summary ββββββββββββββββββββββββ ## Values ## Name movies ## Number of rows 58788 ## Number of columns 24 ## _______________________ ## Column type frequency: ## character 2 ## numeric 22 ## ________________________ ## Group variables None ## ## ββ Variable type: character ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 title 0 1 1 121 0 56007 0 ## 2 mpaa 0 1 0 5 53864 5 0 ## ## ββ Variable type: numeric ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 year 0 1 1976. 23.7 1893 1958 1983 1997 2005 βββββ ## 2 length 0 1 82.3 44.3 1 74 90 100 5220 βββββ ## 3 budget 53573 0.0887 13412513. 23350085. 0 250000 3000000 15000000 200000000 βββββ ## 4 rating 0 1 5.93 1.55 1 5 6.1 7 10 βββββ ## 5 votes 0 1 632. 3830. 
5 11 30 112 157608 βββββ
## 6 r1 0 1 7.01 10.9 0 0 4.5 4.5 100 βββββ
## 7 r2 0 1 4.02 5.96 0 0 4.5 4.5 84.5 βββββ
## 8 r3 0 1 4.72 6.45 0 0 4.5 4.5 84.5 βββββ
## 9 r4 0 1 6.37 7.59 0 0 4.5 4.5 100 βββββ
## 10 r5 0 1 9.80 9.73 0 4.5 4.5 14.5 100 βββββ
## 11 r6 0 1 13.0 11.0 0 4.5 14.5 14.5 84.5 βββββ
## 12 r7 0 1 15.5 11.6 0 4.5 14.5 24.5 100 βββββ
## 13 r8 0 1 13.9 11.3 0 4.5 14.5 24.5 100 βββββ
## 14 r9 0 1 8.95 9.44 0 4.5 4.5 14.5 100 βββββ
## 15 r10 0 1 16.9 15.7 0 4.5 14.5 24.5 100 βββββ
## 16 Action 0 1 0.0797 0.271 0 0 0 0 1 βββββ
## 17 Animation 0 1 0.0628 0.243 0 0 0 0 1 βββββ
## 18 Comedy 0 1 0.294 0.455 0 0 0 1 1 βββββ
## 19 Drama 0 1 0.371 0.483 0 0 0 1 1 βββββ
## 20 Documentary 0 1 0.0591 0.236 0 0 0 0 1 βββββ
## 21 Romance 0 1 0.0807 0.272 0 0 0 0 1 βββββ
## 22 Short 0 1 0.161 0.367 0 0 0 0 1 βββββ
```
]]
.panel[.panel-name[R]
.f4[
```r
ggplot(movies, aes(length)) +
  geom_histogram(color = "white") +
  labs(x = "Length of movie (minutes)", y = "Frequency")
ggplot(movies, aes(length)) +
  geom_histogram(color = "white") +
  labs(x = "Length of movie (minutes)", y = "Frequency") +
  scale_x_log10()
movies %>%
  filter(length < 180) %>%
  ggplot(aes(length)) +
  geom_histogram(binwidth = 1, fill = "#795549", color = "black") +
  labs(x = "Length of movie (minutes)", y = "Frequency")
```
]
]
]

--

* Upon further exploration, you can find that the three movies that are well over 16 hours long are "<i>Cure for Insomnia</i>", "<i>Four Stars</i>", and "<i>Longest Most Meaningless Movie in the World</i>"

{{content}}

--

* We can restrict our attention to films under 3 hours:

<img src="images/week4B/movies-plot3-1.png" width="648" style="display: block; margin: auto;" />

{{content}}

--

* Notice that there are peaks at particular lengths. Why do you think so?
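
One way to follow up on the peaks is to tabulate the lengths and look at the most common values. The sketch below uses a small made-up vector `len` (hypothetical values, not the `movies` data); with the real data the same idea could be tried on `movies$length`.

```r
# Hypothetical film lengths in minutes (illustrative values only)
len <- c(7, 7, 85, 88, 90, 90, 90, 90, 95, 95, 100, 100, 100, 120)

# Count each distinct length and sort so the most common come first
counts <- sort(table(len), decreasing = TRUE)
head(counts, 3)

# The most common length, converted back from the table's character names
modal_length <- as.numeric(names(counts)[1])
modal_length
```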
---

class: transition middle

# Categorical variables

<br><br>
This lecture is based on Chapter 4 of
<br><br>Unwin (2015) Graphical Data Analysis with R

---

class: middle

# There are two types of categorical variables

--

<br><br>
.monash-blue[**Nominal**] where there is no intrinsic ordering to the categories<br>
**E.g.** blue, grey, black, white.

--

<br>
.monash-blue[**Ordinal**] where there is a clear order to the categories.<br>
**E.g.** Strongly disagree, disagree, neutral, agree, strongly agree.

---

# Categorical variables in R

.grid[
.item.br[

* In R, categorical variables may be encoded as **factors**.

.f4[
```r
data <- c(2, 2, 1, 1, 3, 3, 3, 1)
factor(data)
```

```
## [1] 2 2 1 1 3 3 3 1
## Levels: 1 2 3
```
]

* You can easily change the labels of the levels:

.f4[
```r
factor(data, labels = c("I", "II", "III"))
```

```
## [1] II II I I III III III I
## Levels: I II III
```
]
]
.item.f4[
{{content}}
]
]

--

* The order of the factor levels is determined by the input:

```r
*# numerical inputs are ordered in increasing order
factor(c(1, 3, 10))
```

```
## [1] 1 3 10
## Levels: 1 3 10
```

```r
*# character inputs are ordered alphabetically
factor(c("1", "3", "10"))
```

```
## [1] 1 3 10
## Levels: 1 10 3
```

```r
*# you can specify the order of levels explicitly
factor(c("1", "3", "10"), levels = c("1", "3", "10"))
```

```
## [1] 1 3 10
## Levels: 1 3 10
```

---

# Numerical factors in R

```r
x <- factor(c(10, 20, 30, 10, 20))
mean(x)
```

```
## Warning in mean.default(x): argument is not numeric or logical: returning NA
```

```
## [1] NA
```

--

<i class="fas fa-exclamation-triangle"></i> The `as.numeric` function returns the internal integer codes of the factor, not the original values

```r
mean(as.numeric(x))
```

```
## [1] 1.8
```

--

You probably want to use:

.flex[
.w-50[
```r
mean(as.numeric(levels(x)[x]))
```

```
## [1] 18
```
]
.w-50[
```r
mean(as.numeric(as.character(x)))
```

```
## [1] 18
```
]
]

---

# .orange[Revisiting Case study] .circle.bg-orange.white[1] 2019 Australian Federal Election
.panelset[
.panel[.panel-name[π]
.flex[
.w-50[

<img src="images/week4B/aus-election-plot1-1.png" width="432" style="display: block; margin: auto;" />

]
.w-50[

<img src="images/week4B/aus-election-plot2-1.png" width="432" style="display: block; margin: auto;" />

]
]
]
.panel[.panel-name[data]
.scroll-sign[.s400.f4[
```r
df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"),
                skip = 1,
                col_types = cols(
                  .default = col_character(),
                  OrdinaryVotes = col_double(),
                  AbsentVotes = col_double(),
                  ProvisionalVotes = col_double(),
                  PrePollVotes = col_double(),
                  PostalVotes = col_double(),
                  TotalVotes = col_double(),
                  Swing = col_double()))
tdf3 <- df1 %>%
  group_by(DivisionID) %>%
  summarise(DivisionNm = unique(DivisionNm),
            State = unique(StateAb),
            votes_GRN = TotalVotes[which(PartyAb == "GRN")],
            votes_total = sum(TotalVotes)) %>%
  mutate(perc_GRN = votes_GRN / votes_total * 100)
```
]]]
.panel[.panel-name[R]
.f4[
```r
tdf3 %>%
  ggplot(aes(perc_GRN, State)) +
  ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
  labs(x = "Percentage of first preference votes per division",
       y = "State",
       title = "First preference votes for the Greens party")
tdf3 %>%
  mutate(State = fct_reorder(State, perc_GRN)) %>%
  ggplot(aes(perc_GRN, State)) +
  ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
  labs(x = "Percentage of first preference votes per division",
       y = "State",
       title = "First preference votes for the Greens party")
```
]]
]

---

class: middle

# Order nominal variables meaningfully

<i class="fas fa-code"></i> **Coding tip**: use the functions below to easily change the order of factor levels

```r
stats::reorder(factor, value, mean)
forcats::fct_reorder(factor, value, median)
forcats::fct_reorder2(factor, value1, value2, func)
```

---

# .orange[Case study] .circle.bg-orange.white[6] Aspirin use after heart attack

.panelset[
.panel[.panel-name[π]
.grid[
.item[

<img src="images/week4B/meta-plot1-1.png" width="432" style="display: block; margin: 
auto;" /> <img src="images/week4B/meta-plot2-1.png" width="432" style="display: block; margin: auto;" /> ] .item[ * Meta-analysis is a statistical analysis that combines the results of multiple scientific studies. * This data studies the use of aspirin for death prevention after myocardial infarction, or in plain terms, a heart attack. * The ISIS-2 study has more patients than all other studies combined. * You could consider lumping the categories with low frequencies together. ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r data("Fleiss93", package = "meta") df6 <- Fleiss93 %>% mutate(total = n.e + n.c) skimr::skim(df6) ``` ``` ## ββ Data Summary ββββββββββββββββββββββββ ## Values ## Name df6 ## Number of rows 7 ## Number of columns 7 ## _______________________ ## Column type frequency: ## character 1 ## numeric 6 ## ________________________ ## Group variables None ## ## ββ Variable type: character ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 study 0 1 3 6 0 7 0 ## ## ββ Variable type: numeric ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 year 0 1 1979. 4.39 1974 1978. 1979 1980 1988 βββββ ## 2 event.e 0 1 304 563. 32 46.5 85 174 1570 βββββ ## 3 n.e 0 1 2027. 2959. 317 686. 810 1550. 8587 βββββ ## 4 event.c 0 1 327. 618. 38 58 67 172. 1720 βββββ ## 5 n.c 0 1 1974. 2993. 309 515 771 1554. 8600 βββββ ## 6 total 0 1 4000. 5950. 626 1228. 
1529 3103 17187 βββββ
```
]]
.panel[.panel-name[R]
.f4[
```r
df6 %>%
  mutate(study = fct_reorder(study, desc(total))) %>%
  ggplot(aes(study, total)) +
  geom_col() +
  labs(x = "", y = "Frequency") +
  guides(x = guide_axis(n.dodge = 2))
df6 %>%
  mutate(study = ifelse(total < 2000, "Other", study),
         study = fct_reorder(study, desc(total))) %>%
  ggplot(aes(study, total)) +
  geom_col() +
  labs(x = "", y = "Frequency")
```
]]
]

.f5.footnote[
Fleiss JL (1993): The statistical basis of meta-analysis. *Statistical Methods in Medical Research* **2** 121–145<br>
Balduzzi S, Rücker G, Schwarzer G (2019), How to perform a meta-analysis with R: a practical tutorial, Evidence-Based Mental Health.
]

---

class: nostripheader middle

# Consider combining factor levels with low frequencies

<i class="fas fa-code"></i> **Coding tip**: the following family of functions helps to easily lump factor levels together:

```r
forcats::fct_lump()
forcats::fct_lump_lowfreq()
forcats::fct_lump_min()
forcats::fct_lump_n()
forcats::fct_lump_prop()
# if conditioned on another variable
ifelse(cond, "Other", factor)
dplyr::case_when(cond1 ~ "level1",
                 cond2 ~ "level2",
                 TRUE ~ "Other")
```

---

# .orange[Case study] .circle.bg-orange.white[7] Anorexia

.panelset[
.panel[.panel-name[π]
.grid[
.item[

<img src="images/week4B/anorexia-plot1-1.png" width="432" style="display: block; margin: auto;" />

<table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
<thead>
<tr>
<th style="text-align:left;"> Treatment </th>
<th style="text-align:right;"> Frequency </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> CBT </td>
<td style="text-align:right;"> 29 </td>
</tr>
<tr>
<td style="text-align:left;"> Cont </td>
<td style="text-align:right;"> 26 </td>
</tr>
<tr>
<td style="text-align:left;"> FT </td>
<td style="text-align:right;"> 17 </td>
</tr>
</tbody>
</table>

]
.item[

**Table or Plot?**

{{content}}

]
]
]
.panel[.panel-name[data] .h200.f4.scroll-sign[ ```r data(anorexia, package = "MASS") df9tab <- table(anorexia$Treat) %>% as.data.frame() %>% rename(Treatment = Var1, Frequency = Freq) skimr::skim(anorexia) ``` ``` ## ββ Data Summary ββββββββββββββββββββββββ ## Values ## Name anorexia ## Number of rows 72 ## Number of columns 3 ## _______________________ ## Column type frequency: ## factor 1 ## numeric 2 ## ________________________ ## Group variables None ## ## ββ Variable type: factor βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 Treat 0 1 FALSE 3 CBT: 29, Con: 26, FT: 17 ## ## ββ Variable type: numeric ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 Prewt 0 1 82.4 5.18 70 79.6 82.3 86 94.9 ββ βββ ## 2 Postwt 0 1 85.2 8.04 71.3 79.3 84.1 91.6 104. βββ ββ ``` ]] .panel[.panel-name[R] .f4[ ```r ggplot(anorexia, aes(Treat)) + geom_bar() + labs(x = "", y = "Frequency") ``` ```r ggplot(anorexia, aes(Treat)) + stat_count(geom = "point", size = 4) + stat_count(geom = "line", group = 1) + labs(y = "Frequency", x = "") ``` ]] ] .f6.footnote[ Hand, D. J., Daly, F., McConway, K., Lunn, D. and Ostrowski, E. eds (1993) A Handbook of Small Data Sets. Chapman & Hall, Data set 285 (p. 229) <br> Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. 
ISBN 0-387-95457-0
]

--

* Table for accuracy, plot for visual communication

{{content}}

--

**Why not a point or line?**

<img src="images/week4B/anorexia-plot2-1.png" width="432" style="display: block; margin: auto;" />

{{content}}

--

<div class="f4">
<ul>
<li>This can be appropriate depending on what you want to communicate </li>
<li>A barplot occupies more area than a point, and the area does a better job of communicating size</li>
<li>A line is suggestive of a trend </li>
</ul>
</div>

---

# .orange[Case study] .circle.bg-orange.white[8] Titanic

.panelset[
.panel[.panel-name[π]
.flex[
.w-50[
.flex[
.w-50[

<img src="images/week4B/titanic-plot1-1.png" width="288" style="display: block; margin: auto;" />

<img src="images/week4B/titanic-plot2-1.png" width="216" style="display: block; margin: auto;" />

]
.w-50[

<img src="images/week4B/titanic-plot3-1.png" width="216" style="display: block; margin: auto;" />

<img src="images/week4B/titanic-plot4-1.png" width="216" style="display: block; margin: auto;" />

]
]
]
.w-50[

**What do the graphs for each categorical variable tell us?**

* There were more crew members than passengers in any single class (1st, 2nd or 3rd).
* There were far more males on the ship, possibly because the majority of crew members were male. You can explore this further by constructing two-way tables or graphs that consider both variables.
* Most people on board were adults.
* More than two-thirds of the people on board died.
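
A minimal sketch of how these statements could be checked numerically, assuming the base R `Titanic` contingency table (the same object used in the data panel); the object names below are illustrative:

```r
# Titanic ships with base R as a 4-way table: Class x Sex x Age x Survived
class_totals    <- margin.table(Titanic, 1)  # totals by Class
survived_totals <- margin.table(Titanic, 4)  # totals by Survived

class_totals
prop.table(survived_totals)  # proportions of "No" and "Yes"

# Two-way table of Class by Sex, to see how the crew affects the sex totals
class_by_sex <- margin.table(Titanic, c(1, 2))
class_by_sex
```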
]
]
]
.panel[.panel-name[data]
.h350.f4.scroll-sign[
```r
df9 <- as_tibble(Titanic)
skimr::skim(df9)
```

```
## ── Data Summary ────────────────────────
## Values
## Name df9
## Number of rows 32
## Number of columns 5
## _______________________
## Column type frequency:
## character 4
## numeric 1
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 Class 0 1 3 4 0 4 0
## 2 Sex 0 1 4 6 0 2 0
## 3 Age 0 1 5 5 0 2 0
## 4 Survived 0 1 2 3 0 2 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 n 0 1 68.8 136. 0 0.75 13.5 77 670 βββββ
```
]]
.panel[.panel-name[R]
.f4.scroll-sign[.s500[
```r
df9 %>%
  group_by(Class) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(Class, total)) +
  geom_col(fill = "#ee64a4") +
  labs(x = "", y = "Frequency")
df9 %>%
  group_by(Sex) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(Sex, total)) +
  geom_col(fill = "#746FB2") +
  labs(x = "", y = "Frequency")
df9 %>%
  group_by(Age) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(Age, total)) +
  geom_col(fill = "#C8008F") +
  labs(x = "", y = "Frequency")
df9 %>%
  group_by(Survived) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(Survived, total)) +
  geom_col(fill = "#795549") +
  labs(x = "Survived", y = "Frequency")
```
]]
]]

.f5.footnote[
British Board of Trade (1990), Report on the Loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint).
Gloucester, UK: Allan Sutton Publishing
]

---

class: nostripheader middle

# Colouring bars

<img src="images/week4B/unnamed-chunk-19-1.png" width="720" style="display: block; margin: auto;" />

--

* Colour here doesn't add information, as the x-axis already tells us about the categories, but colouring bars can make the plot more visually appealing.
* If you have too many categories, colour won't work well to differentiate them.

---

# .orange[Case study] .circle.bg-orange.white[9] Opinion poll in Ireland Aug 2013

.panelset[
.panel[.panel-name[π]
.grid[
.item[

<img src="images/week4B/poll-plot1-1.png" width="504" style="display: block; margin: auto;" />

<img src="images/week4B/poll-plot2-1.png" width="504" style="display: block; margin: auto;" />

]
.item[

* Pie charts are popular in mainstream media but are generally not recommended, as people are poor at comparing angles.
* 3D pie charts should definitely be avoided!
* Here you can see that many people are "Undecided" about which political party to support, and failing to account for this paints a different picture.
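
To put rough numbers on the last point, the sketch below recomputes each party's share with and without the "Undecided" group, using the poll counts from this case study's data panel; the object names are illustrative:

```r
# Poll counts as listed in the data panel of this case study
party <- c("Fine Gael", "Labour", "Fianna Fail", "Sinn Fein",
           "Indeps", "Green", "Undecided")
nos <- c(181, 51, 171, 119, 91, 4, 368)

# Share of all respondents, including "Undecided"
perc_all <- round(100 * nos / sum(nos), 1)

# Share among decided respondents only
decided <- party != "Undecided"
perc_decided <- round(100 * nos[decided] / sum(nos[decided]), 1)

data.frame(party = party[decided],
           perc_all = perc_all[decided],
           perc_decided = perc_decided)
```

Dropping the undecided respondents inflates every party's share, because the same counts are divided by a smaller total.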
] ] ] .panel[.panel-name[data] .f4[ ```r df9 <- tibble(party = c("Fine Gael", "Labour", "Fianna Fail", "Sinn Fein", "Indeps", "Green", "Undecided"), nos = c(181, 51, 171, 119, 91, 4, 368)) df9v2 <- df9 %>% filter(party != "Undecided") df9 ``` ``` ## # A tibble: 7 x 2 ## party nos ## <chr> <dbl> ## 1 Fine Gael 181 ## 2 Labour 51 ## 3 Fianna Fail 171 ## 4 Sinn Fein 119 ## 5 Indeps 91 ## 6 Green 4 ## 7 Undecided 368 ``` ]] .panel[.panel-name[R] .f4[ ```r g9 <- df9 %>% ggplot(aes("", nos, fill = party)) + geom_col(color = "black") + labs(y = "", x = "") + coord_polar("y") + theme(axis.line = element_blank(), axis.line.y = element_blank(), axis.text = element_blank(), panel.grid.major = element_blank()) + scale_fill_discrete_qualitative(name = "Party") g9 g9 %+% df9v2 + # below is needed to keep the same color scheme as before scale_fill_manual(values = qualitative_hcl(7)[1:6]) ``` ]]] --- class: middle # Piechart is a stacked barplot just with a transformed coordinate system -- ```r df <- data.frame(var = c("A", "B", "C"), perc = c(40, 40, 20)) g <- ggplot(df, aes("", perc, fill = var)) + geom_col() g ``` <img src="images/week4B/barplot-1.png" width="216" style="display: block; margin: auto;" /> -- ```r g + coord_polar("y") ``` <img src="images/week4B/piechart-1.png" width="216" style="display: block; margin: auto;" /> --- class: middle # Roseplot is a barplot just with a transformed coordinate system -- ```r dummy <- data.frame(var = LETTERS[1:20], n = round(rexp(20, 1/100))) g <- ggplot(dummy, aes(var, n)) + geom_col(fill = "pink", color = "black") g ``` <img src="images/week4B/nonstacked-barplot-1.png" width="720" style="display: block; margin: auto;" /> -- ```r g + coord_polar("x") + theme_void() ``` <img src="images/week4B/roseplot-1.png" width="216" style="display: block; margin: auto;" /> --- # Take away messages .flex[ .w-70.f2[ <ul class="fa-ul"> {{content}} </ul> ] ] -- <li><span class="fa-li"><i class="fas fa-paper-plane"></i></span>Again, be prepared to do 
multiple plots</li> {{content}} -- <li><span class="fa-li"><i class="fas fa-paper-plane"></i></span>Changing bins or bandwidth in histogram, violin or density plots can paint a different picture</li> {{content}} -- <li><span class="fa-li"><i class="fas fa-paper-plane"></i></span>Consider different representations of categorical variables (reordering meaningfully, lumping low frequencies together, plot or table, pie or barplot, missing categories)</li> --- background-size: cover class: title-slide background-image: url("images/bg-01.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. .bottom_abs.width100[ Lecturer: *Emi Tanaka* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 4 - Session 2 <br> ]