Loading [MathJax]/jax/output/HTML-CSS/jax.js
+ - 0:00:00
Notes for current slide
Notes for next slide

These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See here for the PDF .


Press the right arrow to progress to the next slide!

1/26

ETC5521: Exploratory Data Analysis


Working with a single variable, making transformations, detecting outliers, using robust statistics

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 4 - Session 1


1/26

Continuous variables

This lecture is partly based on Chapter 3 of

Unwin (2015) Graphical Data Analysis with R

2/26

Possible features of a single continuous variable

Feature Example Description
Asymmetry The distribution is not symmetrical.
Outliers Some observations are that are far from the rest.
Multimodality There are more than one "peak" in the observations.
Gaps Some continuous interval that are contained within the range but no observations exists.
Heaping Some values occur unexpectedly often.
Discretized Only certain values are found, e.g. due to rounding.
Implausible Values outside of plausible or likely range.
3/26

Numerical features of a single contiuous variables

  • A measure of central tendency, e.g. mean, median and mode
4/26

Numerical features of a single contiuous variables

  • A measure of central tendency, e.g. mean, median and mode
  • A measure of dispersion (also called variability or spread), e.g. variance, standard deviation and interquartile range
4/26

Numerical features of a single contiuous variables

  • A measure of central tendency, e.g. mean, median and mode
  • A measure of dispersion (also called variability or spread), e.g. variance, standard deviation and interquartile range
  • There are other measures, e.g. skewness and kurtosis that measures "tailedness", but these are not as common as the measures of first two
4/26

Numerical features of a single contiuous variables

  • A measure of central tendency, e.g. mean, median and mode
  • A measure of dispersion (also called variability or spread), e.g. variance, standard deviation and interquartile range
  • There are other measures, e.g. skewness and kurtosis that measures "tailedness", but these are not as common as the measures of first two
  • The mean is also the first moment and variance, skewness and kurtosis are second, third, and fourth central moments
4/26

Numerical features of a single contiuous variables

  • A measure of central tendency, e.g. mean, median and mode
  • A measure of dispersion (also called variability or spread), e.g. variance, standard deviation and interquartile range
  • There are other measures, e.g. skewness and kurtosis that measures "tailedness", but these are not as common as the measures of first two
  • The mean is also the first moment and variance, skewness and kurtosis are second, third, and fourth central moments

Significance tests or hypothesis tests

  • Testing for H0:μ=μ0 vs. H1:μμ0 (often μ0=0)
  • The t-test is commonly used if the underlying data are believed to be normally distributed
4/26

Case study 1 2019 Australian Federal Election Part 1/8

Context

  • There are 151 seats in the House of Representative for the 2019 Australian federal election
  • The major parties in Australia are:
    • the Coalition, comprising of the:
      • Liberal,
      • Liberal National (Qld),
      • National, and
      • Country Liberal (NT) parties, and
    • the Australian Labor party
  • The Greens party is a small but notable party

Scott Morrison

Bill Shorten

5/26

Case study 1 2019 Australian Federal Election Part 2/8

https://results.aec.gov.au/24310/Website/Downloads/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv

Given this data, what questions would you ask?
02:00

Data source: Australian Electoral Commission. (2019). Federal Elections (website), accessed August 2021. URL: https://results.aec.gov.au/

6/26

Case study 1 2019 Australian Federal Election Part 3/8

What is the number of the seats won in the House of Representatives by parties?

7/26

Case study 1 2019 Australian Federal Election Part 3/8

What is the number of the seats won in the House of Representatives by parties?

Party # of seats
Coalition 77
Liberal 44
Liberal National Party Of Queensland 23
The Nationals 10
Australian Labor Party 68
The Greens 1
Centre Alliance 1
Katter's Australian Party (Kap) 1
Independent 3

What does this table tell you?

df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"),
skip = 1,
col_types = cols(
.default = col_character(),
OrdinaryVotes = col_double(),
AbsentVotes = col_double(),
ProvisionalVotes = col_double(),
PrePollVotes = col_double(),
PostalVotes = col_double(),
TotalVotes = col_double(),
Swing = col_double()))
skimr::skim(df1)
## ── Data Summary ────────────────────────
## Values
## Name df1
## Number of rows 1207
## Number of columns 18
## _______________________
## Column type frequency:
## character 11
## numeric 7
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 StateAb 0 1 2 3 0 8 0
## 2 DivisionID 0 1 3 3 0 151 0
## 3 DivisionNm 0 1 4 15 0 151 0
## 4 CandidateID 0 1 3 5 0 1057 0
## 5 Surname 0 1 2 18 0 890 0
## 6 GivenNm 0 1 1 25 0 613 0
## 7 BallotPosition 0 1 1 3 0 14 0
## 8 Elected 0 1 1 1 0 2 0
## 9 HistoricElected 0 1 1 1 0 2 0
## 10 PartyAb 151 0.875 2 4 0 40 0
## 11 PartyNm 2 0.998 5 61 0 45 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 OrdinaryVotes 0 1 10401. 12446. 167 1867 4317 14768. 54535 ▇▁▁▁▁
## 2 AbsentVotes 0 1 511. 569. 13 117 246 711 3287 ▇▂▁▁▁
## 3 ProvisionalVotes 0 1 41.4 51.7 0 8 20 56 444 ▇▁▁▁▁
## 4 PrePollVotes 0 1 514. 607. 11 108. 211 761 5248 ▇▂▁▁▁
## 5 PostalVotes 0 1 1033. 1476. 14 181 317 1216. 9837 ▇▁▁▁▁
## 6 TotalVotes 0 1 12501. 14860. 250 2348 5196 18142 61202 ▇▁▁▁▁
## 7 Swing 0 1 1.07 4.26 -28.1 -0.73 1.21 2.75 43.5 ▁▆▇▁▁
recode_party_names <- c("Australian Labor Party (Northern Territory) Branch" = "Australian Labor Party",
"Labor" = "Australian Labor Party",
"The Greens (Vic)" = "The Greens",
"The Greens (Wa)" = "The Greens",
"Katter's Australian Party (KAP)" = "Katter's Australian Party",
"Country Liberals (Nt)" = "Country Liberals (NT)")
tdf1 <- df1 %>%
filter(Elected == "Y") %>%
mutate(PartyNm = str_to_title(PartyNm),
PartyNm = recode(PartyNm, !!!recode_party_names)) %>%
count(PartyNm, sort = TRUE) %>%
slice(2:4, 1, 8, 6, 7, 5)

Note: tidyverse is expected to be loaded already.

data.frame(PartyNm = "Coalition", n = sum(tdf1$n[1:3])) %>%
rbind(tdf1) %>%
knitr::kable(col.names = c("Party", "# of seats")) %>%
kableExtra::add_indent(2:4) %>%
kableExtra::row_spec(2:4, color = "#C8C8C8") %>%
kableExtra::kable_classic(full_width = FALSE,
font_size = 20)
7/26

Case study 1 2019 Australian Federal Election Part 3/8

What is the number of the seats won in the House of Representatives by parties?

Party # of seats
Coalition 77
Liberal 44
Liberal National Party Of Queensland 23
The Nationals 10
Australian Labor Party 68
The Greens 1
Centre Alliance 1
Katter's Australian Party (Kap) 1
Independent 3

What does this table tell you?

  • The Coalition won the government
  • Labor and Coalition hold majority of the seats in the House of Representatives (lower house)
  • Parties such as The Greens, Centre Alliance and Katter's Australian Party (KAP) won only a single seat
df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"),
skip = 1,
col_types = cols(
.default = col_character(),
OrdinaryVotes = col_double(),
AbsentVotes = col_double(),
ProvisionalVotes = col_double(),
PrePollVotes = col_double(),
PostalVotes = col_double(),
TotalVotes = col_double(),
Swing = col_double()))
skimr::skim(df1)
## ── Data Summary ────────────────────────
## Values
## Name df1
## Number of rows 1207
## Number of columns 18
## _______________________
## Column type frequency:
## character 11
## numeric 7
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 StateAb 0 1 2 3 0 8 0
## 2 DivisionID 0 1 3 3 0 151 0
## 3 DivisionNm 0 1 4 15 0 151 0
## 4 CandidateID 0 1 3 5 0 1057 0
## 5 Surname 0 1 2 18 0 890 0
## 6 GivenNm 0 1 1 25 0 613 0
## 7 BallotPosition 0 1 1 3 0 14 0
## 8 Elected 0 1 1 1 0 2 0
## 9 HistoricElected 0 1 1 1 0 2 0
## 10 PartyAb 151 0.875 2 4 0 40 0
## 11 PartyNm 2 0.998 5 61 0 45 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 OrdinaryVotes 0 1 10401. 12446. 167 1867 4317 14768. 54535 ▇▁▁▁▁
## 2 AbsentVotes 0 1 511. 569. 13 117 246 711 3287 ▇▂▁▁▁
## 3 ProvisionalVotes 0 1 41.4 51.7 0 8 20 56 444 ▇▁▁▁▁
## 4 PrePollVotes 0 1 514. 607. 11 108. 211 761 5248 ▇▂▁▁▁
## 5 PostalVotes 0 1 1033. 1476. 14 181 317 1216. 9837 ▇▁▁▁▁
## 6 TotalVotes 0 1 12501. 14860. 250 2348 5196 18142 61202 ▇▁▁▁▁
## 7 Swing 0 1 1.07 4.26 -28.1 -0.73 1.21 2.75 43.5 ▁▆▇▁▁
recode_party_names <- c("Australian Labor Party (Northern Territory) Branch" = "Australian Labor Party",
"Labor" = "Australian Labor Party",
"The Greens (Vic)" = "The Greens",
"The Greens (Wa)" = "The Greens",
"Katter's Australian Party (KAP)" = "Katter's Australian Party",
"Country Liberals (Nt)" = "Country Liberals (NT)")
tdf1 <- df1 %>%
filter(Elected == "Y") %>%
mutate(PartyNm = str_to_title(PartyNm),
PartyNm = recode(PartyNm, !!!recode_party_names)) %>%
count(PartyNm, sort = TRUE) %>%
slice(2:4, 1, 8, 6, 7, 5)

Note: tidyverse is expected to be loaded already.

data.frame(PartyNm = "Coalition", n = sum(tdf1$n[1:3])) %>%
rbind(tdf1) %>%
knitr::kable(col.names = c("Party", "# of seats")) %>%
kableExtra::add_indent(2:4) %>%
kableExtra::row_spec(2:4, color = "#C8C8C8") %>%
kableExtra::kable_classic(full_width = FALSE,
font_size = 20)
7/26

Case study 1 2019 Australian Federal Election Part 3/8

What is the number of the seats won in the House of Representatives by parties?

Party # of seats
Coalition 77
Liberal 44
Liberal National Party Of Queensland 23
The Nationals 10
Australian Labor Party 68
The Greens 1
Centre Alliance 1
Katter's Australian Party (Kap) 1
Independent 3

What does this table tell you?

  • The Coalition won the government
  • Labor and Coalition hold majority of the seats in the House of Representatives (lower house)
  • Parties such as The Greens, Centre Alliance and Katter's Australian Party (KAP) won only a single seat

Only?

df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"),
skip = 1,
col_types = cols(
.default = col_character(),
OrdinaryVotes = col_double(),
AbsentVotes = col_double(),
ProvisionalVotes = col_double(),
PrePollVotes = col_double(),
PostalVotes = col_double(),
TotalVotes = col_double(),
Swing = col_double()))
skimr::skim(df1)
## ── Data Summary ────────────────────────
## Values
## Name df1
## Number of rows 1207
## Number of columns 18
## _______________________
## Column type frequency:
## character 11
## numeric 7
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 StateAb 0 1 2 3 0 8 0
## 2 DivisionID 0 1 3 3 0 151 0
## 3 DivisionNm 0 1 4 15 0 151 0
## 4 CandidateID 0 1 3 5 0 1057 0
## 5 Surname 0 1 2 18 0 890 0
## 6 GivenNm 0 1 1 25 0 613 0
## 7 BallotPosition 0 1 1 3 0 14 0
## 8 Elected 0 1 1 1 0 2 0
## 9 HistoricElected 0 1 1 1 0 2 0
## 10 PartyAb 151 0.875 2 4 0 40 0
## 11 PartyNm 2 0.998 5 61 0 45 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 OrdinaryVotes 0 1 10401. 12446. 167 1867 4317 14768. 54535 ▇▁▁▁▁
## 2 AbsentVotes 0 1 511. 569. 13 117 246 711 3287 ▇▂▁▁▁
## 3 ProvisionalVotes 0 1 41.4 51.7 0 8 20 56 444 ▇▁▁▁▁
## 4 PrePollVotes 0 1 514. 607. 11 108. 211 761 5248 ▇▂▁▁▁
## 5 PostalVotes 0 1 1033. 1476. 14 181 317 1216. 9837 ▇▁▁▁▁
## 6 TotalVotes 0 1 12501. 14860. 250 2348 5196 18142 61202 ▇▁▁▁▁
## 7 Swing 0 1 1.07 4.26 -28.1 -0.73 1.21 2.75 43.5 ▁▆▇▁▁
recode_party_names <- c("Australian Labor Party (Northern Territory) Branch" = "Australian Labor Party",
"Labor" = "Australian Labor Party",
"The Greens (Vic)" = "The Greens",
"The Greens (Wa)" = "The Greens",
"Katter's Australian Party (KAP)" = "Katter's Australian Party",
"Country Liberals (Nt)" = "Country Liberals (NT)")
tdf1 <- df1 %>%
filter(Elected == "Y") %>%
mutate(PartyNm = str_to_title(PartyNm),
PartyNm = recode(PartyNm, !!!recode_party_names)) %>%
count(PartyNm, sort = TRUE) %>%
slice(2:4, 1, 8, 6, 7, 5)

Note: tidyverse is expected to be loaded already.

data.frame(PartyNm = "Coalition", n = sum(tdf1$n[1:3])) %>%
rbind(tdf1) %>%
knitr::kable(col.names = c("Party", "# of seats")) %>%
kableExtra::add_indent(2:4) %>%
kableExtra::row_spec(2:4, color = "#C8C8C8") %>%
kableExtra::kable_classic(full_width = FALSE,
font_size = 20)
7/26

Case study 1 2019 Australian Federal Election Part 3/8

What is the number of the seats won in the House of Representatives by parties?

Party # of seats
Coalition 77
Liberal 44
Liberal National Party Of Queensland 23
The Nationals 10
Australian Labor Party 68
The Greens 1
Centre Alliance 1
Katter's Australian Party (Kap) 1
Independent 3

What does this table tell you?

  • The Coalition won the government
  • Labor and Coalition hold majority of the seats in the House of Representatives (lower house)
  • Parties such as The Greens, Centre Alliance and Katter's Australian Party (KAP) won only a single seat

Only? Wait... Did the parties compete in all electoral districts?

df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"),
skip = 1,
col_types = cols(
.default = col_character(),
OrdinaryVotes = col_double(),
AbsentVotes = col_double(),
ProvisionalVotes = col_double(),
PrePollVotes = col_double(),
PostalVotes = col_double(),
TotalVotes = col_double(),
Swing = col_double()))
skimr::skim(df1)
## ── Data Summary ────────────────────────
## Values
## Name df1
## Number of rows 1207
## Number of columns 18
## _______________________
## Column type frequency:
## character 11
## numeric 7
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 StateAb 0 1 2 3 0 8 0
## 2 DivisionID 0 1 3 3 0 151 0
## 3 DivisionNm 0 1 4 15 0 151 0
## 4 CandidateID 0 1 3 5 0 1057 0
## 5 Surname 0 1 2 18 0 890 0
## 6 GivenNm 0 1 1 25 0 613 0
## 7 BallotPosition 0 1 1 3 0 14 0
## 8 Elected 0 1 1 1 0 2 0
## 9 HistoricElected 0 1 1 1 0 2 0
## 10 PartyAb 151 0.875 2 4 0 40 0
## 11 PartyNm 2 0.998 5 61 0 45 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 OrdinaryVotes 0 1 10401. 12446. 167 1867 4317 14768. 54535 ▇▁▁▁▁
## 2 AbsentVotes 0 1 511. 569. 13 117 246 711 3287 ▇▂▁▁▁
## 3 ProvisionalVotes 0 1 41.4 51.7 0 8 20 56 444 ▇▁▁▁▁
## 4 PrePollVotes 0 1 514. 607. 11 108. 211 761 5248 ▇▂▁▁▁
## 5 PostalVotes 0 1 1033. 1476. 14 181 317 1216. 9837 ▇▁▁▁▁
## 6 TotalVotes 0 1 12501. 14860. 250 2348 5196 18142 61202 ▇▁▁▁▁
## 7 Swing 0 1 1.07 4.26 -28.1 -0.73 1.21 2.75 43.5 ▁▆▇▁▁
recode_party_names <- c("Australian Labor Party (Northern Territory) Branch" = "Australian Labor Party",
"Labor" = "Australian Labor Party",
"The Greens (Vic)" = "The Greens",
"The Greens (Wa)" = "The Greens",
"Katter's Australian Party (KAP)" = "Katter's Australian Party",
"Country Liberals (Nt)" = "Country Liberals (NT)")
tdf1 <- df1 %>%
filter(Elected == "Y") %>%
mutate(PartyNm = str_to_title(PartyNm),
PartyNm = recode(PartyNm, !!!recode_party_names)) %>%
count(PartyNm, sort = TRUE) %>%
slice(2:4, 1, 8, 6, 7, 5)

Note: tidyverse is expected to be loaded already.

data.frame(PartyNm = "Coalition", n = sum(tdf1$n[1:3])) %>%
rbind(tdf1) %>%
knitr::kable(col.names = c("Party", "# of seats")) %>%
kableExtra::add_indent(2:4) %>%
kableExtra::row_spec(2:4, color = "#C8C8C8") %>%
kableExtra::kable_classic(full_width = FALSE,
font_size = 20)
7/26

Case study 1 2019 Australian Federal Election Part 4/8

What do you notice from this table?

tdf2 <- df1 %>%
mutate(PartyNm = str_to_title(PartyNm),
PartyNm = recode(PartyNm, !!!recode_party_names)) %>%
count(PartyNm, sort = TRUE)

You can omit table_options and toggle_select or have a look at the source Rmd to find out what it is

tdf2 %>%
DT::datatable(rownames = FALSE,
escape = FALSE,
width = "500px",
options = table_options(scrollY = "400px",
title = "Australian Federal Election 2019 - Party Distribution",
csv = "aus-election-2019-party-dist"),
elementId = "tab1B",
colnames = c("Party", "# of electorates"),
callback = toggle_select)
8/26

Case study 1 2019 Australian Federal Election Part 4/8

What do you notice from this table?

  • The Greens are represented in every electoral districts
  • United Australia Party is the only other non-major party to be represented in every electoral district
  • KAP is represented in 7 electoral districts
  • Centre Alliance is only represented in 3 electoral districts!
tdf2 <- df1 %>%
mutate(PartyNm = str_to_title(PartyNm),
PartyNm = recode(PartyNm, !!!recode_party_names)) %>%
count(PartyNm, sort = TRUE)

You can omit table_options and toggle_select or have a look at the source Rmd to find out what it is

tdf2 %>%
DT::datatable(rownames = FALSE,
escape = FALSE,
width = "500px",
options = table_options(scrollY = "400px",
title = "Australian Federal Election 2019 - Party Distribution",
csv = "aus-election-2019-party-dist"),
elementId = "tab1B",
colnames = c("Party", "# of electorates"),
callback = toggle_select)
8/26

Case study 1 2019 Australian Federal Election Part 4/8

What do you notice from this table?

  • The Greens are represented in every electoral districts
  • United Australia Party is the only other non-major party to be represented in every electoral district
  • KAP is represented in 7 electoral districts
  • Centre Alliance is only represented in 3 electoral districts!

Let's have a closer look at the Greens party...

tdf2 <- df1 %>%
mutate(PartyNm = str_to_title(PartyNm),
PartyNm = recode(PartyNm, !!!recode_party_names)) %>%
count(PartyNm, sort = TRUE)

You can omit table_options and toggle_select or have a look at the source Rmd to find out what it is

tdf2 %>%
DT::datatable(rownames = FALSE,
escape = FALSE,
width = "500px",
options = table_options(scrollY = "400px",
title = "Australian Federal Election 2019 - Party Distribution",
csv = "aus-election-2019-party-dist"),
elementId = "tab1B",
colnames = c("Party", "# of electorates"),
callback = toggle_select)
8/26

Case study 1 2019 Australian Federal Election Part 5/8

What does this graph tell you?

tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
skimr::skim(tdf3)
## ── Data Summary ────────────────────────
## Values
## Name tdf3
## Number of rows 151
## Number of columns 6
## _______________________
## Column type frequency:
## character 3
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 DivisionID 0 1 3 3 0 151 0
## 2 DivisionNm 0 1 4 15 0 151 0
## 3 State 0 1 2 3 0 8 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 votes_GRN 0 1 9821. 5581. 2744 6555 8676 11532. 45876 ▇▂▁▁▁
## 2 votes_total 0 1 99925. 9801. 51009 96372. 100936 105588 116216 ▁▁▁▇▅
## 3 perc_GRN 0 1 9.87 5.63 2.89 6.43 8.55 11.4 47.8 ▇▂▁▁▁
tdf3 %>%
ggplot(aes(perc_GRN)) +
geom_histogram(color = "white", fill = "#00843D") +
labs(x = "Percentage of first preference votes per division",
y = "Count",
title = "First preference votes for the Greens party")
9/26
  • Australia uses full-preference instant-runoff voting in single member seats
  • Following the full allocation of preferences, it is possible to derive a two-party-preferred figure, where the votes have been allocated between the two main candidates in the election.
  • In Australia, this is usually between the candidates from the Coalition parties and the Australian Labor Party.

Case study 1 2019 Australian Federal Election Part 5/8

What does this graph tell you?

  • Majority of the country does not have first preference for the Greens
  • Some constituents are slightly more supportive than the others
tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
skimr::skim(tdf3)
## ── Data Summary ────────────────────────
## Values
## Name tdf3
## Number of rows 151
## Number of columns 6
## _______________________
## Column type frequency:
## character 3
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 DivisionID 0 1 3 3 0 151 0
## 2 DivisionNm 0 1 4 15 0 151 0
## 3 State 0 1 2 3 0 8 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 votes_GRN 0 1 9821. 5581. 2744 6555 8676 11532. 45876 ▇▂▁▁▁
## 2 votes_total 0 1 99925. 9801. 51009 96372. 100936 105588 116216 ▁▁▁▇▅
## 3 perc_GRN 0 1 9.87 5.63 2.89 6.43 8.55 11.4 47.8 ▇▂▁▁▁
tdf3 %>%
ggplot(aes(perc_GRN)) +
geom_histogram(color = "white", fill = "#00843D") +
labs(x = "Percentage of first preference votes per division",
y = "Count",
title = "First preference votes for the Greens party")
9/26
  • Australia uses full-preference instant-runoff voting in single member seats
  • Following the full allocation of preferences, it is possible to derive a two-party-preferred figure, where the votes have been allocated between the two main candidates in the election.
  • In Australia, this is usually between the candidates from the Coalition parties and the Australian Labor Party.

Case study 1 2019 Australian Federal Election Part 5/8

What does this graph tell you?

  • Majority of the country does not have first preference for the Greens
  • Some constituents are slightly more supportive than the others

What further questions would you ask?

02:00
tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
skimr::skim(tdf3)
## ── Data Summary ────────────────────────
## Values
## Name tdf3
## Number of rows 151
## Number of columns 6
## _______________________
## Column type frequency:
## character 3
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 DivisionID 0 1 3 3 0 151 0
## 2 DivisionNm 0 1 4 15 0 151 0
## 3 State 0 1 2 3 0 8 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 votes_GRN 0 1 9821. 5581. 2744 6555 8676 11532. 45876 ▇▂▁▁▁
## 2 votes_total 0 1 99925. 9801. 51009 96372. 100936 105588 116216 ▁▁▁▇▅
## 3 perc_GRN 0 1 9.87 5.63 2.89 6.43 8.55 11.4 47.8 ▇▂▁▁▁
tdf3 %>%
ggplot(aes(perc_GRN)) +
geom_histogram(color = "white", fill = "#00843D") +
labs(x = "Percentage of first preference votes per division",
y = "Count",
title = "First preference votes for the Greens party")
9/26
  • Australia uses full-preference instant-runoff voting in single member seats
  • Following the full allocation of preferences, it is possible to derive a two-party-preferred figure, where the votes have been allocated between the two main candidates in the election.
  • In Australia, this is usually between the candidates from the Coalition parties and the Australian Labor Party.

Formulating questions for EDA

  • Think broad (open-ended) questions that promotes discussion and divergent thinking
10/26

Formulating questions for EDA

  • Think broad (open-ended) questions that promotes discussion and divergent thinking
  • Polar questions (i.e. the answer is yes or no) are restrictive in exploring the data
10/26

Formulating questions for EDA

  • Think broad (open-ended) questions that promotes discussion and divergent thinking
  • Polar questions (i.e. the answer is yes or no) are restrictive in exploring the data
  • For example,
Is the outlying observation the electoral district that won the seat?

10/26

Formulating questions for EDA

  • Think broad (open-ended) questions that promotes discussion and divergent thinking
  • Polar questions (i.e. the answer is yes or no) are restrictive in exploring the data
  • For example,
Is the outlying observation the electoral district that won the seat?
What is characterising the distribution of the percentage of first preference votes for the Greens party?

10/26

Formulating questions for EDA

  • Think broad (open-ended) questions that promotes discussion and divergent thinking
  • Polar questions (i.e. the answer is yes or no) are restrictive in exploring the data
  • For example,
Is the outlying observation the electoral district that won the seat?
What is characterising the distribution of the percentage of first preference votes for the Greens party?

  • What promotes a higher level of exploration?
10/26

Case study 1 2019 Australian Federal Election Part 6/8

% of first preference for the Greens
State Mean Median SD IQR Skewness Kurtosis
ACT 16.406 13.988 5.602 5.196 0.645 1.500
VIC 11.400 8.570 8.210 6.717 2.603 11.360
WA 10.993 10.756 3.018 3.116 0.802 3.026
QLD 9.764 8.808 5.096 4.753 1.092 3.886
TAS 9.721 9.339 4.009 0.985 0.326 2.493
NT 9.572 9.572 2.473 1.748 0.000 1.000
SA 9.120 8.903 3.024 3.412 0.384 2.920
NSW 8.101 6.635 4.087 3.948 1.502 4.859
National 9.874 8.547 5.632 5.001 2.671 15.798
tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
tdf3 %>%
group_by(State) %>%
summarise(mean = mean(perc_GRN),
median = median(perc_GRN),
sd = sd(perc_GRN),
iqr = IQR(perc_GRN),
skewness = moments::skewness(perc_GRN),
kurtosis = moments::kurtosis(perc_GRN)) %>%
arrange(desc(mean)) %>%
rbind(data.frame(State = "National",
mean = mean(tdf3$perc_GRN),
median = median(tdf3$perc_GRN),
sd = sd(tdf3$perc_GRN),
iqr = IQR(tdf3$perc_GRN),
skewness = moments::skewness(tdf3$perc_GRN),
kurtosis = moments::kurtosis(tdf3$perc_GRN))) %>%
knitr::kable(col.names = c("State", "Mean", "Median", "SD", "IQR", "Skewness", "Kurtosis"), digits = 3) %>%
kableExtra::kable_classic() %>%
kableExtra::add_header_above(c(" ", "% of first preference for the Greens" = 4, " " = 2)) %>%
kableExtra::row_spec(9, extra_css = "border-top: 2px solid black;")
11/26

Case study 1 2019 Australian Federal Election Part 6/8

% of first preference for the Greens
State Mean Median SD IQR Skewness Kurtosis
ACT 16.406 13.988 5.602 5.196 0.645 1.500
VIC 11.400 8.570 8.210 6.717 2.603 11.360
WA 10.993 10.756 3.018 3.116 0.802 3.026
QLD 9.764 8.808 5.096 4.753 1.092 3.886
TAS 9.721 9.339 4.009 0.985 0.326 2.493
NT 9.572 9.572 2.473 1.748 0.000 1.000
SA 9.120 8.903 3.024 3.412 0.384 2.920
NSW 8.101 6.635 4.087 3.948 1.502 4.859
National 9.874 8.547 5.632 5.001 2.671 15.798
  • Why are the means and the medians different?

  • How are the standard deviations and the interquartile ranges similar or different?

  • Are there some other numerical statistics we should show?

01:00
tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
tdf3 %>%
group_by(State) %>%
summarise(mean = mean(perc_GRN),
median = median(perc_GRN),
sd = sd(perc_GRN),
iqr = IQR(perc_GRN),
skewness = moments::skewness(perc_GRN),
kurtosis = moments::kurtosis(perc_GRN)) %>%
arrange(desc(mean)) %>%
rbind(data.frame(State = "National",
mean = mean(tdf3$perc_GRN),
median = median(tdf3$perc_GRN),
sd = sd(tdf3$perc_GRN),
iqr = IQR(tdf3$perc_GRN),
skewness = moments::skewness(tdf3$perc_GRN),
kurtosis = moments::kurtosis(tdf3$perc_GRN))) %>%
knitr::kable(col.names = c("State", "Mean", "Median", "SD", "IQR", "Skewness", "Kurtosis"), digits = 3) %>%
kableExtra::kable_classic() %>%
kableExtra::add_header_above(c(" ", "% of first preference for the Greens" = 4, " " = 2)) %>%
kableExtra::row_spec(9, extra_css = "border-top: 2px solid black;")
11/26

Robust measure of central tendency

  • Mean is a non-robust measure of location.
  • Median is the 50% quantile of the observations
  • Trimmed mean is the sample mean after discarding observations at the tails.
  • Winsorized mean is the sample mean after replacing observations at the tails with the minimum or maximum of the observations that remain.

Plot Mean Median Trimmed Mean* Winsorized Mean*
1 0.109 0.114 0.120 0.103
2 0.054 -0.045 -0.016 -0.029
3 1.177 0.729 0.820 0.888
4 0.523 0.552 0.531 0.530
5 0.432 0.285 0.337 0.361
6 2.972 2.477 2.624 2.721

* Both trimmed and Winsorized mean trimmed 20% of the tails.

12/26

Robust measure of dispersion

  • Standard deviation or its square, variance, is a popular choice of measure of dispersion but is not robust to outliers
  • Standard deviation for sample x1,...,xn is calculated as ni=1(xiˉx)2n1
  • Interquartile range is the difference between 1st and 3rd quartile and is a more robust measure of spread
  • Median absolute deviance (MAD) is also more robust and defined as median(|ximedian(xi)|)

Measure of dispersion
Plot SD IQR MAD Skewness Kurtosis
1 0.898 1.186 0.870 -0.072 3.008
2 0.986 1.411 1.077 0.358 2.212
3 1.326 1.176 0.793 1.944 7.184
4 0.288 0.450 0.335 -0.126 1.837
5 0.468 0.499 0.343 1.691 6.372
6 2.784 5.362 2.984 -0.351 1.678
13/26

Case study 1 2019 Australian Federal Election Part 7/8

We should plot the data!

  • The width of the boxplot is proportional to the number of electoral districts in the corresponding state (which is roughly proportional to the population)

What do you notice from this graph?

01:30
tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
tdf3 %>%
mutate(State = fct_reorder(State, perc_GRN)) %>%
ggplot(aes(perc_GRN, State)) +
geom_boxplot(varwidth = TRUE) +
labs(x = "Percentage of first preference votes per division",
y = "State",
title = "First preference votes for the Greens party")
14/26

Outliers

Outliers are observations that are significantly different from the majority.


  • Outliers can occur by chance in almost all distributions, but could be indicative of:
    • a measurement error,
    • a different population, or
    • an issue with the sampling process.

15/26

Closer look at the boxplot

  • Observations that are outside the range of lower to upper thresholds are referred at times as outliers
  • Plotting boxplots for data from a skewed distribution will almost always show these "outliers" but these are not necessary outliers
  • Some definitions of outliers assume a symmetrical population distribution (e.g. in boxplots or observations a certain standard deviations away from the mean) and these definitions are ill-suited for asymmetrical distributions
16/26

Closer look at the boxplot

  • Observations that are outside the range of lower to upper thresholds are referred at times as outliers
  • Plotting boxplots for data from a skewed distribution will almost always show these "outliers" but these are not necessary outliers
  • Some definitions of outliers assume a symmetrical population distribution (e.g. in boxplots or observations a certain standard deviations away from the mean) and these definitions are ill-suited for asymmetrical distributions

But are there some things we cannot see from boxplots?

16/26

Case study 1 2019 Australian Federal Election Part 8/8

tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
tdf3 %>%
mutate(State = fct_reorder(State, perc_GRN)) %>%
ggplot(aes(perc_GRN, State)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
labs(x = "Percentage of first preference votes per division",
y = "State",
title = "First preference votes for the Greens party")
17/26

Case study 1 2019 Australian Federal Election Part 8/8

Now what do you notice from this graph that you didn't notice before?

tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
tdf3 %>%
mutate(State = fct_reorder(State, perc_GRN)) %>%
ggplot(aes(perc_GRN, State)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
labs(x = "Percentage of first preference votes per division",
y = "State",
title = "First preference votes for the Greens party")
17/26

Case study 1 2019 Australian Federal Election Part 8/8

Now what do you notice from this graph that you didn't notice before?

  • There are only two electoral districts in NT!
  • And only 3 and 5 electoral districts in ACT and TAS, respectively!
tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
tdf3 %>%
mutate(State = fct_reorder(State, perc_GRN)) %>%
ggplot(aes(perc_GRN, State)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
labs(x = "Percentage of first preference votes per division",
y = "State",
title = "First preference votes for the Greens party")
17/26

Case study 1 2019 Australian Federal Election Part 8/8

Now what do you notice from this graph that you didn't notice before?

  • There are only two electoral districts in NT!
  • And only 3 and 5 electoral districts in ACT and TAS, respectively!
  • We have not computed the number of electoral districts for each state so far!
tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
tdf3 %>%
mutate(State = fct_reorder(State, perc_GRN)) %>%
ggplot(aes(perc_GRN, State)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
labs(x = "Percentage of first preference votes per division",
y = "State",
title = "First preference votes for the Greens party")
17/26

Case study 1 2019 Australian Federal Election Part 8/8

Now what do you notice from this graph that you didn't notice before?

  • There are only two electoral districts in NT!
  • And only 3 and 5 electoral districts in ACT and TAS, respectively!
  • We have not computed the number of electoral districts for each state so far!
Both numerical and graphical summaries can either reveal and/or hide aspects of the data
tdf3 <- df1 %>%
group_by(DivisionID) %>%
summarise(DivisionNm = unique(DivisionNm),
State = unique(StateAb),
votes_GRN = TotalVotes[which(PartyAb=="GRN")],
votes_total = sum(TotalVotes)) %>%
mutate(perc_GRN = votes_GRN / votes_total * 100)
tdf3 %>%
mutate(State = fct_reorder(State, perc_GRN)) %>%
ggplot(aes(perc_GRN, State)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
labs(x = "Percentage of first preference votes per division",
y = "State",
title = "First preference votes for the Greens party")
17/26

Transformations

18/26

Case study 2 Melbourne Housing Prices Part 1/5

Suburb Rooms Type Price ($) Date
Abbotsford 3 Home 1,490,000 2017-04-01
Abbotsford 3 Home 1,220,000 2017-04-01
Abbotsford 3 Home 1,420,000 2017-04-01
Aberfeldie 3 Home 1,515,000 2017-04-01
Airport West 2 Home 670,000 2017-04-01
Airport West 2 Townhouse 530,000 2017-04-01
Airport West 2 Unit 540,000 2017-04-01
Airport West 3 Home 715,000 2017-04-01
Albanvale 6 Home NA 2017-04-01
Albert Park 3 Home 1,925,000 2017-04-01
Albion 3 Unit 515,000 2017-04-01
Albion 4 Home 717,000 2017-04-01
Alphington 2 Home 1,675,000 2017-04-01
Alphington 4 Home 2,008,000 2017-04-01
Altona 2 Home 860,000 2017-04-01
Altona Meadows 4 Home NA 2017-04-01
Altona North 3 Home 720,000 2017-04-01
Armadale 2 Unit 836,000 2017-04-01
Armadale 2 Home 2,110,000 2017-04-01
Armadale 3 Home 1,386,000 2017-04-01
  • This data was scrapped each week from domain.com.au from 2016-01-28 to 2018-10-13
  • In total there are 63,023 observations
  • All variables shown (there are more variables not shown here), except price, have complete records
  • The are 48,433 property prices across Melbourne (roughly 23% missing)

Data source: Tony Pio (2018) Melbourne Housing Market, Version 27. Retrieved August 2021 from https://www.kaggle.com/anthonypino/melbourne-housing-market.

19/26

Case study 2 Melbourne Housing Prices Part 1/5

Suburb Rooms Type Price ($) Date
Abbotsford 3 Home 1,490,000 2017-04-01
Abbotsford 3 Home 1,220,000 2017-04-01
Abbotsford 3 Home 1,420,000 2017-04-01
Aberfeldie 3 Home 1,515,000 2017-04-01
Airport West 2 Home 670,000 2017-04-01
Airport West 2 Townhouse 530,000 2017-04-01
Airport West 2 Unit 540,000 2017-04-01
Airport West 3 Home 715,000 2017-04-01
Albanvale 6 Home NA 2017-04-01
Albert Park 3 Home 1,925,000 2017-04-01
Albion 3 Unit 515,000 2017-04-01
Albion 4 Home 717,000 2017-04-01
Alphington 2 Home 1,675,000 2017-04-01
Alphington 4 Home 2,008,000 2017-04-01
Altona 2 Home 860,000 2017-04-01
Altona Meadows 4 Home NA 2017-04-01
Altona North 3 Home 720,000 2017-04-01
Armadale 2 Unit 836,000 2017-04-01
Armadale 2 Home 2,110,000 2017-04-01
Armadale 3 Home 1,386,000 2017-04-01
  • This data was scrapped each week from domain.com.au from 2016-01-28 to 2018-10-13
  • In total there are 63,023 observations
  • All variables shown (there are more variables not shown here), except price, have complete records
  • The are 48,433 property prices across Melbourne (roughly 23% missing)

How would you explore this data first?

01:00

Data source: Tony Pio (2018) Melbourne Housing Market, Version 27. Retrieved August 2021 from https://www.kaggle.com/anthonypino/melbourne-housing-market.

19/26

Case study 2 Melbourne Housing Prices Part 2/5

Observations arranged by Suburb and Date:

Comparing distribution of room number for observations with missing and non-missing price records:

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
skimr::skim(df2)
## ── Data Summary ────────────────────────
## Values
## Name df2
## Number of rows 63023
## Number of columns 13
## _______________________
## Column type frequency:
## character 8
## Date 1
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 Suburb 0 1 3 18 0 380 0
## 2 Address 0 1 7 27 0 57754 0
## 3 Type 0 1 1 1 0 3 0
## 4 Method 0 1 1 2 0 9 0
## 5 SellerG 0 1 1 27 0 476 0
## 6 Postcode 0 1 4 4 0 225 0
## 7 Regionname 0 1 16 26 0 8 0
## 8 CouncilArea 0 1 17 30 0 34 0
##
## ── Variable type: Date ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max median n_unique
## 1 Date 0 1 2016-01-28 2018-10-13 2017-09-03 112
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Rooms 0 1 3.11 0.958 1 3 3 4 31 ▇▁▁▁▁
## 2 Price 14590 0.768 997898. 593499. 85000 620000 830000 1220000 11200000 ▇▁▁▁▁
## 3 Propertycount 0 1 7618. 4424. 39 4380 6795 10412 21650 ▅▇▅▂▁
## 4 Distance 0 1 12.7 7.59 0 7 11.4 16.7 64.1 ▇▆▁▁▁
df2 %>%
select(Suburb, Rooms, Type, Price, Date) %>%
arrange(Suburb, Date) %>%
visdat::vis_miss()
df2 %>%
mutate(miss = ifelse(is.na(Price), "Missing", "Recorded")) %>%
count(Rooms, miss) %>%
group_by(miss) %>%
mutate(perc = n / sum(n) * 100) %>%
ggplot(aes(as.factor(Rooms), perc, fill = miss)) +
geom_col(position = "dodge") +
scale_fill_viridis_d() +
labs(x = "Rooms", y = "Percentage", fill = "Price")
20/26

Case study 2 Melbourne Housing Prices Part 2/5

Observations arranged by Suburb and Date:

Comparing distribution of room number for observations with missing and non-missing price records:

  • Okay nothing notable as far as I can see
  • What next?
df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
skimr::skim(df2)
## ── Data Summary ────────────────────────
## Values
## Name df2
## Number of rows 63023
## Number of columns 13
## _______________________
## Column type frequency:
## character 8
## Date 1
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 Suburb 0 1 3 18 0 380 0
## 2 Address 0 1 7 27 0 57754 0
## 3 Type 0 1 1 1 0 3 0
## 4 Method 0 1 1 2 0 9 0
## 5 SellerG 0 1 1 27 0 476 0
## 6 Postcode 0 1 4 4 0 225 0
## 7 Regionname 0 1 16 26 0 8 0
## 8 CouncilArea 0 1 17 30 0 34 0
##
## ── Variable type: Date ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max median n_unique
## 1 Date 0 1 2016-01-28 2018-10-13 2017-09-03 112
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Rooms 0 1 3.11 0.958 1 3 3 4 31 ▇▁▁▁▁
## 2 Price 14590 0.768 997898. 593499. 85000 620000 830000 1220000 11200000 ▇▁▁▁▁
## 3 Propertycount 0 1 7618. 4424. 39 4380 6795 10412 21650 ▅▇▅▂▁
## 4 Distance 0 1 12.7 7.59 0 7 11.4 16.7 64.1 ▇▆▁▁▁
df2 %>%
select(Suburb, Rooms, Type, Price, Date) %>%
arrange(Suburb, Date) %>%
visdat::vis_miss()
df2 %>%
mutate(miss = ifelse(is.na(Price), "Missing", "Recorded")) %>%
count(Rooms, miss) %>%
group_by(miss) %>%
mutate(perc = n / sum(n) * 100) %>%
ggplot(aes(as.factor(Rooms), perc, fill = miss)) +
geom_col(position = "dodge") +
scale_fill_viridis_d() +
labs(x = "Rooms", y = "Percentage", fill = "Price")
20/26

Case study 2 Melbourne Housing Prices Part 3/5

What can we say from this plot?

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
df2 %>%
ggplot(aes(Price/1e6)) +
geom_histogram(color = "white") +
labs(x = "Price ($1,000,000)",
y = "Count")
21/26

Case study 2 Melbourne Housing Prices Part 3/5

What can we say from this plot?

  • The housing prices are right-skewed
df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
df2 %>%
ggplot(aes(Price/1e6)) +
geom_histogram(color = "white") +
labs(x = "Price ($1,000,000)",
y = "Count")
21/26

Case study 2 Melbourne Housing Prices Part 3/5

What can we say from this plot?

  • The housing prices are right-skewed

  • There appears to be a lot of outlying housing prices (how can we tell?)

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
df2 %>%
ggplot(aes(Price/1e6)) +
geom_histogram(color = "white") +
labs(x = "Price ($1,000,000)",
y = "Count")
21/26

Case study 2 Melbourne Housing Prices Part 4/5

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
df2 %>%
ggplot(aes(Price/1e6)) +
geom_histogram(color = "white") +
labs(x = "Price ($1,000,000)",
y = "Count") +
scale_x_log10()
22/26

Case study 2 Melbourne Housing Prices Part 4/5

  • The x-axis has been log10 transformed in this plot
df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
df2 %>%
ggplot(aes(Price/1e6)) +
geom_histogram(color = "white") +
labs(x = "Price ($1,000,000)",
y = "Count") +
scale_x_log10()
22/26

Case study 2 Melbourne Housing Prices Part 4/5

  • The x-axis has been log10 transformed in this plot
  • The plot appears more symmetrical now
df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
df2 %>%
ggplot(aes(Price/1e6)) +
geom_histogram(color = "white") +
labs(x = "Price ($1,000,000)",
y = "Count") +
scale_x_log10()
22/26

Case study 2 Melbourne Housing Prices Part 4/5

  • The x-axis has been log10 transformed in this plot
  • The plot appears more symmetrical now

  • What is a measure of central tendancy here?

With no transformation:

Mean Median Trimmed Mean Winsorised Mean
$997,898 $830,000 $871,375 $903,823

With log transformation (and back transformed to original scale):

Mean Median Trimmed Mean Winsorised Mean
$874,166 $830,000 $847,973 $859,325
df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
df2 %>%
ggplot(aes(Price/1e6)) +
geom_histogram(color = "white") +
labs(x = "Price ($1,000,000)",
y = "Count") +
scale_x_log10()
22/26

Multi-modality

23/26

Case study 2 Melbourne Housing Prices Part 5/5

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
df2 %>%
ggplot(aes(Price/1e6, as.factor(Rooms))) +
geom_violin() +
geom_boxplot(width = 0.1) +
scale_x_log10() +
labs(x = "Price ($1,000,000)", y = "# of Rooms")
24/26

Case study 2 Melbourne Housing Prices Part 5/5

  • You can see that drawing separate violin and box plots for each room number show that higher number of rooms generally are pricier
  • You could not see this, however, when the data are combined

df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"),
col_types = cols(
.default = col_character(),
Rooms = col_double(),
Price = col_double(),
Date = col_date(format = "%d/%m/%Y"),
Propertycount = col_double(),
Distance = col_double()))
df2 %>%
ggplot(aes(Price/1e6, as.factor(Rooms))) +
geom_violin() +
geom_boxplot(width = 0.1) +
scale_x_log10() +
labs(x = "Price ($1,000,000)", y = "# of Rooms")
24/26

Take away messages

25/26

Take away messages

  • Numerical and graphical summaries can reveal, but also hide, aspects of data
25/26

Take away messages

  • Numerical and graphical summaries can reveal, but also hide, aspects of data
  • Do many numerical and graphical summaries of the data!
25/26

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 4 - Session 1


26/26

ETC5521: Exploratory Data Analysis


Working with a single variable, making transformations, detecting outliers, using robust statistics

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 4 - Session 1


1/26
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow