ETC5512: Wild Caught Data
Week 8
Synthetic and Simulated Data
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
ETC5512.Clayton-x@monash.edu
13th May 2020
Press the right arrow to progress to the next slide!
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
ETC5512.Clayton-x@monash.edu
13th May 2020
Reading literacy is defined as students’ capacity to understand, use, evaluate, reflect on and engage with texts in order to achieve one’s goals, develop one’s knowledge and potential, and participate in society.
Mathematics literacy is defined as students’ capacity to formulate, employ and interpret mathematics in a variety of contexts. It includes reasoning mathematically and using mathematical concepts, procedures, facts and tools to describe, explain and predict phenomena.
Science literacy is defined as the ability to engage with science-related issues, and with the ideas of science, as a reflective citizen. A scientifically literate person is willing to engage in reasoned discourse about science and technology, which requires the competencies to explain phenomena scientifically, evaluate and design scientific enquiry, and interpret data and evidence scientifically.
http://www.oecd.org/pisa/data/2018database/
SPSS (TM) Data Files (compressed)
Student questionnaire data file
haven
package to import the PISA data.library(tidyverse)library(haven)pisa2018 <- read_sav(here::here("data", "CY07_MSU_STU_QQQ.sav")) %>% as_factor() # swap code and labels for labelled factorsdim(pisa2018)
## [1] 612004 1118
PV1MATH
= Plausible Value 1 in MathematicsPV1READ
= Plausible Value 1 in ReadingPV1SCIE
= Plausible Value 1 in Sciencepisa2018 %>% select(PV1MATH:PV10MATH) %>% pivot_longer(PV1MATH:PV10MATH, names_to = "Number", values_to = "Value") %>% # reorder factor so it is PV1MATH, ..., PV10MATH mutate(Number = fct_reorder(Number, Number, function(x) unique(parse_number(x)))) %>% ggplot(aes(x = Value, y = Number)) + labs(y = "") + ggridges::geom_density_ridges() + theme_classic(base_size = 18)
Perfect bell curves!
This section is based on upcoming book by Hofmann, Cook, Vanderplas and Wang.
pisa2018
data, the sex of the student is in variable ST004D01T
and the country/region is in variable CNT
. sex
and country
.PV1MATH
and will not cover any analysis that require us to use all 10 plausible values in this course.pisa2018c <- pisa2018 %>% rename(sex = ST004D01T, country = CNT) %>% filter(!is.na(sex)) %>% # filter two Canadian students where sex is missing filter(!is.na(PV1MATH)) %>% # Vietnam is missing scores mutate(country = case_when( country == "Brunei Darussalam" ~ "Brunei", country == "United Kingdom" ~ "UK", country %in% c("Hong Kong", "B-S-J-Z (China)") ~ "China", country == "Korea" ~ "South Korea", country == "North Macedonia" ~ "Macedonia", country == "Baku (Azerbaijan)" ~ "Baku", country %in% c("Moscow Region (RUS)", "Tatarstan (RUS)", "Russian Federation") ~ "Russia", country == "Slovak Republic" ~ "Slovakia", country == "Chinese Taipei" ~ "Taiwan", country == "United States" ~ "USA", TRUE ~ as.character(country)))
pisa2018c %>% group_by(sex, country) %>% summarise(avg = mean(PV1MATH)) %>% ungroup() %>% pivot_wider(country, names_from = sex, values_from = avg) %>% mutate(diff = Female - Male, country = fct_reorder(country, diff)) %>% ggplot(aes(x = diff, y = country)) + geom_point() + geom_vline(xintercept = 0, color = "red") + labs(y = "Country", x = "Difference in mean PV1 (girl - boy)") + theme_bw(base_size = 14)
pisa2018c %>% group_by(sex, country) %>% summarise(avg = mean(PV1MATH)) %>% ungroup() %>% pivot_wider(country, names_from = sex, values_from = avg) %>% mutate(diff = Female - Male, country = fct_reorder(country, diff)) %>% ggplot(aes(x = diff, y = country)) + geom_point() + geom_vline(xintercept = 0, color = "red") + labs(y = "Country", x = "Difference in mean PV1 (girl - boy)") + theme_bw(base_size = 14)
Sourced from PISA 2018 Integratated Design. Scroll down to see more information.
BOOKID
as Form 1-12 or 67-78 would have had mathematics component in their test.pisa2018 %>% filter(BOOKID == "Form 13") %>% select(CNT, ST004D01T, BOOKID, PV1MATH)
## # A tibble: 20,511 x 4## CNT ST004D01T BOOKID PV1MATH## <fct> <fct> <fct> <dbl>## 1 Albania Female Form 13 417.## 2 Albania Male Form 13 585.## 3 Albania Female Form 13 354.## 4 Albania Male Form 13 424.## 5 Albania Female Form 13 451.## 6 Albania Female Form 13 351.## 7 Albania Female Form 13 257.## 8 Albania Female Form 13 407.## 9 Albania Male Form 13 495.## 10 Albania Female Form 13 512.## # … with 20,501 more rows
BOOKID
as Form 1-12 or 67-78 would have had mathematics component in their test.pisa2018 %>% filter(BOOKID == "Form 13") %>% select(CNT, ST004D01T, BOOKID, PV1MATH)
## # A tibble: 20,511 x 4## CNT ST004D01T BOOKID PV1MATH## <fct> <fct> <fct> <dbl>## 1 Albania Female Form 13 417.## 2 Albania Male Form 13 585.## 3 Albania Female Form 13 354.## 4 Albania Male Form 13 424.## 5 Albania Female Form 13 451.## 6 Albania Female Form 13 351.## 7 Albania Female Form 13 257.## 8 Albania Female Form 13 407.## 9 Albania Male Form 13 495.## 10 Albania Female Form 13 512.## # … with 20,501 more rows
W_FSTUWT
). These scale the sample up to the size of the population within each country. If the unit of interest is the population of students within subset of countries, use this. SENWT
). These weights sum up to the same constant value, therefore each country will contribute equally to the analysis. If the unit of interest is the countries then use this. Jerrim et al. (2017) “What Happens When Econometrics and Psychometrics Collide? An Example Using the PISA Data.” Economics of Education Review 61 (December): 51–58.
7 | 2 | 2 | 6 | 2 | 5 | 4 | 9 | 2 | 7 | 5 | 1 | 7 | 0 | 3 | 2 | 4 | 1 | 4 | 5 | 8 | 7 | 5 | 1 |
The population average of this class is 4.12 with 3.9 for boys and 5.25 for girls.
7 | 2 | 2 | 6 | 2 | 5 | 4 | 9 | 2 | 7 | 5 | 1 | 7 | 0 | 3 | 2 | 4 | 1 | 4 | 5 | 8 | 7 | 5 | 1 |
The population average of this class is 4.12 with 3.9 for boys and 5.25 for girls.
7 | 2 | 2 | 6 | 2 | 5 | 4 | 9 | 2 | 7 | 5 | 1 | 7 | 0 | 3 | 2 | 4 | 1 | 4 | 5 | 8 | 7 | 5 | 1 |
The population average of this class is 4.12 with 3.9 for boys and 5.25 for girls.
7 | 2 | 2 | 6 | 2 | 5 | 4 | 9 | 2 | 7 | 5 | 1 | 7 | 0 | 3 | 2 | 4 | 1 | 4 | 5 | 8 | 7 | 5 | 1 |
The population average of this class is 4.12 with 3.9 for boys and 5.25 for girls.
In this case, the sampling weights are the inverse of the inclusion probability (20/3 for boys and 4/3 for girls).
A weighted mean, ˆμ, for values x1,...,xn with corresponding weights w1,...,wn is computed as
ˆμ=1∑ni=1win∑i=1wixi.
So the class population mean can be estimated as 20/3×4+4/3×5.333333320/3+4/3=4.2222222.
Notice that the estimate is closer to the class population mean.
Or you can use the weighted.mean
function in R.
mathdiff_df <- pisa2018c %>% group_by(sex, country) %>% summarise(math = weighted.mean(PV1MATH, w = SENWT)) %>% ungroup() %>% pivot_wider(country, names_from = sex, values_from = math) %>% mutate(diff = Female - Male, country = fct_reorder(country, diff)) ggplot(mathdiff_df, aes(x = diff, y = country)) + geom_point() + geom_vline(xintercept = 0, color = "red") + labs(y = "Country", x = "Difference in mean PV1 (girl - boy)") + theme_bw(base_size = 14)
map_data("world") %>% # function from ggplot2 left_join(mathdiff_df, by = c("region" = "country")) %>% ggplot(aes(long, lat, group = group, fill = diff)) + geom_polygon(color = "black") + theme_void(base_size = 18) + scale_fill_gradient2("Math Gap", na.value="grey90", low="#1B9E77", high="#D95F02", mid="white")
There are a number of ways of doing this in R but we will use sample_n
function in dplyr
📦.
set.seed(2020) # for reproducibilityboot_sample1 <- pisa2018c %>% group_by(country, sex) %>% sample_n(size = n(), replace = TRUE)
We can then treat boot_sample1
as we did before to obtain another set of estimates for the gender gap for mathematics score by country.
boot_sample1 %>%
summarise(avg = weighted.mean(PV1MATH, SENWT)) %>%
ungroup() %>%
pivot_wider(country, names_from = sex, values_from = avg) %>%
mutate(diff = Female - Male, country = fct_reorder(country, diff))
## # A tibble: 76 x 4## country Female Male diff## <fct> <dbl> <dbl> <dbl>## 1 Albania 438. 434. 3.74## 2 Argentina 373. 389. -15.4 ## 3 Australia 490. 495. -4.57## 4 Austria 497. 508. -11.1 ## 5 Baku 415. 424. -9.02## 6 Belarus 470. 472. -1.83## 7 Belgium 502. 517. -14.9 ## 8 Bosnia and Herzegovina 402. 410. -7.43## 9 Brazil 376. 391. -14.6 ## 10 Brunei 435. 425. 9.41## # … with 66 more rows
map_dfr
function from purrr
📦.boot_ests <- map_dfr(1:100, ~{
pisa2018c %>%
group_by(country, sex) %>%
sample_n(size = n(), replace = TRUE) %>%
summarise(avg = weighted.mean(PV1MATH, SENWT)) %>%
ungroup() %>%
pivot_wider(country, names_from = sex, values_from = avg) %>%
mutate(diff = Female - Male, country = fct_reorder(country, diff)) %>%
mutate(boot_id = .x)
})
.x
is substituted from an element from the first argument in map_dfr
.mathdiff2_df <- boot_ests %>% group_by(country) %>% summarise(lower = sort(diff)[5], upper = sort(diff)[95]) %>% left_join(mathdiff_df, by = "country") %>% mutate(country = fct_reorder(country, diff))
ggplot(mathdiff2_df, aes(diff, country)) + geom_point() + geom_errorbar(aes(xmin = lower, xmax = upper)) + geom_vline(xintercept = 0, color = "red") + labs(y = "Country", x = "Difference in mean PV1 (girl - boy)") + theme_bw(base_size = 14)
Suppose that n=200, β0=3, β1=−2 and σ2=1.
set.seed(2020) # for reproducibilityn <- 200 # sample sizeb0 <- 3 # interceptb1 <- -2 # slopesim_df <- tibble(id = 1:n) %>% # initialise data set mutate(x = runif(n(), 0, 10), # draw x from Uniform[0,10] y = b0 + b1 * x + rnorm(n(), 0, 1))
Obtain least squares estimates (or maximum likelihood estimate) for β0 and β1:
fit1 <- lm(y ~ x, data = sim_df)coef(fit1)
## (Intercept) x ## 2.975177 -2.020419
The estimates are pretty close to the true values!
sim2_df <- tibble(id = 1:n) %>% mutate(x = runif(n(), 0, 10), y = b0 + b1 * x + rgamma(n(), 2, 1) - 2)fit2 <- lm(y ~ x, data = sim2_df)coef(fit2)
## (Intercept) x ## 2.818092 -1.997902
combined_df <- sim2_df %>% mutate(sim = "Simulation 1") %>% rbind(mutate(sim_df, sim = "Simulation 2"))ggplot(combined_df, aes(x, y)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + facet_grid(sim ~ .) + theme_bw(base_size = 18)
combined2_df <- combined_df %>% mutate(residual = c(fit1$residuals, fit2$residuals))ggplot(combined2_df, aes(x, residual)) + geom_point() + geom_hline(yintercept = 0, color = "red") + facet_grid(sim ~ ., scales = "free") + theme_bw(base_size = 18)
H0:ei∼N(0,ˆσ2)vs.H1:not H0
c(summary(fit1)$sigma, summary(fit2)$sigma)
## [1] 1.140534 1.229640
nullabor
📦library(nullabor)method1 <- null_dist("residual", "norm", params = list(mean = 0, sd = summary(fit1)$sigma))sim_df$residual <- fit1$residualsnull1_df <- lineup(method1, sim_df)ggplot(null1_df, aes(x, residual)) + geom_point() + geom_hline(yintercept = 0, color = "red") + facet_wrap(~.sample) + theme_bw(base_size = 18)
## decrypt("E0Ui w676 VQ rnqV7VnQ KK")
## decrypt("E0Ui w676 VQ rnqV7VnQ 28")
nullabor
📦.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
ETC5512.Clayton-x@monash.edu
Lecturer: Emi Tanaka
Department of Econometrics and Business Statistics
ETC5512.Clayton-x@monash.edu
13th May 2020
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |