These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See here for the PDF .
Press the right arrow to progress to the next slide!
Lecturer: Emi Tanaka
ETC5521.Clayton-x@monash.edu
Week 8 - Session 1
Example
E.g. one may assume that yi=β0+β1xi+β2x2i+ϵi, where ϵi∼NID(0,σ2) for i=1,...,n,
Example
E.g. one may assume that yi=β0+β1xi+β2x2i+ϵi, where ϵi∼NID(0,σ2) for i=1,...,n,
Example
E.g. one may assume that yi=β0+β1xi+β2x2i+ϵi, where ϵi∼NID(0,σ2) for i=1,...,n,
Example
E.g. one may assume that yi=β0+β1xi+β2x2i+ϵi, where ϵi∼NID(0,σ2) for i=1,...,n,
Examples
Assuming a quadratic fit:
set.seed(1)df <- tibble(id = 1:200)df
## # A tibble: 200 × 1## id## <int>## 1 1## 2 2## 3 3## 4 4## 5 5## 6 6## 7 7## 8 8## 9 9## 10 10## # … with 190 more rows
set.seed(1)df <- tibble(id = 1:200) %>% mutate(x = runif(n(), -10, 10))df
## # A tibble: 200 × 2## id x## <int> <dbl>## 1 1 -4.69## 2 2 -2.56## 3 3 1.46## 4 4 8.16## 5 5 -5.97## 6 6 7.97## 7 7 8.89## 8 8 3.22## 9 9 2.58## 10 10 -8.76## # … with 190 more rows
set.seed(1)df <- tibble(id = 1:200) %>% mutate(x = runif(n(), -10, 10), y = x^2 + rnorm(n(), 0, 2))df
## # A tibble: 200 × 3## id x y## <int> <dbl> <dbl>## 1 1 -4.69 20.8 ## 2 2 -2.56 6.63 ## 3 3 1.46 0.301## 4 4 8.16 67.0 ## 5 5 -5.97 34.3 ## 6 6 7.97 67.0 ## 7 7 8.89 80.5 ## 8 8 3.22 12.2 ## 9 9 2.58 7.44 ## 10 10 -8.76 80.2 ## # … with 190 more rows
set.seed(1)df <- tibble(id = 1:200) %>% mutate(x = runif(n(), -10, 10), y = x^2 + rnorm(n(), 0, 2))
Plotting this:
ggplot(df, aes(x, y)) + geom_point()
Logistic regression models the relationship between a set of explanatory variables (xi1,...,xik) and a set of binary outcomes Yi for i=1,...,n.
We assume that Yi∼B(1,pi)≡Bernoulli(pi) and the model is given by
logit(pi)=ln(pi1−pi)=β0+β1xi1+...+βkxik.
Logistic regression models the relationship between a set of explanatory variables (xi1,...,xik) and a set of binary outcomes Yi for i=1,...,n.
We assume that Yi∼B(1,pi)≡Bernoulli(pi) and the model is given by
logit(pi)=ln(pi1−pi)=β0+β1xi1+...+βkxik.
Logistic regression models the relationship between a set of explanatory variables (xi1,...,xik) and a set of binary outcomes Yi for i=1,...,n.
We assume that Yi∼B(1,pi)≡Bernoulli(pi) and the model is given by
logit(pi)=ln(pi1−pi)=β0+β1xi1+...+βkxik.
Taking the exponential of both sides and rearranging we get pi=11+e−(β0+β1xi1+...+βkxik).
The function f(p)=ln(p1−p) is called the logit function, continuous with range (−∞,∞), and if p is the probablity of an event, f(p) is the log of the odds.
Data:
mock_df
## # A tibble: 18 × 5## Patient Smoker Sex Cancer CancerBinary## <fct> <fct> <fct> <fct> <dbl>## 1 1 Yes Female No 0## 2 2 Yes Male No 0## 3 3 No Female Yes 1## 4 4 Yes Male No 0## 5 5 Yes Female Yes 1## 6 6 No Female No 0## 7 7 Yes Female Yes 1## 8 8 No Female No 0## 9 9 No Female No 0## 10 10 No Male No 0## 11 11 Yes Male No 0## 12 12 Yes Female Yes 1## 13 13 Yes Male No 0## 14 14 Yes Female No 0## 15 15 No Male Yes 1## 16 16 No Female Yes 1## 17 17 No Male No 0## 18 18 No Male Yes 1
Summarised data:
mock_sumdf
## # A tibble: 4 × 4## # Groups: Smoker [2]## Smoker Sex Cancer Total## <fct> <fct> <int> <int>## 1 No Female 2 5## 2 No Male 2 4## 3 Yes Female 3 5## 4 Yes Male 0 4
Data:
mock_df
## # A tibble: 18 × 5## Patient Smoker Sex Cancer CancerBinary## <fct> <fct> <fct> <fct> <dbl>## 1 1 Yes Female No 0## 2 2 Yes Male No 0## 3 3 No Female Yes 1## 4 4 Yes Male No 0## 5 5 Yes Female Yes 1## 6 6 No Female No 0## 7 7 Yes Female Yes 1## 8 8 No Female No 0## 9 9 No Female No 0## 10 10 No Male No 0## 11 11 Yes Male No 0## 12 12 Yes Female Yes 1## 13 13 Yes Male No 0## 14 14 Yes Female No 0## 15 15 No Male Yes 1## 16 16 No Female Yes 1## 17 17 No Male No 0## 18 18 No Male Yes 1
Summarised data:
mock_sumdf
## # A tibble: 4 × 4## # Groups: Smoker [2]## Smoker Sex Cancer Total## <fct> <fct> <int> <int>## 1 No Female 2 5## 2 No Male 2 4## 3 Yes Female 3 5## 4 Yes Male 0 4
glm(Cancer ~ Smoker + Sex, family = binomial(link = "logit"), data = mock_df)
## ## Call: glm(formula = Cancer ~ Smoker + Sex, family = binomial(link = "logit"), ## data = mock_df)## ## Coefficients:## (Intercept) SmokerYes SexMale ## 0.2517 -0.5034 -1.1145 ## ## Degrees of Freedom: 17 Total (i.e. Null); 15 Residual## Null Deviance: 24.06 ## Residual Deviance: 22.61 AIC: 28.61
glm(cbind(Cancer, Total - Cancer) ~ Smoker + Sex, family = binomial(link = "logit"), data = mock_sumdf)
## ## Call: glm(formula = cbind(Cancer, Total - Cancer) ~ Smoker + Sex, family = binomial(link = "logit"), ## data = mock_sumdf)## ## Coefficients:## (Intercept) SmokerYes SexMale ## 0.2517 -0.5034 -1.1145 ## ## Degrees of Freedom: 3 Total (i.e. Null); 1 Residual## Null Deviance: 5.052 ## Residual Deviance: 3.604 AIC: 15.82
df <- tibble(id = 1:2000) %>% mutate(Smoker = rep(c("Yes", "No"), each = n() / 2), Sex = rep(c("Female", "Male"), times = n() / 2)) print(df, n = 4)
## # A tibble: 2,000 × 3## id Smoker Sex ## <int> <chr> <chr> ## 1 1 Yes Female## 2 2 Yes Male ## 3 3 Yes Female## 4 4 Yes Male ## # … with 1,996 more rows
df %>% group_by(Smoker, Sex) %>% count()
## # A tibble: 4 × 3## # Groups: Smoker, Sex [4]## Smoker Sex n## <chr> <chr> <int>## 1 No Female 500## 2 No Male 500## 3 Yes Female 500## 4 Yes Male 500
df <- tibble(id = 1:2000) %>% mutate(Smoker = rep(c("Yes", "No"), each = n() / 2), Sex = rep(c("Female", "Male"), times = n() / 2)) %>% rowwise() %>% mutate(CancerBinary = case_when(Smoker=="Yes" & Sex=="Female" ~ rbinom(1, 1, 0.075), Smoker=="Yes" & Sex=="Male" ~ rbinom(1, 1, 0.045), Smoker=="No" & Sex=="Female" ~ rbinom(1, 1, 0.005), Smoker=="No" & Sex=="Male" ~ rbinom(1, 1, 0.003)), Cancer = ifelse(CancerBinary, "Yes", "No"))df %>% filter(Cancer=="Yes")
## # A tibble: 53 × 5## # Rowwise: ## id Smoker Sex CancerBinary Cancer## <int> <chr> <chr> <int> <chr> ## 1 27 Yes Female 1 Yes ## 2 32 Yes Male 1 Yes ## 3 47 Yes Female 1 Yes ## 4 83 Yes Female 1 Yes ## 5 129 Yes Female 1 Yes ## 6 136 Yes Male 1 Yes ## 7 149 Yes Female 1 Yes ## 8 218 Yes Male 1 Yes ## 9 245 Yes Female 1 Yes ## 10 251 Yes Female 1 Yes ## # … with 43 more rows
At times, you may want to simulate the summary data directly instead of the individual data
Recall that if Yi∼B(1,p) for i=1,...k and Yis are independent, S=Y1+Y2+...+Yk∼B(k,p)
expand_grid(Smoker = c("Yes", "No"), Sex = c("Female", "Male")) %>% rowwise() %>% mutate(Cancer = case_when(Smoker=="Yes" & Sex=="Female" ~ rbinom(1, 500, 0.075), Smoker=="Yes" & Sex=="Male" ~ rbinom(1, 500, 0.045), Smoker=="No" & Sex=="Female" ~ rbinom(1, 500, 0.005), Smoker=="No" & Sex=="Male" ~ rbinom(1, 500, 0.003)), Total = 500)
## # A tibble: 4 × 4## # Rowwise: ## Smoker Sex Cancer Total## <chr> <chr> <int> <dbl>## 1 Yes Female 35 500## 2 Yes Male 23 500## 3 No Female 0 500## 4 No Male 3 500
expand_grid(Smoker = c("Yes", "No"), Sex = c("Female", "Male")) %>% rowwise() %>% mutate(Cancer = case_when(Smoker=="Yes" & Sex=="Female" ~ rbinom(1, 500, 0.075), Smoker=="Yes" & Sex=="Male" ~ rbinom(1, 500, 0.045), Smoker=="No" & Sex=="Female" ~ rbinom(1, 500, 0.005), Smoker=="No" & Sex=="Male" ~ rbinom(1, 500, 0.003)), Total = 500)
## # A tibble: 4 × 4## # Rowwise: ## Smoker Sex Cancer Total## <chr> <chr> <int> <dbl>## 1 Yes Female 35 500## 2 Yes Male 23 500## 3 No Female 0 500## 4 No Male 3 500
data(menarche, package = "MASS")skimr::skim(menarche)
## ── Data Summary ────────────────────────## Values ## Name menarche## Number of rows 25 ## Number of columns 3 ## _______________________ ## Column type frequency: ## numeric 3 ## ________________________ ## Group variables None ## ## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 Age 0 1 13.1 2.03 9.21 11.6 13.1 14.6 17.6 ▅▇▇▇▁## 2 Total 0 1 157. 195. 88 98 105 117 1049 ▇▁▁▁▁## 3 Menarche 0 1 92.3 204. 0 10 51 92 1049 ▇▁▁▁▁
# can't fit the model below using geom_smooth # so manually fitmenarche_model <- glm(cbind(Menarche, Total - Menarche) ~ Age, data = menarche, family = binomial(link = "logit"))# length.out = 80 to match geom_smooth defaultmenarche_pred <- data.frame(Age = seq(min(menarche$Age), max(menarche$Age), length.out = 80))menarche_pred <- menarche_pred %>% mutate(fit = predict(menarche_model, newdata = menarche_pred, type = "response"))ggplot(menarche, aes(Age, Menarche/Total)) + geom_point() + geom_line(data = menarche_pred, aes(y = fit), color = "blue")
Milicer, H. and Szczotka, F. (1966) Age at Menarche in Warsaw girls in 1965. Human Biology 38, 199–203.
fit1 <- glm(cbind(Menarche, Total - Menarche) ~ Age, family = "binomial", data = menarche)(beta <- coef(fit1))
## (Intercept) Age ## -21.226395 1.631968
menarche %>% rowwise() %>% mutate( phat = 1/(1 + exp(-(beta[1] + beta[2] * Age))), simMenarche = rbinom(1, Total, phat))
## # A tibble: 25 × 5## # Rowwise: ## Age Total Menarche phat simMenarche## <dbl> <dbl> <dbl> <dbl> <int>## 1 9.21 376 0 0.00203 1## 2 10.2 200 0 0.0103 3## 3 10.6 93 0 0.0187 2## 4 10.8 120 2 0.0279 3## 5 11.1 90 2 0.0413 1## 6 11.3 88 5 0.0609 6## 7 11.6 105 10 0.0888 9## 8 11.8 111 17 0.128 12## 9 12.1 100 16 0.181 17## 10 12.3 93 29 0.249 23## # … with 15 more rows
simulate
function which works for many model objects in Rfit1
logistic model objectsimulate(fit1, nsim = 3)
## sim_1.Menarche sim_1.V2 sim_2.Menarche sim_2.V2 sim_3.Menarche sim_3.V2## 1 0 376 0 376 0 376## 2 2 198 1 199 0 200## 3 4 89 0 93 3 90## 4 4 116 2 118 6 114## 5 8 82 5 85 3 87## 6 6 82 3 85 6 82## 7 14 91 7 98 5 100## 8 13 98 14 97 16 95## 9 21 79 18 82 20 80## 10 25 68 21 72 27 66## 11 47 53 32 68 33 67## 12 37 71 43 65 41 67## 13 47 52 54 45 59 40## 14 57 49 63 43 68 38## 15 76 29 82 23 84 21## 16 87 30 96 21 91 26## 17 83 15 82 16 84 14## 18 90 7 88 9 91 6## 19 107 13 108 12 113 7## 20 97 5 98 4 97 5## 21 115 7 119 3 119 3## 22 108 3 108 3 108 3## 23 93 1 93 1 92 2## 24 112 2 112 2 114 0## 25 1048 1 1049 0 1048 1
df1 <- menarche %>% mutate( pexp = 1/(1 + exp(-(beta[1] + beta[2] * Age))), pobs = Menarche / Total)
df1 <- menarche %>% mutate( pexp = 1/(1 + exp(-(beta[1] + beta[2] * Age))), pobs = Menarche / Total)
Goodness-of-fit type test is used commonly to assess the fit as well.
E.g. Hosmer–Lemeshow test, where test statistic is given as
H=r∑i=1((O1i−E1g)2E1i+(O0i−E0g)2E0i) where O1i (E1i) and O0i (E0i) are observed (expected) frequencies for successful and non-successful events for group i, respectively.
vcdExtra::HLtest(fit1)
## Hosmer and Lemeshow Goodness-of-Fit Test ## ## Call:## glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = "binomial", ## data = menarche)## ChiSquare df P_value## 0.1088745 8 0.9999996
For i∈{1,...,n},
Yi=β0+β1xi1+...+βkxik+ϵi, where ϵi∼NID(0,σ2) or in matrix format,
\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \sigma^2 \mathbf{I}_n)
For i \in \{1, ..., n\},
Y_i = \beta_0 + \beta_1x_{i1} + ... + \beta_{k}x_{ik} + \epsilon_i, where \color{red}{\epsilon_i \sim NID(0, \sigma^2)} or in matrix format,
\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \color{red}{\boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \sigma^2 \mathbf{I}_n)}
This means that we assume
For i \in \{1, ..., n\},
Y_i = \beta_0 + \beta_1x_{i1} + ... + \beta_{k}x_{ik} + \epsilon_i, where \color{red}{\epsilon_i \sim NID(0, \sigma^2)} or in matrix format,
\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \color{red}{\boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \sigma^2 \mathbf{I}_n)}
This means that we assume
So how do we check it?
Plot Y_i vs x_i to see if there is \approx a linear relationship between Y and x.
A boxplot of the residuals R_i to check for symmetry.
To check the homoscedasticity assumption, plot R_i vs x_i. There should be no obvious patterns.
A normal Q-Q plot, i.e. a plot of the ordered residuals vs \Phi^{-1}(\frac{i}{n+1}).
qqnorm(x)
produces a Q-Q plot of the ordered vector x
against the quantiles of the normal distribution.qqnorm
. fit <- lm(y ~ x)
By "hand"
plot(qnorm((1:n) / (n + 1)), sort(resid(fit)))
By base
qqnorm(resid(fit))qqline(resid(fit))
By ggplot2
data.frame(residual = resid(fit)) %>% ggplot(aes(sample = residual)) + stat_qq() + stat_qq_line(color="blue")
Reference: Hyndman and Fan (1996). Sample quantiles in statistical packages, American Statistician, 50, 361--365.
Simulation scheme
n <- 100x <- seq(0, 1, length.out = n)y1 <- x + rnorm(n) / 3 # Lineary2 <- 3 * (x - 0.5) ^ 2 + c(rnorm(n / 2)/3, rnorm(n / 2)/6) # Quadraticy3 <- -0.25 * sin(20 * x - 0.2) + x + rnorm(n) / 3 # Non-linearM1 <- lm(y1 ~ x); M2 <- lm(y2 ~ x); M3 <- lm(y3 ~ x)
tidyverse
suite of R packages xaringan
, remark.js, knitr
, and R Markdown.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Lecturer: Emi Tanaka
ETC5521.Clayton-x@monash.edu
Week 8 - Session 1
Lecturer: Emi Tanaka
ETC5521.Clayton-x@monash.edu
Week 8 - Session 1
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |