
ETC5521: Exploratory Data Analysis


Sculpting data using models, checking assumptions, co-dependency and performing diagnostics

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 8 - Session 1


1/25

Parametric regression

2/25

Parametric regression

  • Parametric means that the researcher or analyst assumes in advance that the data follow some type of distribution (e.g. the normal distribution).
  • E.g. one may assume that y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i, where \epsilon_i \sim NID(0, \sigma^2) for i = 1, ..., n,
    • red = estimated
    • blue = observed
  • Because some type of distribution is assumed in advance, parametric fitting can lead to fitting a smooth curve that misrepresents the data.

Example

Assuming a quadratic fit:

3/25
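
A minimal sketch of such a quadratic fit in R (assuming a data frame df with columns x and y, as in the simulation that follows; red = estimated, blue = observed, as above — fit_quad is a name introduced here):

fit_quad <- lm(y ~ x + I(x^2), data = df)   # the assumed quadratic model
ggplot(df, aes(x, y)) +
  geom_point(colour = "blue") +             # observed values
  geom_smooth(method = "lm", formula = y ~ x + I(x^2),
              se = FALSE, colour = "red")   # estimated quadratic curve (refits the same model)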

Simulating data from parametric models

  • Say a model is y = x^2 + e, \quad e \sim N(0, 2^2).
  • Then we have y \mid x \sim N(x^2, 2^2).
4/25

Simulating data from parametric models

  • Say a model is y = x^2 + e, \quad e \sim N(0, 2^2).
  • Then we have y \mid x \sim N(x^2, 2^2).
  • Let's draw 200 observations from this model.
  • Suppose that x \in (-10, 10) and that we have uniform coverage over the support.
  • The response y is generated as per the above model.
set.seed(1)
df <- tibble(id = 1:200) %>%
  mutate(x = runif(n(), -10, 10),
         y = x^2 + rnorm(n(), 0, 2))
df
## # A tibble: 200 × 3
## id x y
## <int> <dbl> <dbl>
## 1 1 -4.69 20.8
## 2 2 -2.56 6.63
## 3 3 1.46 0.301
## 4 4 8.16 67.0
## 5 5 -5.97 34.3
## 6 6 7.97 67.0
## 7 7 8.89 80.5
## 8 8 3.22 12.2
## 9 9 2.58 7.44
## 10 10 -8.76 80.2
## # … with 190 more rows

Plotting this:

ggplot(df, aes(x, y)) +
  geom_point()

5/25
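
As a quick check of the simulation (a sketch, not on the original slide), the true mean function E(y \mid x) = x^2 can be overlaid on the scatterplot (geom_function requires ggplot2 3.3.0 or later):

ggplot(df, aes(x, y)) +
  geom_point() +
  geom_function(fun = function(x) x^2, colour = "red")  # true mean curve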

Logistic regression

6/25

Logistic regression

  • Not all parametric models assume normally distributed errors or continuous responses.
  • Logistic regression models the relationship between a set of explanatory variables (x_{i1}, ..., x_{ik}) and a set of binary outcomes Y_i for i = 1, ..., n.

  • We assume that Y_i \sim B(1, p_i) \equiv Bernoulli(p_i) and the model is given by

\text{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + ... + \beta_k x_{ik}.

  • Taking the exponential of both sides and rearranging, we get p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + ... + \beta_k x_{ik})}}.

  • The function f(p) = \ln\left(\frac{p}{1 - p}\right) is called the logit function; it is continuous with range (-\infty, \infty), and if p is the probability of an event, f(p) is the log of the odds.

7/25
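
In R, the logit and its inverse are available as base functions; a small sketch:

p <- 0.8
qlogis(p)          # logit: log(p / (1 - p)) = log(4), approximately 1.386
plogis(qlogis(p))  # inverse logit, 1 / (1 + exp(-x)); recovers p = 0.8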

Representation of data for binary outcomes

Data:

mock_df
## # A tibble: 18 × 5
## Patient Smoker Sex Cancer CancerBinary
## <fct> <fct> <fct> <fct> <dbl>
## 1 1 Yes Female No 0
## 2 2 Yes Male No 0
## 3 3 No Female Yes 1
## 4 4 Yes Male No 0
## 5 5 Yes Female Yes 1
## 6 6 No Female No 0
## 7 7 Yes Female Yes 1
## 8 8 No Female No 0
## 9 9 No Female No 0
## 10 10 No Male No 0
## 11 11 Yes Male No 0
## 12 12 Yes Female Yes 1
## 13 13 Yes Male No 0
## 14 14 Yes Female No 0
## 15 15 No Male Yes 1
## 16 16 No Female Yes 1
## 17 17 No Male No 0
## 18 18 No Male Yes 1

Summarised data:

mock_sumdf
## # A tibble: 4 × 4
## # Groups: Smoker [2]
## Smoker Sex Cancer Total
## <fct> <fct> <int> <int>
## 1 No Female 2 5
## 2 No Male 2 4
## 3 Yes Female 3 5
## 4 Yes Male 0 4


  • The summarised data here give the same information as the original data, except that the patient number is lost
  • Note that the sample size, n, is larger than the number of rows in the summarised data
8/25
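
The summarised form can be derived from the individual-level data; a minimal dplyr sketch (assuming mock_df as printed above):

mock_df %>%
  group_by(Smoker, Sex) %>%
  summarise(Cancer = sum(Cancer == "Yes"),  # number of cases per group
            Total = n())                    # group size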

Logistic regression in R

  • Fitting logistic regression models in R depends on the form of the input data
glm(Cancer ~ Smoker + Sex,
    family = binomial(link = "logit"),
    data = mock_df)
##
## Call: glm(formula = Cancer ~ Smoker + Sex, family = binomial(link = "logit"),
## data = mock_df)
##
## Coefficients:
## (Intercept) SmokerYes SexMale
## 0.2517 -0.5034 -1.1145
##
## Degrees of Freedom: 17 Total (i.e. Null); 15 Residual
## Null Deviance: 24.06
## Residual Deviance: 22.61 AIC: 28.61
glm(cbind(Cancer, Total - Cancer) ~ Smoker + Sex,
    family = binomial(link = "logit"),
    data = mock_sumdf)
##
## Call: glm(formula = cbind(Cancer, Total - Cancer) ~ Smoker + Sex, family = binomial(link = "logit"),
## data = mock_sumdf)
##
## Coefficients:
## (Intercept) SmokerYes SexMale
## 0.2517 -0.5034 -1.1145
##
## Degrees of Freedom: 3 Total (i.e. Null); 1 Residual
## Null Deviance: 5.052
## Residual Deviance: 3.604 AIC: 15.82
9/25
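
A third, equivalent input form (not shown on the original slide) supplies the observed proportions as the response and the group sizes as prior weights; it should reproduce the same coefficient estimates:

glm(Cancer / Total ~ Smoker + Sex,
    family = binomial(link = "logit"),
    weights = Total,
    data = mock_sumdf)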

Simulating from a logistic regression model Part 1

  • Let's suppose that the probabilities of having cancer are as follows:
    • 0.075 for women smokers
    • 0.045 for men smokers
    • 0.005 for women non-smokers
    • 0.003 for men non-smokers
  • We'll sample 500 people for each group
  • Remember that under the logistic regression model, we assumed that Y_i \sim B(1, p_i)
df <- tibble(id = 1:2000) %>%
  mutate(Smoker = rep(c("Yes", "No"), each = n() / 2),
         Sex = rep(c("Female", "Male"), times = n() / 2)) %>%
  rowwise() %>%
  mutate(CancerBinary =
           case_when(Smoker == "Yes" & Sex == "Female" ~ rbinom(1, 1, 0.075),
                     Smoker == "Yes" & Sex == "Male"   ~ rbinom(1, 1, 0.045),
                     Smoker == "No"  & Sex == "Female" ~ rbinom(1, 1, 0.005),
                     Smoker == "No"  & Sex == "Male"   ~ rbinom(1, 1, 0.003)),
         Cancer = ifelse(CancerBinary, "Yes", "No"))
df %>%
  filter(Cancer == "Yes")
## # A tibble: 53 × 5
## # Rowwise:
## id Smoker Sex CancerBinary Cancer
## <int> <chr> <chr> <int> <chr>
## 1 27 Yes Female 1 Yes
## 2 32 Yes Male 1 Yes
## 3 47 Yes Female 1 Yes
## 4 83 Yes Female 1 Yes
## 5 129 Yes Female 1 Yes
## 6 136 Yes Male 1 Yes
## 7 149 Yes Female 1 Yes
## 8 218 Yes Male 1 Yes
## 9 245 Yes Female 1 Yes
## 10 251 Yes Female 1 Yes
## # … with 43 more rows
10/25
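
As a sanity check (a sketch, not part of the original slide), refitting a logistic regression to the simulated data should roughly recover the four group probabilities; fit_sim is a name introduced here:

fit_sim <- glm(CancerBinary ~ Smoker * Sex,
               family = binomial, data = df)
unique(predict(fit_sim, type = "response"))  # compare with 0.075, 0.045, 0.005, 0.003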

Simulating from a logistic regression model Part 2

  • At times, you may want to simulate the summary data directly instead of the individual data
  • Recall that if Y_i \sim B(1, p) for i = 1, ..., k and the Y_i are independent, then S = Y_1 + Y_2 + ... + Y_k \sim B(k, p)
expand_grid(Smoker = c("Yes", "No"), Sex = c("Female", "Male")) %>%
  rowwise() %>%
  mutate(Cancer =
           case_when(Smoker == "Yes" & Sex == "Female" ~ rbinom(1, 500, 0.075),
                     Smoker == "Yes" & Sex == "Male"   ~ rbinom(1, 500, 0.045),
                     Smoker == "No"  & Sex == "Female" ~ rbinom(1, 500, 0.005),
                     Smoker == "No"  & Sex == "Male"   ~ rbinom(1, 500, 0.003)),
         Total = 500)
## # A tibble: 4 × 4
## # Rowwise:
## Smoker Sex Cancer Total
## <chr> <chr> <int> <dbl>
## 1 Yes Female 35 500
## 2 Yes Male 23 500
## 3 No Female 0 500
## 4 No Male 3 500
11/25
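
A quick numerical illustration of that fact (a sketch; the replication count and seed are arbitrary): a sum of k independent Bernoulli(p) draws behaves like one Binomial(k, p) draw:

set.seed(1)
mean(replicate(1000, sum(rbinom(500, 1, 0.075))))  # sums of 500 Bernoulli(0.075) draws
mean(rbinom(1000, 500, 0.075))                     # direct Binomial(500, 0.075) draws
# both means are close to the theoretical mean 500 * 0.075 = 37.5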

Case study 1 Menarche

  • In 1965, the average age of 25 homogeneous groups of girls was recorded along with the number of girls in each group who had reached menarche out of the group total.

data(menarche, package = "MASS")
skimr::skim(menarche)
## ── Data Summary ────────────────────────
## Values
## Name menarche
## Number of rows 25
## Number of columns 3
## _______________________
## Column type frequency:
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Age 0 1 13.1 2.03 9.21 11.6 13.1 14.6 17.6 ▅▇▇▇▁
## 2 Total 0 1 157. 195. 88 98 105 117 1049 ▇▁▁▁▁
## 3 Menarche 0 1 92.3 204. 0 10 51 92 1049 ▇▁▁▁▁
# can't fit the model below using geom_smooth
# so manually fit
menarche_model <- glm(cbind(Menarche, Total - Menarche) ~ Age,
                      data = menarche,
                      family = binomial(link = "logit"))
# length.out = 80 to match geom_smooth default
menarche_pred <- data.frame(Age = seq(min(menarche$Age),
                                      max(menarche$Age),
                                      length.out = 80))
menarche_pred <- menarche_pred %>%
  mutate(fit = predict(menarche_model, newdata = menarche_pred, type = "response"))
ggplot(menarche, aes(Age, Menarche / Total)) +
  geom_point() +
  geom_line(data = menarche_pred, aes(y = fit), color = "blue")

Milicer, H. and Szczotka, F. (1966) Age at Menarche in Warsaw girls in 1965. Human Biology 38, 199–203.

12/25

Simulating data from a fitted logistic regression model Part 1

  • Suppose we want to simulate from the fitted model
  • We first fit the model
fit1 <- glm(cbind(Menarche, Total - Menarche) ~ Age,
            family = "binomial",
            data = menarche)
(beta <- coef(fit1))
## (Intercept)         Age
##  -21.226395    1.631968
  • The fitted regression model is given as \text{logit}(\hat{p}_i) = \hat{\beta}_0 + \hat{\beta}_1 x_{i1}.
  • Rearranging, we get \hat{p}_i = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 x_{i1})}}.
  • Simulating from first principles:
menarche %>%
  rowwise() %>%
  mutate(phat = 1 / (1 + exp(-(beta[1] + beta[2] * Age))),
         simMenarche = rbinom(1, Total, phat))
## # A tibble: 25 × 5
## # Rowwise:
## Age Total Menarche phat simMenarche
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 9.21 376 0 0.00203 1
## 2 10.2 200 0 0.0103 3
## 3 10.6 93 0 0.0187 2
## 4 10.8 120 2 0.0279 3
## 5 11.1 90 2 0.0413 1
## 6 11.3 88 5 0.0609 6
## 7 11.6 105 10 0.0888 9
## 8 11.8 111 17 0.128 12
## 9 12.1 100 16 0.181 17
## 10 12.3 93 29 0.249 23
## # … with 15 more rows
13/25
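
Equivalently (a side note, not on the original slide), the fitted probabilities can be obtained without hand-coding the inverse logit:

phat <- predict(fit1, type = "response")  # same values as 1/(1 + exp(-(beta[1] + beta[2] * Age)))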

Simulating data from a fitted logistic regression model Part 2

  • An easier way to do this is to use the simulate() function, which works for many model objects in R
  • Below, it simulates 3 sets of responses (i.e. counts of "success" and "failure" events) from the fit1 logistic model object
simulate(fit1, nsim = 3)
## sim_1.Menarche sim_1.V2 sim_2.Menarche sim_2.V2 sim_3.Menarche sim_3.V2
## 1 0 376 0 376 0 376
## 2 2 198 1 199 0 200
## 3 4 89 0 93 3 90
## 4 4 116 2 118 6 114
## 5 8 82 5 85 3 87
## 6 6 82 3 85 6 82
## 7 14 91 7 98 5 100
## 8 13 98 14 97 16 95
## 9 21 79 18 82 20 80
## 10 25 68 21 72 27 66
## 11 47 53 32 68 33 67
## 12 37 71 43 65 41 67
## 13 47 52 54 45 59 40
## 14 57 49 63 43 68 38
## 15 76 29 82 23 84 21
## 16 87 30 96 21 91 26
## 17 83 15 82 16 84 14
## 18 90 7 88 9 91 6
## 19 107 13 108 12 113 7
## 20 97 5 98 4 97 5
## 21 115 7 119 3 119 3
## 22 108 3 108 3 108 3
## 23 93 1 93 1 92 2
## 24 112 2 112 2 114 0
## 25 1048 1 1049 0 1048 1
14/25

Diagnostics for logistic regression models

  • One diagnostic is to compare the observed and expected proportions under the logistic regression fit.
df1 <- menarche %>%
  mutate(pexp = 1 / (1 + exp(-(beta[1] + beta[2] * Age))),
         pobs = Menarche / Total)

  • Goodness-of-fit tests are also commonly used to assess the fit.

  • E.g. the Hosmer–Lemeshow test, whose test statistic is given as

H = \sum_{i=1}^{r} \left( \frac{(O_{1i} - E_{1i})^2}{E_{1i}} + \frac{(O_{0i} - E_{0i})^2}{E_{0i}} \right), where O_{1i} (E_{1i}) and O_{0i} (E_{0i}) are the observed (expected) frequencies of successful and non-successful events for group i, respectively.

vcdExtra::HLtest(fit1)
## Hosmer and Lemeshow Goodness-of-Fit Test
##
## Call:
## glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = "binomial",
## data = menarche)
## ChiSquare df P_value
## 0.1088745 8 0.9999996
15/25
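
That comparison is easiest to judge visually; a sketch using the df1 just created:

ggplot(df1, aes(Age)) +
  geom_point(aes(y = pobs)) +                  # observed proportions
  geom_line(aes(y = pexp), colour = "blue") +  # expected under the fitted model
  labs(y = "Proportion reached menarche")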

Diagnostics for linear models

16/25

Assumptions for linear models

For i \in \{1, ..., n\},

Y_i = \beta_0 + \beta_1x_{i1} + ... + \beta_{k}x_{ik} + \epsilon_i, where \color{red}{\epsilon_i \sim NID(0, \sigma^2)} or in matrix format,

\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \color{red}{\boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \sigma^2 \mathbf{I}_n)}

where
  • `\boldsymbol{Y} = (Y_1, ..., Y_n)^\top`,
  • `\boldsymbol{\beta} = (\beta_0, ..., \beta_k)^\top`,
  • `\boldsymbol{\epsilon} = (\epsilon_1, ..., \epsilon_n)^\top`, and
  • `\mathbf{X} = \begin{bmatrix}\boldsymbol{1}_n & \boldsymbol{x}_1 & ... & \boldsymbol{x}_k \end{bmatrix}`, where
  • `\boldsymbol{x}_j =(x_{1j}, ..., x_{nj})^\top` for `j \in \{1, ..., k\}`

This means that we assume

  1. E(\epsilon_i) = 0 for i \in \{1, ..., n\}.
  2. \epsilon_1, ..., \epsilon_n are independent.
  3. Var(\epsilon_i) = \sigma^2 for i \in \{1, ..., n\} (i.e. homogeneity).
  4. \epsilon_1, ..., \epsilon_n are normally distributed.



So how do we check it?

17/25

Model diagnostics for linear models

  • Plot Y_i vs x_i to see if there is \approx a linear relationship between Y and x.
  • A boxplot of the residuals R_i to check for symmetry.
  • To check the homoscedasticity assumption, plot R_i vs x_i. There should be no obvious patterns.
  • A normal Q-Q plot, i.e. a plot of the ordered residuals vs \Phi^{-1}\left(\frac{i}{n+1}\right).

18/25
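
A minimal sketch of these four checks (assuming numeric vectors x and y; base graphics used for brevity):

fit <- lm(y ~ x)
plot(x, y)                              # (1) linearity of Y vs x
boxplot(resid(fit))                     # (2) symmetry of residuals
plot(x, resid(fit)); abline(h = 0)      # (3) homoscedasticity: no obvious pattern
qqnorm(resid(fit)); qqline(resid(fit))  # (4) normality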

Assessing (A1) E(\epsilon_i) = 0 for i=1,\ldots,n

  • It is a property of the least squares method that \sum_{i=1}^n R_i = 0, \quad \text{so} \quad \bar{R} = 0 for R_i = Y_i - \hat{Y}_i, hence (A1) will always appear valid "overall".
  • A trend in the residuals versus fitted values or a covariate can indicate "local" failure of (A1).
  • What do you conclude from the following plots?

19/25
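
(The plots referred to above are slide images; the following sketch generates a residual plot of the same kind, fitting a straight line to data whose true mean is quadratic so that (A1) fails locally.)

set.seed(1)
x <- runif(100, -2, 2)
y <- x^2 + rnorm(100, sd = 0.5)  # true mean is quadratic
fit <- lm(y ~ x)                 # misspecified straight-line fit
plot(fitted(fit), resid(fit))    # U-shaped trend: local failure of (A1)
abline(h = 0, lty = 2)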

Assessing (A2)-(A3)

(A2) \epsilon_1, \ldots ,\epsilon_n are independent

  • If (A2) is correct, then residuals should appear randomly scattered about zero if plotted against fitted values or covariate.
  • Long sequences of positive residuals followed by sequences of negative residuals in the R_i vs x_i plot suggest that the error terms are not independent.

(A3) Var(\epsilon_i) = \sigma^2 for i=1,\ldots,n

  • If (A3) holds then the spread of the residuals should be roughly the same across the fitted values or covariate.

20/25

Assessing (A4) \epsilon_1, \ldots ,\epsilon_n are normally distributed

Q-Q Plots

  • The function qqnorm(x) produces a Q-Q plot of the ordered vector x against the quantiles of the normal distribution.
  • The n chosen normal quantiles \Phi^{-1}(\frac{i}{n+1}) are easy to calculate but more sophisticated ways exist:
    • \frac{i}{n+1} \mapsto \frac{i-3/8}{n+1/4}, default in qqnorm.
    • \frac{i}{n+1} \mapsto \frac{i-1/3}{n+1/3}, recommended by Hyndman and Fan (1996).

In R

fit <- lm(y ~ x)

By "hand"

plot(qnorm((1:n) / (n + 1)), sort(resid(fit)))

By base

qqnorm(resid(fit))
qqline(resid(fit))

By ggplot2

data.frame(residual = resid(fit)) %>%
  ggplot(aes(sample = residual)) +
  stat_qq() +
  stat_qq_line(color = "blue")

Reference: Hyndman, R. J. and Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50, 361–365.

21/25

Examining simulated data

22/25

Examining simulated data

Simulation scheme

n <- 100
x <- seq(0, 1, length.out = n)
y1 <- x + rnorm(n) / 3                    # Linear
y2 <- 3 * (x - 0.5)^2 +
  c(rnorm(n / 2) / 3, rnorm(n / 2) / 6)   # Quadratic
y3 <- -0.25 * sin(20 * x - 0.2) +
  x + rnorm(n) / 3                        # Non-linear
M1 <- lm(y1 ~ x); M2 <- lm(y2 ~ x); M3 <- lm(y3 ~ x)

22/25
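
A sketch for examining these fits (not on the original slide): plot the residuals against the covariate for each model; only M1 should look patternless.

par(mfrow = c(1, 3))
plot(x, resid(M1), main = "M1: linear");     abline(h = 0, lty = 2)
plot(x, resid(M2), main = "M2: quadratic");  abline(h = 0, lty = 2)  # curvature + unequal spread
plot(x, resid(M3), main = "M3: non-linear"); abline(h = 0, lty = 2)  # sinusoidal pattern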

Take away messages

  • Parametric models assume some distribution in advance
  • Logistic regression can be used to model binary outcomes in terms of explanatory variables
  • You should be able to simulate from parametric models
  • You can perform basic model diagnostics
  • You can use simulation to analyse model properties
23/25

Resources and Acknowledgement

24/25

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 8 - Session 1


25/25
