These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See here for the PDF .
Press the right arrow to progress to the next slide!
Lecturer: Emi Tanaka
ETC5521.Clayton-x@monash.edu
Week 8 - Session 2
We can also define an outlier as a data point that emanates from a different model than do the rest of the data.
We can also define an outlier as a data point that emanates from a different model than do the rest of the data.
Would you consider the yellow points below as outliers?
This is not always easy to detect!
Here, a point labelled P has x_1 and x_2 coordinates well within their respective ranges but P is not close to the observed sample values in 2-dimensional space.
In higher dimensions this type of behaviour is even harder to detect but we need to be on guard against extrapolating to extreme values.
Data collected from advertising experiment to study the impact of three advertising medias (youtube, facebook and newspaper) on sales.
data(marketing, package="datarium")skimr::skim(marketing)
## ── Data Summary ────────────────────────## Values ## Name marketing## Number of rows 200 ## Number of columns 4 ## _______________________ ## Column type frequency: ## numeric 4 ## ________________________ ## Group variables None ## ## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 youtube 0 1 176. 103. 0.84 89.2 180. 263. 356. ▇▆▆▇▆## 2 facebook 0 1 27.9 17.8 0 12.0 27.5 43.8 59.5 ▇▆▆▆▆## 3 newspaper 0 1 36.7 26.1 0.36 15.3 30.9 54.1 137. ▇▆▃▁▁## 4 sales 0 1 16.8 6.26 1.92 12.4 15.5 20.9 32.4 ▁▇▇▅▂
GGally::ggpairs(marketing, progress=F)
broom::augment
..hat
is the leverage value.std.resid
is the studentised residual.cooksd
is the Cook's distancefit <- lm(sales ~ youtube * facebook, data = marketing)(out <- broom::augment(fit))
## # A tibble: 200 × 9## sales youtube facebook .fitted .resid .hat .sigma .cooksd .std.resid## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>## 1 26.5 276. 45.4 26.0 0.496 0.0174 1.13 0.000864 0.442 ## 2 12.5 53.4 47.2 12.8 -0.281 0.0264 1.13 0.000431 -0.252 ## 3 11.2 20.6 55.1 11.1 0.0465 0.0543 1.14 0.0000256 0.0423## 4 22.2 182. 49.6 21.2 1.04 0.0124 1.13 0.00268 0.923 ## 5 15.5 217. 13.0 15.2 0.316 0.0104 1.13 0.000207 0.280 ## 6 8.64 10.4 58.7 10.5 -1.91 0.0709 1.13 0.0583 -1.75 ## 7 14.2 69 39.4 13.0 1.15 0.0149 1.13 0.00395 1.02 ## 8 15.8 144. 23.5 14.6 1.23 0.00577 1.13 0.00173 1.09 ## 9 5.76 10.3 2.52 8.39 -2.63 0.0553 1.12 0.0838 -2.39 ## 10 12.7 240. 3.12 13.4 -0.727 0.0219 1.13 0.00236 -0.649 ## # … with 190 more rows
loess
function in R
span=0.75
) where \lambda is the degree of the local polynomial (default degree=2
) and n is the number of observations.This dataset was produced from US economic time series data available from http://research.stlouisfed.org/fred2.
data(economics, package = "ggplot2")skimr::skim(economics)
## ── Data Summary ────────────────────────## Values ## Name economics## Number of rows 574 ## Number of columns 6 ## _______________________ ## Column type frequency: ## Date 1 ## numeric 5 ## ________________________ ## Group variables None ## ## ── Variable type: Date ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────## skim_variable n_missing complete_rate min max median n_unique## 1 date 0 1 1967-07-01 2015-04-01 1991-05-16 574## ## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 pce 0 1 4820. 3557. 507. 1578. 3937. 7626. 12194. ▇▅▃▂▃## 2 pop 0 1 257160. 36682. 198712 224896 253060 290291. 320402. ▇▇▆▆▇## 3 psavert 0 1 8.57 2.96 2.2 6.4 8.4 11.1 17.3 ▃▇▆▅▁## 4 uempmed 0 1 8.61 4.11 4 6 7.5 9.1 25.2 ▇▃▁▁▁## 5 unemploy 0 1 7771. 2642. 2685 6284 7494 8686. 15352 ▃▇▆▂▁
ggplot(economics, aes(date, uempmed)) + geom_point() + geom_smooth(method = loess, se = FALSE, method.args = list(span = 0.1)) + labs(x = "Date", y = "Median unemployment duration")
The model can be fitted using the loess
function where
fit <- economics %>% mutate(index = 1:n()) %>% loess(uempmed ~ index, data = ., span = 0.75, degree = 2)
The model can be fitted using the loess
function where
fit <- economics %>% mutate(index = 1:n()) %>% loess(uempmed ~ index, data = ., span = 0.75, degree = 2)
In ggplot
, you can add the loess using geom_smooth
with method = loess
and method arguments passed as list:
ggplot(economics, aes(date, uempmed)) + geom_point() + geom_smooth(method = loess, method.args = list(span = 0.75, degree = 2))
Fitting a line to a scatter plot where noisy data values, sparse data points or weak inter-relationships interfere with your ability to see a line of best fit.
Linear regression where least squares fitting doesn't create a line of good fit or is too labour intensive to use.
Fitting a line to a scatter plot where noisy data values, sparse data points or weak inter-relationships interfere with your ability to see a line of best fit.
Linear regression where least squares fitting doesn't create a line of good fit or is too labour intensive to use.
Data exploration and analysis.
Fitting a line to a scatter plot where noisy data values, sparse data points or weak inter-relationships interfere with your ability to see a line of best fit.
Linear regression where least squares fitting doesn't create a line of good fit or is too labour intensive to use.
Data exploration and analysis.
Recall: In a parametric regression, some type of distribution is assumed in advance; therefore fitted model can lead to fitting a smooth curve that misrepresents the data.
Fitting a line to a scatter plot where noisy data values, sparse data points or weak inter-relationships interfere with your ability to see a line of best fit.
Linear regression where least squares fitting doesn't create a line of good fit or is too labour intensive to use.
Data exploration and analysis.
Recall: In a parametric regression, some type of distribution is assumed in advance; therefore fitted model can lead to fitting a smooth curve that misrepresents the data.
In those cases, non-parametric regression may be a better choice.
Fitting a line to a scatter plot where noisy data values, sparse data points or weak inter-relationships interfere with your ability to see a line of best fit.
Linear regression where least squares fitting doesn't create a line of good fit or is too labour intensive to use.
Data exploration and analysis.
Recall: In a parametric regression, some type of distribution is assumed in advance; therefore fitted model can lead to fitting a smooth curve that misrepresents the data.
In those cases, non-parametric regression may be a better choice.
Can you think of where it might be useful?
Data were collected on length (in mm) and the age (in years) of 78 bluegills captured from Lake Mary, Minnesota in 1981.
Which fit looks better?
bg_df <- read.table(here::here("data/bluegills.txt"), header = TRUE)skimr::skim(bg_df)
## ── Data Summary ────────────────────────## Values## Name bg_df ## Number of rows 78 ## Number of columns 2 ## _______________________ ## Column type frequency: ## numeric 2 ## ________________________ ## Group variables None ## ## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 age 0 1 3.63 0.927 1 3 4 4 6 ▂▃▇▂▁## 2 length 0 1 144. 24.1 62 137. 150 160 188 ▁▁▂▇▂
ggplot(bg_df, aes(age, length)) + geom_point() + geom_smooth(method = lm, se = FALSE) + labs(tag = "(A)", title = "Linear regression", x = "Age (in years)", y = "Length (in mm)")ggplot(bg_df, aes(age, length)) + geom_point() + geom_smooth(method = lm, se = FALSE, formula = y ~ poly(x, 2)) + labs(tag = "(B)", title = "Quadratic regression", x = "Age (in years)", y = "Length (in mm)")
Weisberg (1986) A linear model approach to backcalculation of fish length, Journal of the American Statistical Association 81 (196) 922-929
fit1 <- lm(length ~ age, data = bg_df)fit2 <- lm(length ~ poly(age, 2), data = bg_df)df1 <- augment(fit1)df2 <- mutate(augment(fit2), age = bg_df$age)summary(fit1)
## ## Call:## lm(formula = length ~ age, data = bg_df)## ## Residuals:## Min 1Q Median 3Q Max ## -26.523 -7.586 0.258 10.102 20.414 ## ## Coefficients:## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 62.649 5.755 10.89 <2e-16 ***## age 22.312 1.537 14.51 <2e-16 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 12.51 on 76 degrees of freedom## Multiple R-squared: 0.7349, Adjusted R-squared: 0.7314 ## F-statistic: 210.7 on 1 and 76 DF, p-value: < 2.2e-16
summary(fit2)
## ## Call:## lm(formula = length ~ poly(age, 2), data = bg_df)## ## Residuals:## Min 1Q Median 3Q Max ## -19.846 -8.321 -1.137 6.698 22.098 ## ## Coefficients:## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 143.603 1.235 116.290 < 2e-16 ***## poly(age, 2)1 181.565 10.906 16.648 < 2e-16 ***## poly(age, 2)2 -54.517 10.906 -4.999 3.67e-06 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 10.91 on 75 degrees of freedom## Multiple R-squared: 0.8011, Adjusted R-squared: 0.7958 ## F-statistic: 151.1 on 2 and 75 DF, p-value: < 2.2e-16
ggplot(df1, aes(age, .std.resid)) + geom_point() + geom_hline(yintercept = 0) + labs(x = "Age", y = "Residual", tag = "(A)", title = "Linear regression")ggplot(df2, aes(age, .std.resid)) + geom_point() + geom_hline(yintercept = 0) + labs(x = "Age", y = "Residual", tag = "(B)", title = "Quadratic regression")
Weisberg (1986) A linear model approach to backcalculation of fish length, Journal of the American Statistical Association 81 (196) 922-929
The structure is easily visible with the LOESS curve:
fit1 <- lm(length ~ age, data = bg_df)fit2 <- lm(length ~ poly(age, 2), data = bg_df)df1 <- augment(fit1)df2 <- mutate(augment(fit2), age = bg_df$age)summary(fit1)
## ## Call:## lm(formula = length ~ age, data = bg_df)## ## Residuals:## Min 1Q Median 3Q Max ## -26.523 -7.586 0.258 10.102 20.414 ## ## Coefficients:## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 62.649 5.755 10.89 <2e-16 ***## age 22.312 1.537 14.51 <2e-16 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 12.51 on 76 degrees of freedom## Multiple R-squared: 0.7349, Adjusted R-squared: 0.7314 ## F-statistic: 210.7 on 1 and 76 DF, p-value: < 2.2e-16
summary(fit2)
## ## Call:## lm(formula = length ~ poly(age, 2), data = bg_df)## ## Residuals:## Min 1Q Median 3Q Max ## -19.846 -8.321 -1.137 6.698 22.098 ## ## Coefficients:## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 143.603 1.235 116.290 < 2e-16 ***## poly(age, 2)1 181.565 10.906 16.648 < 2e-16 ***## poly(age, 2)2 -54.517 10.906 -4.999 3.67e-06 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 10.91 on 75 degrees of freedom## Multiple R-squared: 0.8011, Adjusted R-squared: 0.7958 ## F-statistic: 151.1 on 2 and 75 DF, p-value: < 2.2e-16
ggplot(df1, aes(age, .std.resid)) + geom_point() + geom_hline(yintercept = 0) + labs(x = "Age", y = "Residual", tag = "(A)", title = "Linear regression") + geom_smooth(method = loess, color = "red", se = FALSE)ggplot(df2, aes(age, .std.resid)) + geom_point() + geom_hline(yintercept = 0) + labs(x = "Age", y = "Residual", tag = "(B)", title = "Quadratic regression") + geom_smooth(method = loess, color = "red", se = FALSE)
Weisberg (1986) A linear model approach to backcalculation of fish length, Journal of the American Statistical Association 81 (196) 922-929
This data contains measurement of soil resistivity of an agricultural field.
data(cleveland.soil, package = "agridat")skimr::skim(cleveland.soil)
## ── Data Summary ────────────────────────## Values ## Name cleveland.soil## Number of rows 8641 ## Number of columns 5 ## _______________________ ## Column type frequency: ## logical 1 ## numeric 4 ## ________________________ ## Group variables None ## ## ── Variable type: logical ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────## skim_variable n_missing complete_rate mean count ## 1 is.ns 0 1 0.242 FAL: 6553, TRU: 2088## ## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 northing 0 1 1.90 1.11 -0.01 0.978 1.81 2.91 3.81 ▆▇▅▇▆## 2 easting 0 1 0.739 0.429 -0.004 0.362 0.729 1.10 1.56 ▆▇▆▆▅## 3 resistivity 0 1 50.9 28.8 0.89 29.6 47.8 71.0 166. ▇▇▅▁▁## 4 track 0 1 16.9 12.4 1 5 14 29 40 ▇▃▂▃▃
ggplot(cleveland.soil, aes(easting, northing)) + geom_point()library(lattice)cloud(resistivity ~ easting * northing, pch = ".", data = cleveland.soil)
library(lattice)xyplot(resistivity ~ northing | equal.count(easting, 12), data = cleveland.soil, cex = 0.2, type = c("p", "smooth"), col.line = "red", col = "gray", lwd = 2)
ggplot2
ggplot2
where the panels have overlapping observations is tricky.easting
:ggplot(cleveland.soil, aes(northing, resistivity)) + geom_point(color = "gray") + geom_smooth(method = "loess", color = "red", se = FALSE) + facet_wrap(~ cut_number(easting, 12))
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Lecturer: Emi Tanaka
ETC5521.Clayton-x@monash.edu
Week 8 - Session 2
Lecturer: Emi Tanaka
ETC5521.Clayton-x@monash.edu
Week 8 - Session 2
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |