## ETC5523: Communicating with Data

### Statistical model outputs

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

• emi.tanaka@monash.edu

• Week 5A

Aim

• Extract information from model objects
• Understand and create functions in R
• Understand and apply S3 object-oriented programming in R

Why

• Working with model objects is necessary for you to get the information you need for communication
• These concepts will be helpful later when we start developing R-packages

## 📈 Statistical models

• All models are approximations of the unknown data generating process
• How good of an approximation depends on the collected data and the model choice

🎯 Characterise mpg in terms of wt.

• We fit the model:

$\texttt{mpg}_i = \beta_0 + \beta_1\texttt{wt}_i + e_i$

Parameter details
• $\texttt{mpg}_i$ is the miles per gallon of the $i$-th car,
• $\texttt{wt}_i$ is the weight of the $i$-th car,
• $\beta_0$ is the intercept,
• $\beta_1$ is the slope, and
• $e_i$ is random error, usually assumed $e_i \sim NID(0, \sigma^2)$

## Fitting linear models in R

$\texttt{mpg}_i = \beta_0 + \beta_1\texttt{wt}_i + e_i$

In R we fit this as

fit <- lm(mpg ~ wt, data = mtcars)

which is the same as

fit <- lm(mpg ~ 1 + wt, data = mtcars)
fit

Call:
lm(formula = mpg ~ 1 + wt, data = mtcars)

Coefficients:
(Intercept)           wt
37.285       -5.344  

$\hat{\beta}_0 = 37.285$ and $\hat{\beta}_1 = -5.344$

## ℹ️ Extracting information from the fitted model

• When you fit a model, there would be a number of information you will be interested in extracting from the fit including:
• the model parameter estimates,
• model-related summary statistics, e.g. $R^2$, AIC and BIC,
• model-related values, e.g. residuals, fitted values and predictions.
• So how do you extract these values from the fit?
• What does fit even contain?

## ℹ️ Extracting information from the fitted model

str(fit)
List of 12
$coefficients : Named num [1:2] 37.29 -5.34 ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"$ residuals    : Named num [1:32] -2.28 -0.92 -2.09 1.3 -0.2 ...
..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
$effects : Named num [1:32] -113.65 -29.116 -1.661 1.631 0.111 ... ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "" "" ...$ rank         : int 2
$fitted.values: Named num [1:32] 23.3 21.9 24.9 20.1 18.9 ... ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...$ assign       : int [1:2] 0 1
$qr :List of 5 ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$: chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ... .. .. ..$ : chr [1:2] "(Intercept)" "wt"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$qraux: num [1:2] 1.18 1.05 ..$ pivot: int [1:2] 1 2
..$tol : num 1e-07 ..$ rank : int 2
..- attr(*, "class")= chr "qr"
$df.residual : int 30$ xlevels      : Named list()
$call : language lm(formula = mpg ~ 1 + wt, data = mtcars)$ terms        :Classes 'terms', 'formula'  language mpg ~ 1 + wt
.. ..- attr(*, "variables")= language list(mpg, wt)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$: chr [1:2] "mpg" "wt" .. .. .. ..$ : chr "wt"
.. ..- attr(*, "term.labels")= chr "wt"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(mpg, wt)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
$model :'data.frame': 32 obs. of 2 variables: ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
..$wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ... ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ 1 + wt .. .. ..- attr(*, "variables")= language list(mpg, wt) .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1 .. .. .. ..- attr(*, "dimnames")=List of 2 .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
.. .. .. .. ..$: chr "wt" .. .. ..- attr(*, "term.labels")= chr "wt" .. .. ..- attr(*, "order")= int 1 .. .. ..- attr(*, "intercept")= int 1 .. .. ..- attr(*, "response")= int 1 .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> .. .. ..- attr(*, "predvars")= language list(mpg, wt) .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric" .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt" - attr(*, "class")= chr "lm" ## ℹ️ Extracting information from the fitted model Accessing model parameter estimates: fit$coefficients
(Intercept)          wt
37.285126   -5.344472 
# OR using
coef(fit)
(Intercept)          wt
37.285126   -5.344472 

This gives us the estimates of $\beta_0$ and $\beta_1$.

But what about $\sigma^2$? Recall $e_i \sim NID(0, \sigma^2)$.

sigma(fit)^2
 9.277398

## ℹ️ Extracting information from the fitted model

You can also get a summary of the model object:

summary(fit)

Call:
lm(formula = mpg ~ 1 + wt, data = mtcars)

Residuals:
Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

So how do I extract these summary values out?

## Model objects to tidy data

broom::tidy(fit)
# A tibble: 2 × 5
term        estimate std.error statistic  p.value
<chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.3      1.88      19.9  8.24e-19
2 wt             -5.34     0.559     -9.56 1.29e-10
broom::glance(fit)
# A tibble: 1 × 12
r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
<dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.753        0.745  3.05    91.4 1.29e-10     1  -80.0  166.  170.    278.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
# ℹ Use colnames() to see all variable names
broom::augment(fit)
# A tibble: 32 × 9
.rownames           mpg    wt .fitted .resid   .hat .sigma   .cooksd .std.r…¹
<chr>             <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>    <dbl>
1 Mazda RX4          21    2.62    23.3 -2.28  0.0433   3.07 0.0133     -0.766
2 Mazda RX4 Wag      21    2.88    21.9 -0.920 0.0352   3.09 0.00172    -0.307
3 Datsun 710         22.8  2.32    24.9 -2.09  0.0584   3.07 0.0154     -0.706
4 Hornet 4 Drive     21.4  3.22    20.1  1.30  0.0313   3.09 0.00302     0.433
5 Hornet Sportabout  18.7  3.44    18.9 -0.200 0.0329   3.10 0.0000760  -0.0668
6 Valiant            18.1  3.46    18.8 -0.693 0.0332   3.10 0.000921   -0.231
7 Duster 360         14.3  3.57    18.2 -3.91  0.0354   3.01 0.0313     -1.31
8 Merc 240D          24.4  3.19    20.2  4.16  0.0313   3.00 0.0311      1.39
9 Merc 230           22.8  3.15    20.5  2.35  0.0314   3.07 0.00996     0.784
10 Merc 280           19.2  3.44    18.9  0.300 0.0329   3.10 0.000171    0.100
# … with 22 more rows, and abbreviated variable name ¹​.std.resid
# ℹ Use print(n = ...) to see more rows

But how do these functions work?

# Functions in

Revise about functions at Learn R

## Functions in R

• Functions can be broken into three components:

• formals(), the list of arguments,
• body(), the code inside the function, and
• environment()1.
• Functions in R are created using function() with binding to a name using <- or =

## Functions in R

f1 <- function(x) sum(x) / length(x)
formals(f1)
$x body(f1) sum(x)/length(x) environment(f1) <environment: R_GlobalEnv> ## Function Example 1 f1 <- function(x) sum(x) / length(x) x1 <- c(1, 1, 2, 2) f1(x1)  1.5 What if there are missing values in the vector or the values are dates? x2 <- c(1, 1, 2, 2, NA) f1(x2)  NA x3 <- as.Date(c("2021-08-04", "2021-08-11")) f1(x3) Error in Summary.Date(structure(c(18843, 18850), class = "Date"), na.rm = FALSE): sum not defined for "Date" objects ## Function Example 2 f2 <- function(x, na.rm = TRUE) { n <- sum(!is.na(x)) sum(x, na.rm = na.rm) / n } f2(x1)  1.5 f2(x2)  1.5 f2(x3) Error in Summary.Date(structure(c(18843, 18850), class = "Date"), na.rm = TRUE): sum not defined for "Date" objects ## Function Example 3 f3 <- function(x, na.rm = TRUE) { n <- sum(!is.na(x)) out <- sum(as.numeric(x), na.rm = na.rm) / n if(class(x)=="Date") { return(as.Date(out, origin = "1970-01-01")) } out } f3(x1)  1.5 f3(x2)  1.5 f3(x3)  "2021-08-07" • What about for another object class? x4 <- as.POSIXct(c("2021-08-11 18:00", "2021-08-11 20:00"), tz = "UTC") ## S3 Object oriented programming (OOP) • The S3 system is the most widely used OOP system in R but there are other OOP systems in R, e.g. the S4 system is used for model objects in lme4 R-package, but it will be out of scope for this unit class(x1)  "numeric" class(x2)  "numeric" class(x3)  "Date" class(x4)  "POSIXct" "POSIXt"  ## S3 Object oriented programming (OOP) • Here I create a generic called f4: f4 <- function(x, ...) UseMethod("f4") • And an associated default method: f4.default <- function(x, na.rm = TRUE) { sum(x, na.rm = na.rm) / sum(!is.na(x)) } • And an associated specific method for the Date class: f4.Date <- function(x, na.rm = TRUE) { out <- f4.default(as.numeric(x), na.rm = na.rm) as.Date(out, origin = "1970-01-01") } ## S3 Object oriented programming (OOP) f4(x1)  1.5 f4(x2)  1.5 f4(x3)  "2021-08-07" class(x4)  "POSIXct" "POSIXt"  f4.POSIXct <- function(x, na.rm = TRUE) { out <- f4.default(as.numeric(x), na.rm = na.rm) as.POSIXct(out, tz = attr(x, "tzone"), origin = "1970-01-01") } f4(x4)  "2021-08-11 19:00:00 UTC" ## S3 Object oriented programming (OOP) • A method is created by using the form generic.class. • When using a method for class, you can omit the .class from the function. • E.g. f4(x4) is the same as f4.POSIXct(x4) since the class of x4 is POSIXct (and POSIXt). • But notice f4.numeric doesn’t exist, instead there is f4.default. • default is a special class and when a generic doesn’t have a method for the corresponding class, it falls back to generic.default # Working with model objects in ## Modelling in R • There are many R-packages that fit all kinds of models, e.g. • mgcv fits generalized additive models, • rstanarm fits Bayesian regression models using Stan, and • fable fits forecast models, • many other contributions by the community. • There are a lot of new R-packages contributed — some implementing the latest research results. • This means that if you want to use the state-of-the-art research, then you need to work with model objects beyond the standard lm and glm. ## Example with Bayesian regression library(rstanarm) fit_stan <- stan_lm(mpg ~ 1 + wt, data = mtcars, prior = R2(0.7528, what = "mean"))  SAMPLING FOR MODEL 'lm' NOW (CHAIN 1). Chain 1: Chain 1: Gradient evaluation took 5.2e-05 seconds Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.52 seconds. Chain 1: Adjust your expectations accordingly! Chain 1: Chain 1: Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup) Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup) Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup) Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup) Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup) Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup) Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling) Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling) Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling) Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling) Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling) Chain 1: Iteration: 2000 / 2000 [100%] (Sampling) Chain 1: Chain 1: Elapsed Time: 0.168063 seconds (Warm-up) Chain 1: 0.246159 seconds (Sampling) Chain 1: 0.414222 seconds (Total) Chain 1: SAMPLING FOR MODEL 'lm' NOW (CHAIN 2). Chain 2: Chain 2: Gradient evaluation took 1e-05 seconds Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds. Chain 2: Adjust your expectations accordingly! Chain 2: Chain 2: Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup) Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup) Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup) Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup) Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup) Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup) Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling) Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling) Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling) Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling) Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling) Chain 2: Iteration: 2000 / 2000 [100%] (Sampling) Chain 2: Chain 2: Elapsed Time: 0.183966 seconds (Warm-up) Chain 2: 0.121704 seconds (Sampling) Chain 2: 0.30567 seconds (Total) Chain 2: SAMPLING FOR MODEL 'lm' NOW (CHAIN 3). Chain 3: Chain 3: Gradient evaluation took 1.1e-05 seconds Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds. Chain 3: Adjust your expectations accordingly! Chain 3: Chain 3: Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup) Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup) Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup) Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup) Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup) Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup) Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling) Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling) Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling) Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling) Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling) Chain 3: Iteration: 2000 / 2000 [100%] (Sampling) Chain 3: Chain 3: Elapsed Time: 0.193221 seconds (Warm-up) Chain 3: 0.112252 seconds (Sampling) Chain 3: 0.305473 seconds (Total) Chain 3: SAMPLING FOR MODEL 'lm' NOW (CHAIN 4). Chain 4: Chain 4: Gradient evaluation took 1e-05 seconds Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds. Chain 4: Adjust your expectations accordingly! Chain 4: Chain 4: Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup) Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup) Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup) Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup) Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup) Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup) Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling) Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling) Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling) Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling) Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling) Chain 4: Iteration: 2000 / 2000 [100%] (Sampling) Chain 4: Chain 4: Elapsed Time: 0.131779 seconds (Warm-up) Chain 4: 0.120024 seconds (Sampling) Chain 4: 0.251803 seconds (Total) Chain 4:  broom::tidy(fit_stan) Error in warn_on_stanreg(x): The supplied model object seems to be outputted from the rstanarm package. Tidiers for mixed model output now live in the broom.mixed package. Huh? ## S3 Object classes • So how do you find out the functions that work with model objects? • First notice the class of the object fit: class(fit)  "lm" • The methods associated with this can be found using: methods(class = "lm")   add1 alias anova augment case.names  coerce confint cooks.distance deviance dfbeta  dfbetas drop1 dummy.coef effects extractAIC  family formula fortify glance hatvalues  influence initialize kappa labels logLik  model.frame model.matrix nobs plot predict  print proj qqnorm qr residuals  rstandard rstudent show simulate slotsFromS3  summary tidy variable.names vcov see '?methods' for accessing help and source code Where is coef()? ## Case study broom::tidy class(fit_stan)  "stanreg" "glm" "lm"  broom::tidy function (x, ...) { UseMethod("tidy") } <bytecode: 0x7fcf4cb44ae0> <environment: namespace:generics> There is no tidy.stanreg method so uses the broom:::tidy.glm instead. ## Case study broom::tidy library(broom) tidy.stanreg <- function(x, ...) { est <- x$coefficients
tibble(term = names(est),
estimate = unname(est),
std.error = x\$ses)
}

tidy(fit_stan)
# A tibble: 2 × 3
term        estimate std.error
<chr>          <dbl>     <dbl>
1 (Intercept)    30.4      1.34
2 wt             -3.20     0.347

# Working with model objects

• Is this only for R though?
• How do you work with model objects in general?

## Python

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array(r.mtcars['wt']).reshape(-1, 1)
y = np.array(r.mtcars['mpg'])
model = LinearRegression().fit(x, y)

[model.intercept_, model.coef_]
[37.28512616734204, array([-5.34447157])]

(Intercept)          wt
37.285126   -5.344472 

## Week 5A Lesson

Summary

• Model objects are usually a list returning multiple output from the model fit
• When working with model objects, check the object structure and find the methods associated with it (and of course check the documentation)
• You should be able to work (or at least know how to get started) with all sort of model objects
• We revised how to create functions in R
• We applied S3 object-oriented programming in R 