ETC3250/5250

Introduction to Machine Learning

Non-parametric regression

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

Regression models

  • Regression models propose y_i = f(x_{i1}, x_{i2}, ..., x_{ip}) + e_i where the goal is to estimate the function f.
  • Methods for estimation:
    • parametric - assumes f takes a specific functional form (e.g. linear in the predictors)
    • non-parametric - makes no, or less specific, assumptions about the functional form of f
    • semi-parametric - (not covered in this unit) combines parametric and non-parametric methods

Why non-parametric regression?

  • In a parametric regression, the functional form of f (and typically an error distribution) is assumed in advance.
  • Therefore, a fitted parametric regression model can produce a smooth curve that misrepresents the data when the assumed form is wrong.
  • Non-parametric regression works well for fitting a curve to a scatter plot where noisy data values, sparse data points or weak relationships make a line of best fit hard to see.
  • A drawback of non-parametric regression is that it does not produce a simple functional form for the fitted model.
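
A minimal sketch of the contrast above, using simulated data (the data, seed and object names are assumptions for illustration, not from the slides). The parametric fit is summarised by a short coefficient vector, while the LOESS fit stores no coefficients and is only available through its predictions.

```r
# Simulated data: an assumption for illustration only
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

# Parametric: a straight line, fully described by two coefficients
fit_lm <- lm(y ~ x)
coef(fit_lm)

# Non-parametric: the loess object has no coefficient vector to report
fit_lo <- loess(y ~ x)
coef(fit_lo)

# ...but the fitted curve can still be evaluated at new x values
new_x <- data.frame(x = c(2.5, 5, 7.5))
pred_lo <- predict(fit_lo, newdata = new_x)
pred_lo
```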

Non-parametric regression methods

  • Some methods:
    • local regression: sliding window with regression fitted to subsets
    • step functions: cut the range of a predictor into distinct regions
    • regression splines: fit piecewise polynomials to different subsets of a predictor, joined smoothly at the boundaries.
  • These methods offer a lot of flexibility, while maintaining the ease and interpretability of linear models.

US economic time series

This dataset is available in R as ggplot2::economics; the source data are from http://research.stlouisfed.org/fred2.

Code
library(tidyverse)
ggplot(economics, aes(date, uempmed)) + 
  geom_point() +
  labs(x = "Date", y = "Median unemployment duration")

A parametric approach

  • The curve below is a fit of the polynomial model of order 27:

\color{#006DAE}{y_i = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i1}^2 + \cdots + \beta_{27}x_{i1}^{27} + e_i}

Code
ggplot(economics, aes(date, uempmed)) + 
  geom_point() +
  geom_smooth(method = stats::lm, 
              formula = y ~ poly(x, 27),
              color = "#006DAE") +
  labs(x = "Date", y = "Median unemployment\nduration")
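
The same kind of model can be fitted directly with lm. The sketch below uses simulated data and a degree of 10 rather than 27 (both are assumptions chosen to keep the example small and numerically tame); poly() builds orthogonal polynomial columns, which are better conditioned than raw powers of x.

```r
# Simulated data: an assumption for illustration only
set.seed(2)
x <- seq(0, 1, length.out = 300)
y <- sin(6 * x) + rnorm(300, sd = 0.2)

# A low-order and a higher-order polynomial fit
fit_p3  <- lm(y ~ poly(x, 3))
fit_p10 <- lm(y ~ poly(x, 10))

# One coefficient per polynomial term, plus the intercept
length(coef(fit_p10))

# The degree-10 model nests the degree-3 model, so its
# residual sum of squares on the training data cannot be larger
c(rss_3 = deviance(fit_p3), rss_10 = deviance(fit_p10))
```

A lower training error does not by itself mean a better model; a very high order polynomial can chase noise and behave erratically near the ends of the data.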

Local regression

  • LOESS (LOcal regrESSion) and LOWESS (LOcally WEighted Scatterplot Smoothing) are non-parametric regression methods (LOESS is a generalisation of LOWESS).
  • LOESS fits a weighted low order polynomial model to a subset of neighbouring data.
  • A user-specified “bandwidth”, “smoothing parameter” or “span” \alpha determines how much of the data is used to fit each local polynomial model.
  • A large \alpha produces a smoother fit.
  • Small \alpha overfits the data with the fitted regression capturing the random error in the data.

How to fit LOESS curves in R?

The model can be fitted using the loess function where

  • the default span is 0.75 and
  • the default local polynomial degree is 2.
fit <- loess(uempmed ~ as.numeric(date),
             data = economics, 
             span = 0.75, 
             degree = 2) 
fit
Call:
loess(formula = uempmed ~ as.numeric(date), data = economics, 
    span = 0.75, degree = 2)

Number of Observations: 574 
Equivalent Number of Parameters: 4.34 
Residual Standard Error: 2.291 
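
Once fitted, the model can be evaluated at new dates with predict. This is a sketch under the assumption that the economics data come from ggplot2; the three dates are arbitrary choices inside the observed range (by default loess does not extrapolate outside it).

```r
library(ggplot2)  # only needed here for the economics data

fit <- loess(uempmed ~ as.numeric(date),
             data = economics,
             span = 0.75,
             degree = 2)

# Arbitrary dates within the range of the data
new_dates <- data.frame(
  date = as.Date(c("1980-01-01", "2000-01-01", "2010-01-01"))
)

# se = TRUE also returns pointwise standard errors of the fit
pred <- predict(fit, newdata = new_dates, se = TRUE)
data.frame(date = new_dates$date,
           fit  = pred$fit,
           se   = pred$se.fit)
```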

Showing LOESS on the plot in R

In ggplot2, you can add the LOESS curve using geom_smooth with method = loess, passing the arguments for loess as a list in method.args:

ggplot(economics, aes(date, uempmed)) +
  geom_point() + 
  geom_smooth(method = loess, 
              method.args = list(span = 0.75, 
                                 degree = 2)) 

How span changes the loess fit
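
One way to see the effect of the span numerically is to refit with several values and compare the equivalent number of parameters, which loess reports as a measure of flexibility. The data and the span values below are assumptions for illustration.

```r
# Simulated data: an assumption for illustration only
set.seed(3)
x <- seq(0, 10, length.out = 300)
y <- sin(x) + rnorm(300, sd = 0.4)

spans <- c(0.1, 0.3, 0.75, 1)
fits <- lapply(spans, function(s) loess(y ~ x, span = s, degree = 2))

span_summary <- data.frame(
  span = spans,
  enp  = sapply(fits, function(f) f$enp),  # equivalent number of parameters
  rse  = sapply(fits, function(f) f$s)     # residual standard error
)
span_summary
```

Smaller spans use fewer neighbours per local fit, so the curve is more flexible and the equivalent number of parameters is larger.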

How loess works
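
The core idea can be sketched by hand: for each target point, take the nearest fraction \alpha of the data, weight those points by distance, and fit a weighted low-order polynomial. The helper names tricube and local_fit are hypothetical, and the tricube weight is the conventional choice for LOESS; R's loess adds further details (such as interpolation between fitted points), so this sketch is not expected to reproduce its output exactly.

```r
# Tricube weight: 1 at distance 0, decreasing to 0 at scaled distance 1
tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Local polynomial fit at a single target point x0 (hypothetical helper)
local_fit <- function(x0, x, y, span = 0.75, degree = 2) {
  q <- floor(span * length(x))  # number of neighbours in the window
  d <- abs(x - x0)              # distance of each point from the target
  h <- sort(d)[q]               # window half-width: q-th smallest distance
  w <- tricube(d / h)           # points outside the window get weight 0
  fit <- lm(y ~ poly(x - x0, degree, raw = TRUE), weights = w)
  unname(coef(fit)[1])          # x is centred at x0, so the intercept is the fit at x0
}

# Simulated data: an assumption for illustration only
set.seed(8)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

# Slide the window along a grid of target points
grid <- seq(0, 10, by = 0.5)
smooth <- sapply(grid, local_fit, x = x, y = y, span = 0.3)
head(data.frame(x0 = grid, fitted = smooth))
```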

Step functions

  • The idea of a step function is to cut up the range of a predictor into distinct regions.
  • This essentially converts a continuous predictor into an ordered categorical variable.
  • We don’t normally use this idea alone!
ggplot(economics, aes(date, uempmed)) +
  geom_point() + 
  geom_smooth(method = stats::lm,
              formula = y ~ cut_number(x, 4)) 
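
Outside ggplot2, a comparable fit can be sketched in base R by cutting the predictor at its quartiles and regressing on the resulting factor. The data are simulated, and the quartile-based cut is only an approximation to what cut_number(x, 4) does. The fitted value in each region is simply the mean response in that region.

```r
# Simulated data: an assumption for illustration only
set.seed(4)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# Four regions with roughly equal counts, using quartiles as break points
breaks <- quantile(x, probs = seq(0, 1, length.out = 5))
bin <- cut(x, breaks = breaks, include.lowest = TRUE)
table(bin)

# Regressing on the factor gives one constant level per region
fit_step <- lm(y ~ bin)

# The fitted values coincide with the within-region means of y
max(abs(fitted(fit_step) - ave(y, bin)))
```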

Regression splines

Splines (mechanical)

  • A wooden or metal strip that fits into another part of a machine to make it rotate.

Splines (drawing)

  • A long, thin strip of wood or metal used to draw smooth curves.

Splines (mathematical)

  • A spline is a piecewise polynomial function where
    • each piece corresponds to a disjoint subinterval that makes up the range of the variable, and
    • adjacent pieces take the same value at the subinterval boundaries, so the function is continuous (piecewise-constant splines of degree 0 are the exception).
  • The boundaries of the subintervals are called knots.
  • The smoothness of a spline comes from adjacent polynomial pieces sharing common derivative values up to a certain order at the knots.
  • The simplest spline is made up of step functions (but a step function is not necessarily a spline).
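
As a concrete instance of the definition, a cubic spline with knots \xi_1 < \xi_2 < \cdots < \xi_K can be written in the truncated power form below. The symbols \xi_k, \theta_k and K are notation introduced here for illustration and do not appear elsewhere in the slides.

```latex
f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3
       + \sum_{k=1}^{K} \theta_k (x - \xi_k)_+^3,
\qquad
(u)_+ = \max(u, 0)
```

Each truncated term (x - \xi_k)_+^3 is zero to the left of its knot and has continuous first and second derivatives at \xi_k, so the pieces join smoothly and only the third derivative can jump at a knot. This form has K + 4 coefficients.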

Basis spline (B-spline)

  • Any spline of a given degree on a given set of knots can be written as a linear combination of B-splines of that degree on those knots.

Breakdown of B-spline regression
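
A B-spline regression is an ordinary linear regression on the columns of a basis matrix. The sketch below, on simulated data (an assumption for illustration), builds the basis explicitly with splines::bs and checks that regressing on it gives the same fitted values as putting bs() in the formula.

```r
library(splines)

# Simulated data: an assumption for illustration only
set.seed(5)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

# Basis matrix: one row per observation, one column per basis function
B <- bs(x, df = 6, intercept = TRUE)
dim(B)

# With intercept = TRUE the basis functions sum to one at every x
range(rowSums(B))

# Regression on the basis columns (no extra intercept needed) ...
fit_manual <- lm(y ~ B - 1)

# ... spans the same functions as bs() without its intercept column
# plus the intercept that lm adds
fit_formula <- lm(y ~ bs(x, df = 5))
all.equal(unname(fitted(fit_manual)), unname(fitted(fit_formula)))
```

The fitted curve is the weighted sum of the basis columns, with the regression coefficients as the weights.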

Fitting B-splines with R

  • In R, we can get the basis splines using splines::bs.
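  • For splines::bs (without an intercept), the degrees of freedom (df) is the number of interior knots plus the degree of the polynomial (3 by default), so df = 20 gives 17 interior knots.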
ggplot(economics, aes(date, uempmed)) +
  geom_point() + 
  geom_smooth(method = stats::lm,
              formula = y ~ splines::bs(x, df = 20)) 
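
A quick check of how df translates into interior knots, assuming the defaults of splines::bs (cubic, no intercept column) and splines::ns. The x values are arbitrary.

```r
library(splines)

x <- seq(0, 1, length.out = 100)

# Cubic B-spline basis: interior knots = df - degree
knots_bs <- attr(bs(x, df = 20), "knots")
length(knots_bs)   # 17

# Natural cubic spline basis: interior knots = df - 1
knots_ns <- attr(ns(x, df = 20), "knots")
length(knots_ns)   # 19
```

In both cases the knots are placed at quantiles of x when only df is supplied.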

Natural cubic splines

  • A natural cubic spline is a spline of degree 3 whose second derivative is zero at the boundary knots, so the fitted function is linear beyond the boundaries.

Breakdown of natural cubic spline regression
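
The linearity beyond the boundary knots can be checked numerically. In this sketch (simulated data, an assumption for illustration) a natural cubic spline is fitted with splines::ns, then predicted at equally spaced points past the largest observed x; for a linear function the second differences of those predictions are zero.

```r
library(splines)

# Simulated data: an assumption for illustration only
set.seed(6)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

fit_ns <- lm(y ~ ns(x, df = 5))

# Equally spaced points beyond the right-hand boundary knot, max(x)
x_out <- data.frame(x = max(x) + 1:5)
p_out <- predict(fit_ns, newdata = x_out)

# Second differences are numerically zero when the curve is linear
second_diff <- diff(p_out, differences = 2)
second_diff
```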

More knots results in overfitting

Code
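# Stack one copy of the data per knot count so each facet gets its own fit
map_dfr(c(0, 2, 3, 8, 14, 48),
        ~ economics %>%
          mutate(
            nknot = .x,
            nknot_label = paste(nknot, "knots") %>% fct_inorder()
          )) %>%
  # nknot is mapped as an extra aesthetic so it is available to the formula
  ggplot(aes(date, uempmed, nknot = nknot)) +
  geom_point() +
  geom_smooth(
    method = lm,
    # nknot is constant within a facet; ns() expects a single df value
    formula = y ~ splines::ns(x, df = nknot[1] + 1),
    se = FALSE,
    colour = "orangered3"
  ) +
  facet_wrap(~ nknot_label) +
  ggtitle("Natural cubic splines")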
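
Overfitting can also be assessed numerically by comparing the error on the data used for fitting with the error on held-out data. The sketch below uses simulated data and a simple every-third-observation split (both assumptions for illustration) with the same knot counts as in the plot.

```r
library(splines)

# Simulated data: an assumption for illustration only
set.seed(7)
n <- 300
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.4)

# Hold out every third observation
test_idx <- seq(3, n, by = 3)
train <- data.frame(x = x[-test_idx], y = y[-test_idx])
test  <- data.frame(x = x[test_idx],  y = y[test_idx])

nknots <- c(0, 2, 3, 8, 14, 48)
errors <- t(sapply(nknots, function(k) {
  fit <- lm(y ~ ns(x, df = k + 1), data = train)
  c(nknot     = k,
    train_mse = mean(residuals(fit)^2),
    test_mse  = mean((test$y - predict(fit, newdata = test))^2))
}))
errors
```

The training error can be driven down by adding knots, whereas the held-out error would typically stop improving, or get worse, once the spline starts to follow the noise.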

Takeaways

  • Non-parametric regression is useful in data exploration and analysis, although tuning parameters (such as the span or the number of knots) must be chosen carefully so as not to overfit the data.