## ETC3250/5250

Introduction to Machine Learning

### Non-parametric regression

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

## Regression models

• Regression models propose y_i = f(x_{i1}, x_{i2}, ..., x_{ip}) + e_i where the goal is to estimate the function f.
• Methods for estimation:
  • parametric - assumes the model takes a specific functional form
  • non-parametric - makes few or no assumptions about the functional form
  • semi-parametric - (not covered in this unit) combines parametric and non-parametric methods

## Why non-parametric regression?

• In a parametric regression, a functional form for f (and often an error distribution) is assumed in advance.
• If the assumed form is wrong, the fitted parametric model can be a smooth curve that misrepresents the data.
• Non-parametric regression works well for fitting a smooth curve to a scatter plot where noisy data values, sparse data points or weak inter-relationships interfere with your ability to see a line of best fit.
• A drawback of non-parametric regression is that it does not produce a simple functional form for the fitted model.

## Non-parametric regression methods

• Some methods:
  • local regression: a sliding window with a regression fitted to each subset of the data
  • step functions: cut the range of a predictor into distinct regions
  • regression splines: combine polynomials and step functions by fitting polynomials to different subsets of a predictor's range
• These methods offer a lot of flexibility, while maintaining the ease and interpretability of linear models.

## US economic time series

This dataset is available from http://research.stlouisfed.org/fred2.

library(tidyverse)
ggplot(economics, aes(date, uempmed)) +
  geom_point() +
  labs(x = "Date", y = "Median unemployment duration")

## A parametric approach

• The curve below is a fit of the polynomial model of order 27:

\color{#006DAE}{y_i = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i1}^2 + \cdots + \beta_{27}x_{i1}^{27} + e_i}

ggplot(economics, aes(date, uempmed)) +
  geom_point() +
  geom_smooth(method = stats::lm,
              formula = y ~ poly(x, 27),
              color = "#006DAE") +
  labs(x = "Date", y = "Median unemployment\nduration")

# Local regression

## Local regression

• LOESS (LOcal regrESSion) and LOWESS (LOcally WEighted Scatterplot Smoothing) are non-parametric regression methods (LOESS is a generalisation of LOWESS).
• LOESS fits a weighted low order polynomial model to a subset of neighbouring data.
• A user specified “bandwidth”, “smoothing parameter” or “span” \alpha determines how much of the data is used to fit each local polynomial model.
• A large \alpha produces a smoother fit.
• A small \alpha overfits the data, with the fitted curve capturing the random error in the data.
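
The effect of the span can be checked numerically. The following is a minimal sketch on simulated data (the sine signal, noise level and the two span values are assumptions chosen for illustration, not part of the unit's data):

```r
# Simulated data: a smooth signal plus noise (illustrative assumption)
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

# Fit LOESS with a small and a large span
fit_small <- loess(y ~ x, span = 0.1, degree = 2)
fit_large <- loess(y ~ x, span = 0.9, degree = 2)

# Equivalent number of parameters: larger means a more flexible, wigglier fit
c(small = fit_small$enp, large = fit_large$enp)

# Training residual sum of squares: the small span follows the noise more closely
rss_small <- sum(residuals(fit_small)^2)
rss_large <- sum(residuals(fit_large)^2)
c(small = rss_small, large = rss_large)
```

A lower training error for the small span does not mean a better model; it is the overfitting described above.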

## How to fit LOESS curves in R?

The model can be fitted using the loess function where

• the default span is 0.75 and
• the default local polynomial degree is 2.

fit <- loess(uempmed ~ as.numeric(date),
             data = economics,
             span = 0.75,
             degree = 2)
fit

Call:
loess(formula = uempmed ~ as.numeric(date), data = economics,
span = 0.75, degree = 2)

Number of Observations: 574
Equivalent Number of Parameters: 4.34
Residual Standard Error: 2.291 
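
A fitted loess object can also be used for prediction. The sketch below uses the built-in `cars` data as a stand-in (an assumption, so that the example runs without extra packages) and shows that, with the default settings, predictions outside the observed range of the predictor are returned as NA:

```r
# `cars` ships with base R; it stands in for the economics data here
fit_cars <- loess(dist ~ speed, data = cars, span = 0.75, degree = 2)

# Observed speeds run from 4 to 25, so 30 lies outside the data
new_speed <- data.frame(speed = c(10, 20, 30))
pred <- predict(fit_cars, newdata = new_speed, se = TRUE)

pred$fit     # predictions; NA where the new value is outside the observed range
pred$se.fit  # pointwise standard errors
```

Extrapolation is possible by fitting with `control = loess.control(surface = "direct")`, although local fits far from the data should be treated with caution.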

## Showing LOESS on the plot in R

In ggplot2, you can add a LOESS curve using geom_smooth with method = loess, passing the LOESS arguments as a list via method.args:

ggplot(economics, aes(date, uempmed)) +
  geom_point() +
  geom_smooth(method = loess,
              method.args = list(span = 0.75,
                                 degree = 2))

# Step functions

## Step functions

• The idea of a step function is to cut up the range of a predictor into distinct regions.
• This essentially converts a continuous predictor into an ordered categorical variable.
• We don’t normally use this idea alone!

ggplot(economics, aes(date, uempmed)) +
  geom_point() +
  geom_smooth(method = stats::lm,
              formula = y ~ cut_number(x, 4))
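
To see what a step-function fit amounts to, here is a minimal sketch on simulated data (the data and the quantile-based bins are assumptions; the bins mimic the spirit of `cut_number(x, 4)` using base R only). Regressing on the binned predictor gives a piecewise constant fit whose value in each region is the mean response in that region:

```r
# Simulated data (illustrative assumption)
set.seed(5)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# Four bins with roughly equal counts
breaks <- quantile(x, probs = seq(0, 1, by = 0.25))
region <- cut(x, breaks = breaks, include.lowest = TRUE)

# Linear model with the ordered categories as the only predictor
fit_step <- lm(y ~ region)

# The fitted value within each region is the mean of y in that region
region_means <- tapply(y, region, mean)
region_means
head(cbind(fitted = fitted(fit_step), region_mean = region_means[as.integer(region)]))
```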

# Regression splines

## Splines (mechanical)

• A wooden or metal strip that fits into another part of a machine to make it rotate.

## Splines (drawing)

• A long, thin strip of wood or metal used to draw smooth curves.

## Splines (mathematical)

• A spline is a piecewise polynomial function where
  • each piece corresponds to a disjoint subinterval that makes up the range of the variable, and
  • adjacent pieces take the same value at the subinterval boundaries.
• The boundaries of the subintervals are called knots.
• The smoothness of a spline comes from adjacent polynomial pieces sharing common derivative values up to a certain order at the knots.
• The simplest spline consists of step functions (but a step function is not necessarily a spline).
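
As a small illustration of the definition, the sketch below fits a linear spline with a single knot using a truncated term `pmax(x - knot, 0)`. The simulated data, the knot location and the true slopes are all assumptions made for the example. The two linear pieces meet at the knot, but the slope is allowed to change there:

```r
# Simulated data with a kink at x = 5 (illustrative assumption)
set.seed(2)
x <- seq(0, 10, length.out = 200)
knot <- 5
y <- ifelse(x < knot, 1 + 0.5 * x, 3.5 - 0.8 * (x - knot)) + rnorm(200, sd = 0.2)

# The truncated term is zero to the left of the knot and linear to the right,
# so the fitted function is continuous at the knot by construction
fit_lin <- lm(y ~ x + pmax(x - knot, 0))
b <- unname(coef(fit_lin))

left_slope  <- b[2]         # slope of the piece before the knot
right_slope <- b[2] + b[3]  # slope of the piece after the knot
c(left = left_slope, right = right_slope)
```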

## Basis spline (B-spline)

• Any spline of a given degree on a given set of knots can be written as a linear combination of B-splines of that degree on the same knots.
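
A quick numerical check of this idea, as a sketch on simulated data (the data, the single knot at 5 and the comparison basis are assumptions): a cubic spline fitted with a B-spline basis gives the same fitted values as one fitted with a truncated power basis on the same knot, because both bases span the same set of functions.

```r
library(splines)

# Simulated data (illustrative assumption)
set.seed(3)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)
knot <- 5

# Cubic B-spline basis with one interior knot, including the constant term
B <- bs(x, knots = knot, degree = 3, intercept = TRUE)
dim(B)             # one column per basis function
range(rowSums(B))  # the basis functions sum to 1 at every x

# Same spline space written two ways
fit_bs <- lm(y ~ B - 1)
fit_tp <- lm(y ~ x + I(x^2) + I(x^3) + I(pmax(x - knot, 0)^3))

# Largest difference between the two sets of fitted values (numerically zero)
max_diff <- max(abs(fitted(fit_bs) - fitted(fit_tp)))
max_diff
```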

## Fitting B-splines with R

• In R, we can get the basis splines using splines::bs.
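• For splines::bs, the degrees of freedom (df) equal the number of interior knots plus the degree of the spline (3 by default); for natural splines (splines::ns), df is the number of interior knots plus one.

ggplot(economics, aes(date, uempmed)) +
  geom_point() +
  geom_smooth(method = stats::lm,
              formula = y ~ splines::bs(x, df = 20))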
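
The basis that `splines::bs` builds can be inspected directly. A minimal sketch, assuming an evenly spaced predictor for illustration: with the default cubic degree and no intercept column, `df = 20` yields 20 basis functions and 17 interior knots placed at quantiles of x.

```r
library(splines)

# Illustrative predictor (assumption)
x <- seq(0, 10, length.out = 200)

basis <- bs(x, df = 20)                  # cubic by default, no intercept column
n_basis <- ncol(basis)                   # number of basis functions
n_knots <- length(attr(basis, "knots"))  # number of interior knots

c(basis_functions = n_basis, interior_knots = n_knots)
```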

## Natural cubic splines

• A natural cubic spline is a spline of degree 3 whose second derivative is zero at the boundary knots, so the function is linear beyond the boundaries.
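
The linearity beyond the boundaries can be checked numerically. In this sketch (simulated data and `df = 5` are assumptions), predictions at equally spaced points outside the observed range have second differences of essentially zero, which is what a straight line gives:

```r
library(splines)

# Simulated data (illustrative assumption)
set.seed(4)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

fit_ns <- lm(y ~ ns(x, df = 5))

# Equally spaced points beyond the largest observed x
x_out <- data.frame(x = seq(11, 15, by = 1))
pred_out <- predict(fit_ns, newdata = x_out)

# Second differences of a linear function are zero
second_diff <- diff(pred_out, differences = 2)
second_diff
```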

## More knots result in overfitting

map_dfr(c(0, 2, 3, 8, 14, 48),
        ~ economics %>%
          mutate(
            nknot = .x,
            nknot_label = paste(nknot, "knots") %>% fct_inorder()
          )) %>%
  ggplot(aes(date, uempmed, nknot = nknot)) +
  geom_point() +
  geom_smooth(
    method = lm,
    formula = y ~ splines::ns(x, df = nknot + 1),
    se = FALSE,
    colour = "orangered3"
  ) +
  facet_wrap(~ nknot_label) +
  ggtitle("Natural cubic splines")
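
One way to put numbers on overfitting is to compare the error on the data used for fitting with the error on held-out data. The sketch below is only an illustration: the simulated data and the odd/even hold-out split are assumptions and not part of the analysis above, while the knot counts match those in the plot.

```r
library(splines)

# Simulated data (illustrative assumption)
set.seed(6)
n <- 200
dat <- data.frame(x = seq(0, 10, length.out = n))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

train <- seq(1, n, by = 2)  # odd rows are used for fitting
test  <- seq(2, n, by = 2)  # even rows are held out

nknots <- c(0, 2, 3, 8, 14, 48)
errors <- t(sapply(nknots, function(k) {
  # Natural cubic spline with k interior knots (df = k + 1)
  fit <- lm(y ~ ns(x, df = k + 1), data = dat[train, ])
  pred <- predict(fit, newdata = dat[test, ])
  c(nknot = k,
    train_mse = mean(residuals(fit)^2),
    test_mse  = mean((dat$y[test] - pred)^2))
}))
errors
```

The training error falls as knots are added, whereas the held-out error need not; a gap between the two columns is a sign of overfitting.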

# Takeaways

• Non-parametric regression is useful in data exploration and analysis, although tuning parameters (such as the span or the number of knots) must be chosen carefully to avoid overfitting the data.