ETC3250/5250: Introduction to Machine Learning

Toyota car price

Predict the Toyota car price from this used car listing data.

Code

library(tidyverse)
toyota <- read_csv("https://emitanaka.org/iml/data/toyota.csv")

Medical insurance costs

Predict the insurance charges given customer characteristics from this data.

Code

insurance <- read_csv("https://emitanaka.org/iml/data/insurance.csv")

Breast cancer diagnosis

Diagnose (diagnosis) a breast mass sample as malignant (M) or benign (B) from the features of its image, using the Wisconsin breast cancer data set.

Code

cancer <- read_csv("https://emitanaka.org/iml/data/cancer.csv") %>% 
  mutate(diagnosis_malignant = ifelse(diagnosis=="M", 1, 0),
         diagnosis = factor(diagnosis, levels = c("B", "M"))) %>% 
  janitor::clean_names()

Titanic survival


Predict whether a Titanic passenger survived from their class, sex and age.

Code

titanic <- datasets::Titanic %>% 
  as.data.frame() %>% 
  pivot_wider(id_cols = c(Class, Sex, Age), 
              names_from = Survived, 
              values_from = Freq, 
              names_prefix = "Survived_")
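The wide layout puts the counts of non-survivors and survivors side by side for each Class/Sex/Age combination, which is the shape glm() accepts for a binomial response. The following is a minimal sketch, assuming a main-effects logistic regression is the model of interest (the slides do not name a model); the object names titanic_wide and fit are illustrative only.

```r
library(tidyr)  # pivot_wider() and the %>% pipe

# Reshape the 4-way contingency table to one row per Class/Sex/Age group
titanic_wide <- datasets::Titanic %>%
  as.data.frame() %>%
  pivot_wider(id_cols = c(Class, Sex, Age),
              names_from = Survived,
              values_from = Freq,
              names_prefix = "Survived_")

# Assumed model: logistic regression on grouped counts.
# cbind(successes, failures) is the two-column binomial response form.
fit <- glm(cbind(Survived_Yes, Survived_No) ~ Class + Sex + Age,
           family = binomial,
           data = titanic_wide)

coef(fit)
```

Groups with no passengers (for example, children among the crew) carry zero weight in the fit, so they do not need to be dropped first.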

Digit recognition with MNIST database

Predict the digit 0-9 (label) from a $28\times 28$ (784-pixel) image.

Code

dat_mnist <- dslabs::read_mnist()
mnist <- dat_mnist$train$images %>% 
  as.data.frame() %>% 
  mutate(label = as.factor(dat_mnist$train$label))

Bank marketing data

Predict whether a client will subscribe to a term deposit (y) based on data from the direct marketing campaigns of a Portuguese banking institution.

Code

bank <- read_delim("https://emitanaka.org/iml/data/bank-full.csv", delim = ";")

Sales of insurance policy

Predict whether a customer will purchase a caravan insurance policy (Purchase) from this data.

Code

caravan <- read_csv("https://emitanaka.org/iml/data/caravan.csv") %>% 
  mutate(across(-Purchase, scale),
         Purchase = factor(Purchase))
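The across(-Purchase, scale) step standardises every predictor to mean 0 and standard deviation 1, which matters for distance-based methods where variables on large scales would otherwise dominate. The sketch below illustrates this on made-up data (the tibble toy and its columns x1 and x2 are hypothetical). Because scale() returns a one-column matrix, the sketch wraps it in as.numeric() to keep plain numeric columns; that wrapper is a suggestion, not part of the original code.

```r
library(dplyr)

# Hypothetical toy data: two predictors on very different scales
toy <- tibble(x1 = c(1, 2, 3, 4),
              x2 = c(10, 20, 30, 40),
              Purchase = c("No", "No", "Yes", "No"))

toy_scaled <- toy %>%
  mutate(across(-Purchase, ~ as.numeric(scale(.x))),  # centre and scale each predictor
         Purchase = factor(Purchase))

# After scaling, each predictor has mean 0 and standard deviation 1
colMeans(toy_scaled[c("x1", "x2")])
sapply(toy_scaled[c("x1", "x2")], sd)
```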

Customer survey data

Learn about customer personalities from customer survey data.

Code

library(lubridate) # year() and dmy(); attached automatically only from tidyverse 2.0.0
marketing <- read_tsv("https://emitanaka.org/iml/data/marketing_campaign.csv")

marketing_clean <- marketing %>% 
  mutate(Marital_Status = fct_collapse(
    Marital_Status,
    "1" = c("Absurd", "Alone", "Divorced", "Single",
            "Widow", "YOLO"),
    "2" = c("Married", "Together")
  ), 
  Marital_Status = as.numeric(as.character(Marital_Status)),
  Education = fct_collapse(
    Education,
    "3" = c("2n Cycle", "Master", "PhD"),
    "1" = "Basic",
    "2" = "Graduation"
  ),
  Education = as.numeric(as.character(Education)),
  Income = case_when(is.na(Income) ~ 0,
                     Income == 666666 ~ 0,
                     TRUE ~ Income),
  Dt_Customer = year(dmy(Dt_Customer)) - 2011) %>% 
  filter(Year_Birth >= 1940) %>% 
  mutate(Age = 2015 - Year_Birth) %>% 
  select(-Year_Birth)
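The cleaning step recodes Marital_Status and Education by collapsing factor levels into labels such as "1" and "2", then converting through as.character() before as.numeric(). The intermediate as.character() is what makes the numbers match the labels: as.numeric() applied directly to a factor returns its internal level codes instead. A small sketch on a made-up vector (status is hypothetical) shows the difference.

```r
library(forcats)

# Hypothetical marital status values
status <- factor(c("Married", "Single", "Together", "Widow"))

collapsed <- fct_collapse(
  status,
  "1" = c("Single", "Widow"),
  "2" = c("Married", "Together")
)

# Internal level codes: these depend on level order, not on the labels
codes <- as.numeric(collapsed)

# Labels converted to numbers: this is what the cleaning code relies on
values <- as.numeric(as.character(collapsed))

levels(collapsed)
codes
values
```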

Yale face database

Profile the faces from the Yale Face Database (Belhumeur et al. 1997).

Code

yalefaces <- read_csv("https://emitanaka.org/iml/data/yalefaces.csv")

Wine quality data

Investigate wine quality using several features of the wines in this data.

Code

wine <- read_csv("https://emitanaka.org/iml/data/winequalityN.csv")

Fashion MNIST data

Predict the clothing type (labelled 0-9) from a $28\times 28$ (784-pixel) image using the Fashion MNIST data by Zalando SE.

Code

fashion <- read_csv("https://emitanaka.org/iml/data/fashion-mnist_train.csv.gz")
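The labels 0-9 are easier to read once mapped to the clothing categories documented for Fashion MNIST. The sketch below applies that mapping to a made-up tibble; it assumes the label column is called label, as in the common CSV distribution of the data, which should be checked against the downloaded file. The names fashion_classes and toy_fashion are illustrative.

```r
library(dplyr)

# Class names for labels 0-9, as documented for Fashion MNIST
fashion_classes <- c("T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                     "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot")

# Hypothetical stand-in for the label column of the real data
toy_fashion <- tibble(label = c(9, 0, 5))

toy_fashion <- toy_fashion %>%
  mutate(label_name = factor(label, levels = 0:9, labels = fashion_classes))

toy_fashion
```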

ETC3250/5250

Wrap-up

Featured datasets

Toyota car price

Medical insurance costs

Breast cancer diagnosis

Titanic survival

Digit recognition with MNIST database

Bank marketing data

Sales of insurance policy

Customer survey data

Yale face database

Wine quality data

Fashion MNIST data

Toolkit

Supervised learning

Unsupervised learning

Workflow

Other toolkits

Critical thinking

Problem identification

This unit and beyond

Learning more

Exam information