Basic Statistical Concepts and Programming II

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University

Summary Statistics for Bivariate Data

Bivariate data

Bivariate data involves two different variables ($x$ and $y$) for each observation.

Suppose $x_i$ and $y_i$ are two variables measured on the same unit $i$ for $i = 1, 2, \ldots, n$.
Bivariate data allows us to explore relationships and associations between the two variables.
Examples:
- Height and weight of individuals.
- Temperature and ice cream sales.
- Study time and exam scores.
- Eye color and hair color.

	Categorical	Numerical
Categorical	Contingency table, stacked barplot, percent stacked barplot, side-by-side barplot	Group summary statistics
Numerical	Group summary statistics	Scatterplot, covariance, correlation coefficient

Contingency table

A contingency table (also known as a cross-tabulation or crosstab) displays the frequency distribution of two or more categorical variables.

What do you notice between these two approaches?
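As a minimal sketch on made-up data (the data frame `people` and its values are hypothetical, not from the slides), one possible pair of approaches in base R is table() and xtabs():

```r
# Hypothetical data: eye and hair colour for six people (made up for illustration)
people <- data.frame(
  eye  = c("brown", "blue", "brown", "green", "blue", "brown"),
  hair = c("black", "blond", "brown", "brown", "blond", "black")
)

# Approach 1: table() applied to two vectors
tab1 <- table(people$eye, people$hair)

# Approach 2: xtabs() with a formula, which keeps the variable names as labels
tab2 <- xtabs(~ eye + hair, data = people)

tab1
tab2
```

Both hold the same counts; they differ mainly in how the dimensions are labelled.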

Stacked barplot

A stacked barplot is used to compare the composition of different groups in a dataset, especially the contribution of sub-categories to the total within each main category.

Percent stacked barplot

A percent stacked barplot is ideal for comparing the relative frequencies of subgroups within categories, rather than their absolute counts.

Side-by-side barplot

A side-by-side barplot (also called a grouped barplot) is used to visually compare the values of different subgroups across categories.

Group summary statistics

For bivariate data where one variable is numerical and the other is categorical, you can compute univariate summary statistics for each group.

For example, numerical statistics for each group can be computed as:
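A minimal base R sketch, assuming the built-in iris data as a stand-in (it is not the dataset used in the slides), could look like this:

```r
# Mean and standard deviation of sepal length for each species (built-in iris data)
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_sds   <- tapply(iris$Sepal.Length, iris$Species, sd)
group_means
group_sds

# aggregate() returns the same group means as a data frame
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
```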

Graphical statistics by group

Likewise, we can compute graphical statistics for each group.

A beeswarm plot is a type of scatterplot that shows the distribution of data points while avoiding overlap, making it easier to visualize the density and spread of the data.

Case study 🌾 Wheat seed morphological characteristics

A dataset was collected to investigate morphological characteristics associated with seed weight in a line of diploid wheat (Triticum monococcum).

DSeed - identifier for each seed
Weight - weight of seed (mg)
Length - length of seed (mm)
Diameter - diameter of seed (mm)
Moisture - moisture content of seed (as a percentage)
Hardness - endosperm hardness

Scatterplot

A scatterplot is a graphical representation that displays the relationship between two numerical variables by plotting individual data points on a two-dimensional graph.

Sample covariance

Sample covariance is a measure of how much two numerical variables change together.

\[ s_{xy}=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right) \]

Interpretation:
- When $s_{xy} > 0$, the variables tend to increase together.
- When $s_{xy} < 0$, one variable tends to increase when the other decreases.
- When $s_{xy} = 0$, there is no linear relationship between the variables.

Consider the following dataset with two variables, $x$ and $y$:

$i$	$x$	$y$
1	1	10
2	2	70
3	3	100

\[s_{xy} = \frac{1}{2}\left[(1-2)(10-60)+(2-2)(70-60)+(3-2)(100-60)\right]=45\]
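The hand calculation above can be checked in R. This sketch uses the three observations from the table and compares the formula with the built-in cov() function:

```r
x <- c(1, 2, 3)
y <- c(10, 70, 100)
n <- length(x)

# Sample covariance computed directly from the formula
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
s_xy

# The built-in function gives the same value
cov(x, y)
```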

But the magnitude of covariance is not easy to interpret since it depends on the units of the variables.

Pearson’s correlation coefficient

The correlation coefficient is a normalised version of covariance.
The sample Pearson correlation coefficient, denoted as $r$, is a measure of the strength of a linear relationship between two variables ($x$ and $y$).

\[r = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2\sum_{i=1}^n(y_i - \bar{y})^2}}\]

The correlation coefficient ranges from -1 to 1.
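As a small sketch, the formula can be applied to the three observations from the covariance example and compared with the built-in cor() function. The rescaling by 100 is an arbitrary choice to mimic a change of units:

```r
x <- c(1, 2, 3)
y <- c(10, 70, 100)

# Pearson correlation computed directly from the formula
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual
cor(x, y)

# Rescaling x (e.g. changing its units) changes the covariance ...
cov(100 * x, y)
# ... but leaves the correlation unchanged
cor(100 * x, y)
```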

Interpretation of correlation coefficient

The sign of the correlation coefficient indicates the direction of the relationship.
The magnitude indicates the strength of the linear relationship.

$|r|$	Interpretation
0.8 - 1.0	Very strong association
0.6 - 0.8	Strong association
0.4 - 0.6	Moderate association
0.2 - 0.4	Weak association
0.0 - 0.2	Very weak association

(Interactive demo: scatterplot of simulated data with an adjustable number of samples, from 20 to 1000, and an adjustable correlation coefficient, from -1 to 1.)

Wrong interpretation of correlation coefficient

Source: xkcd

Just because $x$ and $y$ are highly correlated, it does not mean that $x$ causes $y$ or vice versa – correlation is not causation!

A classic example is the correlation between the number of ice cream sales and the rate of drowning deaths.
It is also easy to get spurious correlation if computing many pairwise correlations.

Correlation also only measures a linear relationship, so low correlation doesn’t mean that there is no relationship.

\[r = 0.1628057\]
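As a separate illustration (these are not the data behind the value above), a perfectly quadratic relationship on a symmetric grid of made-up x values has essentially zero correlation, even though y is completely determined by x:

```r
# Hypothetical data: y depends on x exactly, but not linearly
x <- seq(-3, 3, by = 0.5)
y <- x^2

# The correlation is (numerically) zero
r_quadratic <- cor(x, y)
r_quadratic
```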

Summary statistics can be misleading

You can have bivariate datasets with the exact same:
- marginal mean,
- marginal variance and
- correlation,
but the relationship between the two variables can be very different.
Always plot your data!
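One well-known illustration is Anscombe's quartet, which ships with R as the `anscombe` data frame. The slides may use a different dataset, so treat this as a swapped-in example. The sketch computes the summary statistics for each of the four x-y pairs:

```r
# anscombe is in the datasets package: columns x1..x4 and y1..y4
idx <- 1:4
summary_tab <- data.frame(
  mean_x = sapply(idx, function(i) mean(anscombe[[paste0("x", i)]])),
  var_x  = sapply(idx, function(i) var(anscombe[[paste0("x", i)]])),
  mean_y = sapply(idx, function(i) mean(anscombe[[paste0("y", i)]])),
  var_y  = sapply(idx, function(i) var(anscombe[[paste0("y", i)]])),
  r      = sapply(idx, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)

# The four rows are nearly identical after rounding
round(summary_tab, 2)
```

Plotting each pair, e.g. with plot(anscombe$x2, anscombe$y2), shows how different the four relationships are.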

Summary

	Categorical	Numerical
Categorical	Contingency table, stacked barplot, percent stacked barplot, side-by-side barplot	Group summary statistics
Numerical	Group summary statistics	Scatterplot, covariance, correlation coefficient

Correlation $\neq$ Causation
Always plot the data!

R Objects

Using R as a calculator

$e^{3 + 4}$
$e^{3 + 4} + \frac{1}{3}(1 + 3 + 5)$
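In R, these two expressions could be typed as follows, where exp() is the exponential function:

```r
# e^(3 + 4)
exp(3 + 4)

# e^(3 + 4) + (1/3)(1 + 3 + 5)
exp(3 + 4) + (1 / 3) * (1 + 3 + 5)
```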

But we want to save results to reuse later!

Assignment

You can assign values to objects using <- or = or even ->
Just be consistent which one you use!
The name of the object can be anything so long as it is syntactically valid (no spaces, almost no special characters, and the name cannot start with a digit).
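A short sketch of the three assignment forms, using made-up object names (`a`, `b`, `c_value` and `total` are only illustrative):

```r
a <- 3          # leftward assignment
b = 4           # equals sign also assigns
5 -> c_value    # rightward assignment

# Saved objects can be reused in later computations
total <- a + b + c_value
total

# A name such as 2x is not syntactically valid, so `2x <- 1` would be an error
```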

Vectors

We can combine scalars to form vectors using c():

This is a vector of length 3

This vector is stored as a double with the class as numeric
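The statements above can be checked with a small sketch. The vector `v` and its values are made up for illustration:

```r
v <- c(1, 2, 3)   # combine three scalars into one vector

length(v)   # 3
typeof(v)   # "double"
class(v)    # "numeric"
```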

Vector types

There are four primary types of atomic vectors: logical, integer, double and character.

If a logical value is coerced to numeric or integer, then
- TRUE is 1 and
- FALSE is 0.
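As a quick sketch, typeof() reports the type of each kind of atomic vector, and the coercion of logical values makes it possible to count TRUEs with sum(). The vector `flags` is a made-up example:

```r
typeof(TRUE)   # "logical"
typeof(1L)     # "integer"
typeof(1.5)    # "double"
typeof("a")    # "character"

# Logical values coerced to numbers
as.numeric(c(TRUE, FALSE))   # 1 0

# Counting how many values are TRUE
flags <- c(TRUE, FALSE, TRUE, TRUE)
sum(flags)   # 3
```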

Vector coercion

All elements of an atomic vector must be of the same type.
If you attempt to combine mismatched types together, it will try to coerce all values to the same type.

There are functions to explicitly coerce types, e.g., as.numeric() tries to coerce input to numeric value.
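A minimal sketch of implicit and explicit coercion, with made-up values:

```r
# Implicit coercion: values are converted to the most flexible type present
mixed1 <- c(TRUE, 2L)    # logical + integer   -> integer
mixed2 <- c(1L, 2.5)     # integer + double    -> double
mixed3 <- c(1, "a")      # double  + character -> character
typeof(mixed1)
typeof(mixed2)
typeof(mixed3)

# Explicit coercion
as.numeric("3.14")                  # 3.14
suppressWarnings(as.numeric("abc")) # NA (R also warns without suppressWarnings())
```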

Factor

A factor in R is a special type of integer vector used typically to encode categorical variables.
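For instance, a hypothetical `size` variable could be stored as a factor with an explicit level order:

```r
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"))

levels(size)       # "small" "medium" "large"
typeof(size)       # "integer"
as.integer(size)   # 1 3 2 1 (position of each value among the levels)
table(size)        # counts per category, in level order
```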

Lists

Lists allow you to combine elements of different types.

You can use str() to see the internal structure of an object in R.
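A small sketch with a made-up list called `person`:

```r
# A list can hold a character, a number and a numeric vector side by side
person <- list(name = "Alex", age = 30, scores = c(90, 85))

# str() prints a compact description of each element
str(person)
```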

Data frames

data.frame is a special type of named list where each element is a vector of the same length.

tibble is a Tidyverse version of data.frame in R.

It is still a data.frame, so all functions that work with data.frame objects will also work with tibble objects.
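The list structure of a data.frame can be seen in a short base R sketch. The data frame `df` is made up, and the tibble side is not shown here:

```r
df <- data.frame(id = 1:3, group = c("a", "b", "a"))

is.list(df)          # TRUE: a data.frame is built on a list
class(df)            # "data.frame"
sapply(df, length)   # every column has the same length
nrow(df)
ncol(df)
```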

Subsetting vectors Part 1

A vector can be subsetted using integers in [].

Positive integers select elements at the specified positions:

Negative integers exclude elements at the specified positions:
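Both cases can be sketched on a made-up vector `v`:

```r
v <- c(10, 20, 30, 40, 50)

# Positive integers select
v[2]          # 20
v[c(1, 3)]    # 10 30

# Negative integers exclude
v[-1]         # 20 30 40 50
v[-c(1, 2)]   # 30 40 50
```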

Subsetting vectors Part 2

Logical vectors in [] select elements where the logical value is TRUE.

If the logical vector used for subsetting is shorter than the vector being subsetted, the logical vector is recycled to match its length.
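A brief sketch of logical subsetting and recycling on a made-up vector `v`:

```r
v <- c(10, 20, 30, 40, 50, 60)

# A logical condition produces a logical vector of the same length
v[v > 25]            # 30 40 50 60

# c(TRUE, FALSE) is recycled to TRUE FALSE TRUE FALSE TRUE FALSE
v[c(TRUE, FALSE)]    # 10 30 50
```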

Subsetting named vectors

Character vectors select elements based on the name of the vector (if any):
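For example, with a made-up named vector `heights`:

```r
heights <- c(alice = 160, bob = 175, carol = 168)

heights["bob"]                  # the element named "bob"
heights[c("carol", "alice")]    # elements returned in the order requested
```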

Subsetting lists

Lists can be subsetted using integers in [] or names with $ or [[ ]].
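A sketch of the difference between these forms, using a made-up list `person`. Note that [] returns a list, whereas $ and [[ ]] return the element itself:

```r
person <- list(name = "Alex", age = 30, scores = c(90, 85))

person[1]          # a list of length 1 containing the name
person[["age"]]    # the element itself: 30
person$scores      # the element itself: 90 85
person[[3]][2]     # second value of the third element: 85
```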

Subsetting data frames

A data.frame can be subsetted using integers in [ , ] or names with $ or [[ ]].
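A minimal sketch on a made-up data frame `df`, where the first index in [ , ] refers to rows and the second to columns:

```r
df <- data.frame(
  id    = 1:4,
  score = c(55, 72, 90, 64),
  grade = c("P", "C", "HD", "P")
)

df[2, ]                                  # second row, all columns
df[, "score"]                            # one column as a vector
df[df$score > 60, c("id", "score")]      # rows by condition, columns by name
df$grade                                 # column by name with $
df[["score"]]                            # column by name with [[ ]]
```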

Missing values

NA in R denotes a missing value – there are in fact different types of missing values (NA_character_, NA_integer_, NA_real_ and NA_complex_ in base R, plus NA_Date_ and NA_POSIXct_ from the lubridate package).

When there are missing values, they can cause issues in computations.

Below we remove the missing values:
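A small sketch of the problem and two common remedies, using a made-up vector `x`:

```r
x <- c(2, NA, 5, 7)

mean(x)                  # NA: the missing value propagates
mean(x, na.rm = TRUE)    # missing values dropped inside the function

# Removing missing values explicitly
x_complete <- x[!is.na(x)]
x_complete               # 2 5 7

sum(is.na(x))            # number of missing values: 1
```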

Summary

Four primary types of atomic vectors: logical, integer, double and character.
All elements of an atomic vector must be of the same type.
Other object types: factor, list, and data.frame.
There are several ways of subsetting vectors and lists.
Missing values are represented as NA and may need to be handled specially.

Base R Cheatsheet

Data Wrangling with Tidyverse

Tidyverse

Tidyverse refers to a collection of R packages that share a common (opinionated) design philosophy, grammar and data structure.

This trains your mental model to do data science tasks in a manner which may make it easier, faster, and/or fun for you to do these tasks.
library(tidyverse) is a shorthand for loading the 9 core tidyverse packages: ggplot2, dplyr, tidyr, readr, tibble, purrr, stringr, forcats, lubridate.

A grammar of data manipulation

dplyr is a core package in tidyverse.
It provides a grammar of data manipulation that is consistent with the tidyverse design philosophy.
Similar data manipulation can be achieved with base R but dplyr provides a more consistent and user-friendly interface for data manipulation tasks.
The earlier concept of dplyr (first on CRAN on 2014-01-29) was implemented in plyr (first on CRAN on 2008-10-08).
The functions in dplyr have been evolving, but dplyr v1.0.0 was released on CRAN on 2020-05-29, suggesting that the functions in dplyr are maturing and thus the user interface is unlikely to change.

Lifecycle

Functions (and sometimes arguments of functions) in tidyverse packages are often labelled with a lifecycle badge.

Lionel Henry (2020). lifecycle: Manage the Life Cycle of your Package Functions. R package version 0.2.0.

`dplyr` “verbs”

The main functions of dplyr include:

arrange
select
mutate

rename
filter
summarise

Notice that these functions are verbs.

Functions in dplyr generally have the form:

verb(data, args)

The first argument data is a data.frame object.

Let’s use the tips data from the GGally package to illustrate some of the dplyr functions.

What do you think the following will do?

Pipe operator

Almost all the tidyverse packages use the pipe operator %>% from the magrittr package.
R version 4.1.0 introduced a native pipe operator |> which is similar to %>% but with some differences.
x |> f(y) is the same as f(x, y).
x |> f(y) |> g(z) is the same as g(f(x, y), z).
When you see the pipe operator, read it as “and then”.
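The equivalences above can be checked with the native pipe, which needs R 4.1.0 or later. The vector `x` is a made-up example:

```r
x <- c(1, 4, 9, 16)

# "Take x, and then take square roots, and then sum"
piped  <- x |> sqrt() |> sum()
nested <- sum(sqrt(x))
piped
nested

# x |> f(y) is the same as f(x, y)
rounded_piped  <- c(1.234, 5.678) |> round(1)
rounded_nested <- round(c(1.234, 5.678), 1)
```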

Tidyselect

Tidyverse packages generally use syntax from the tidyselect package for column selection.
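A few selection helpers, sketched with dplyr's select() on the built-in iris data. This assumes dplyr is installed, and iris is a stand-in rather than the dataset used in the slides:

```r
library(dplyr)

# Select by name and by prefix
sel_prefix <- iris |> select(Species, starts_with("Sepal"))

# Select columns that satisfy a predicate
sel_numeric <- iris |> select(where(is.numeric))

# Select a range of adjacent columns, or everything except one column
sel_range <- iris |> select(Sepal.Length:Petal.Length)
sel_not   <- iris |> select(!Species)

names(sel_prefix)
names(sel_numeric)
```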

Selection language

The selection language in tidyselect can be found in the documentation:

`tibble` objects

tibble is a modern reimagining of the data.frame object.

Subsetting by column via Tidyverse

What’s the difference between these?

Subsetting by row via Tidyverse

What is happening here?

See also filter_out() for filtering out rows by condition for dplyr v1.2.0 or greater.

Adding or modifying a column via Tidyverse

You can add new columns or modify existing columns using mutate().
For conditional modification, you can use ifelse() or case_when().
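A minimal sketch, assuming dplyr is installed. The tibble `scores`, its columns and the cut-offs are hypothetical:

```r
library(dplyr)

scores <- tibble(student = c("a", "b", "c", "d"),
                 mark    = c(45, 62, 78, 91))

scores2 <- scores |>
  mutate(
    mark_prop = mark / 100,                      # new column from an existing one
    passed    = ifelse(mark >= 50, "yes", "no"), # two-way condition
    band      = case_when(                       # multi-way condition, first match wins
      mark >= 80 ~ "high",
      mark >= 50 ~ "middle",
      TRUE       ~ "low"
    )
  )
scores2
```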

Also see recode_values(), replace_values(), and replace_when() for dplyr v1.2.0 or greater.

Sorting columns via Tidyverse

You can use select() along with everything() to reorder columns.

Similarly, you can use relocate() to move columns around.
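Both approaches can be sketched on the built-in iris data, assuming dplyr is installed:

```r
library(dplyr)

# Move Species to the front, keeping all other columns after it
front_select   <- iris |> select(Species, everything())
front_relocate <- iris |> relocate(Species)

# relocate() can also place a column relative to another one
after_first <- iris |> relocate(Species, .after = Sepal.Length)

names(front_select)
names(after_first)
```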

Sorting rows via Tidyverse

Calculating statistical summaries by group via Tidyverse

The summarise() function allows you to calculate statistical summaries by group.
The n() function is a special function that counts the number of observations in each group (note: it only works within certain Tidyverse functions, such as summarise() and mutate()).
We recommend using the .by argument for group operations.
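A short sketch of a grouped summary on the built-in iris data. It assumes dplyr 1.1.0 or later for the .by argument, and the output column names are made up:

```r
library(dplyr)

by_species <- iris |>
  summarise(
    n       = n(),                  # number of rows in each group
    mean_sl = mean(Sepal.Length),
    sd_sl   = sd(Sepal.Length),
    .by     = Species               # one output row per species
  )
by_species
```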

Applying a function to multiple columns via Tidyverse

The across() function allows you to apply functions to multiple columns.
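As a sketch on the built-in iris data (assuming dplyr 1.1.0 or later for .by), across() takes a column selection and one or more functions:

```r
library(dplyr)

# Mean of every numeric column, by species
means_by_species <- iris |>
  summarise(across(where(is.numeric), mean), .by = Species)

# Several functions at once: output columns are named <column>_<function>
sepal_stats <- iris |>
  summarise(across(starts_with("Sepal"), list(mean = mean, sd = sd)),
            .by = Species)

means_by_species
names(sepal_stats)
```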

Summary

Tidyverse packages share a common design philosophy, grammar and data structure.
This can train a mental model that applies across the many packages that adopt the Tidyverse design philosophy.
The core package dplyr provides a grammar of data manipulation.
The main functions in dplyr are verbs that take a data.frame (or tibble) as the first argument.
Combined with the pipe operator, these verbs can make code easier for humans to read and write.

`dplyr` cheatsheet

Debugging

Basic troubleshooting

Whether you are good at programming or not, you will inevitably encounter errors.

If you encounter an error,
1. Read the error message!
2. Google the error message or ask generative AI (like ChatGPT)
3. Ask for help with a reproducible example

Reproducible Example with `reprex` LIVE DEMO

Copy your minimal reproducible example, then run:

reprex::reprex(venue = "html")

Once you run the above command, your clipboard contains the formatted code and output, ready to paste into places like a Canvas discussion board.
