Summary statistics for univariate data

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University


R package datasets `tips` data

R comes with many built-in datasets that are helpful for learning and practicing data analysis.
Use the data() function to see available datasets and to load them.
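A minimal sketch of the point above, assuming only a base R installation: calling `data()` with a `package` argument lists the datasets that package ships, and calling it with a dataset name loads that dataset. `mtcars` is used here purely as an illustrative built-in dataset.

```r
# List the datasets that ship with the base `datasets` package.
available <- data(package = "datasets")
head(available$results[, c("Item", "Title")])

# Load one of them into the workspace by name (illustrative choice).
data("mtcars")
dim(mtcars)  # number of rows and number of columns
```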

The tips data from the GGally package records the tips one waiter received at a single restaurant.

Initial data analysis

When given a dataset, start with exploring the data.
The tidyverse package is useful for this purpose (we will discuss this more later).

What is the sample size?
What is the observational unit?
Which of the variables are categorical data? Which ones are numerical data?
Classify the categorical variables as ordinal or nominal.
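The questions above can usually be answered with a few base R calls. The sketch below uses the built-in `warpbreaks` dataset only as a stand-in (an assumption made so the example runs without extra packages); the same calls should apply to `tips` once it is loaded.

```r
# `warpbreaks` is a built-in dataset used here only for illustration.
data("warpbreaks")

n_obs <- nrow(warpbreaks)               # sample size (one row per observational unit)
var_types <- sapply(warpbreaks, class)  # storage type of each variable

n_obs
var_types
str(warpbreaks)             # compact overview: types and first few values
levels(warpbreaks$tension)  # category labels of a factor, in stored order
```

Variables stored as `factor` are categorical; whether a factor is nominal or ordinal is a judgement about what the categories mean, not something R decides for you.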

Statistical summary for univariate data

A statistical summary (or descriptive statistics) provides key numerical and graphical measures that concisely describe the main characteristics of a dataset.

Measures of Central Tendency

Mean (average)
Median
Mode

Measures of Dispersion (Spread)

Range
Variance
Standard deviation
Interquartile range (IQR)

Tabular Summaries

Frequency tables
Contingency tables (cross-tabulations)

Graphical Summaries

Histograms
Boxplots
Bar charts
Scatterplots
Etc

Categorical variables

Categorical data (or variables), also called qualitative data, come in two types:

Nominal
- no ordering or relationship
- e.g. marital status, eye color, job, degree, race
Ordinal
- have a distinct ordering
- e.g.,
  - ranking teacher as “poor/fair/good”,
  - survey answer “strongly disagree/disagree/agree/strongly agree”

Numerical variables can be transformed to or captured as ordinal variables, e.g.

income brackets: [0, 1000), [1000, 2000), [2000, 3000), 3000+,
age ranges: [0, 18], (18, 30], (30, 50), [50, 75), 75+.
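One way to turn a numerical variable into an ordinal one in R is `cut()`. The sketch below bins some made-up incomes (hypothetical values, not from any dataset) into the income brackets listed above; the label strings are chosen for this example.

```r
# Hypothetical incomes, for illustration only.
income <- c(250, 1000, 1999, 3000, 5200)

bracket <- cut(
  income,
  breaks = c(0, 1000, 2000, 3000, Inf),
  labels = c("[0, 1000)", "[1000, 2000)", "[2000, 3000)", "3000+"),
  right = FALSE,          # intervals are closed on the left: [a, b)
  ordered_result = TRUE   # keep the ordering of the brackets
)

bracket
table(bracket)  # counts per bracket, shown in bracket order
```

Setting `right = FALSE` makes a value such as 1000 fall in [1000, 2000) rather than in the bracket below it.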

Numerical summary for a categorical variable

Useful numerical summaries include:

Frequency (counts) of each category
Relative frequency (proportion or percentage) of each category
Mode: the most frequently occurring category
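These three summaries can be sketched in base R with `table()` and `prop.table()`. The vector of days below is made up for illustration and is not taken from the tips data.

```r
# Hypothetical categorical observations, for illustration only.
days <- c("Sat", "Sun", "Sat", "Thur", "Fri", "Sat", "Sun")

counts <- table(days)        # frequency of each category
props <- prop.table(counts)  # relative frequency (proportions sum to 1)

# Mode: the category with the largest count
# (if there are ties, which.max() returns only the first).
mode_day <- names(counts)[which.max(counts)]

counts
round(100 * props, 1)  # relative frequency as percentages
mode_day
```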

Graphical summary for a categorical variable

We use the ggplot2 package for all data visualisation (taught in more detail later).

Bar charts / bar plots

Pie charts (avoid using)

Nominal vs. Ordinal variables

We can use exactly the same summaries for ordinal data as for nominal data, e.g., frequency tables, bar charts, pie charts, etc.

For ordinal data, preserve the order of the categories.
For nominal data, reorder the categories based on another variable (if appropriate).

Plot 1

Code

ggplot(tips) +
  geom_bar(aes(y = day))

Plot 2

Code

tips |> 
  mutate(day = reorder(day, day, length)) |> 
  ggplot() +
    geom_bar(aes(y = day))

Plot 3

Code

day_order <- rev(c("Thur", "Fri", "Sat", "Sun"))
tips |> 
  mutate(day = factor(day, levels = day_order)) |> 
  ggplot() +
    geom_bar(aes(y = day))

How do these plots differ?

Numerical variables

There are two main types of numerical data:

Continuous
- measured in infinitely small increments
- e.g. height, weight, portfolio returns, and stock prices
Discrete
- measured in fixed increments
- e.g. number of cars you own, and number of heads in three coin flips

Some variables are continuous, but measured in a discrete manner, e.g. age (in years).

Graphical summary for numerical variable

For discrete data, we can use a barplot to visualise the distribution.

For continuous data, we can use a histogram to visualise the distribution.

Histogram


The bin width (equivalently, the number of bins) affects the histogram's appearance, so explore different values to see how the plot changes.
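To see the effect of bin width without a plot, the sketch below bins the same data twice using base R's `hist()` with `plot = FALSE`. The built-in `faithful` dataset and the two bin widths are illustrative choices, not part of the original slides; in ggplot2 the corresponding control is the `binwidth` argument of `geom_histogram()`.

```r
# Eruption durations from the built-in `faithful` dataset (illustrative choice).
x <- faithful$eruptions

# Same data, two bin widths; plot = FALSE returns the binning without drawing.
wide <- hist(x, breaks = seq(1.5, 5.5, by = 0.5), plot = FALSE)
narrow <- hist(x, breaks = seq(1.5, 5.5, by = 0.1), plot = FALSE)

length(wide$counts)    # number of bins with width 0.5
length(narrow$counts)  # number of bins with width 0.1
wide$counts            # counts per bin for the coarser histogram
```

Dropping `plot = FALSE` draws the two histograms, which makes the difference in apparent shape easy to compare.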

A measure of central tendency

A measure of central tendency is a location of the “middle”, “center”, or “expected value” of the distribution of your data.

Sample mean (or average) and median are examples of measures of central tendency
What is the average customer tip?

Sample mean and median

The sample mean or average is:

\[\bar{x} = \frac{1}{n}(x_1+x_2 +\dots + x_n) = \frac{1}{n}\sum_{i=1}^nx_i.\]

The sample median is:

the middle value of the sorted observations when \(n\) is odd, and
average of the two middle sorted observations when \(n\) is even.

Sample data: \[54, 71, 57, 70, 53\]

The (sample) mean is \[(54 + 71 + 57 + 70 + 53)/5 = 61.\]

Sorted sample data: \[53, 54, 57, 70, 71\]

So (sample) median is \(57\).

The mean is commonly used
But the median is more robust to extreme observations (outliers).
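The worked example above can be checked in R, and the robustness claim illustrated by corrupting one value. The corrupted value 710 is a made-up data-entry error introduced for this sketch.

```r
x <- c(54, 71, 57, 70, 53)

mean(x)    # (54 + 71 + 57 + 70 + 53) / 5
median(x)  # middle value of the sorted data

# Replace 71 with a hypothetical data-entry error, 710.
x_outlier <- c(54, 710, 57, 70, 53)

mean(x_outlier)    # pulled far upwards by the single extreme value
median(x_outlier)  # unchanged
```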

Skewness

Skewness is a measure of asymmetry in a given distribution

Symmetric

Mean \(\approx\) Median

Positively skewed or
Right skewed

Mean > Median

Negatively skewed or
Left skewed

Mean < Median
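A small made-up sample with a long right tail illustrates the comparison; negating it gives the mirrored, left-skewed case. Treat the mean versus median comparison as a rule of thumb rather than a guarantee, since it can fail for some distributions.

```r
# Hypothetical right-skewed sample: most values are small, one is large.
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)
mean(right_skewed)    # pulled towards the long right tail
median(right_skewed)

# Mirroring the data produces a long left tail instead.
left_skewed <- -right_skewed
mean(left_skewed)
median(left_skewed)
```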

Modality

The sample mode is the value with the highest frequency.

Mode is useful for categorical data.
For numerical data, mode is less useful as there may be no repeated values.
However, we can look at the modality of a distribution: number of peaks in the distribution.

Unimodal distribution

Bimodal distribution

Multimodal distribution

Quantiles

A \(p\)-quantile (where \(0 < p < 1\)) is the value below which a proportion \(p\) of your data lie.

Note: quantiles do not need to be data values.
Quartiles are special quantiles that divide the data into four equal parts:
- First quartile (\(Q_1\)) or lower quartile is the 0.25 quantile
- Second quartile (\(Q_2\)) or median is the 0.50 quantile
- Third quartile (\(Q_3\)) or upper quartile is the 0.75 quantile
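In R, `quantile()` computes sample quantiles. The sketch below applies it to the small sample used for the mean and median. Several definitions of a sample quantile exist (R offers nine via the `type` argument), so other software or hand methods may give slightly different values.

```r
x <- c(54, 71, 57, 70, 53)

# Quartiles under R's default definition (type = 7).
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# Quantiles need not be data values: the 0.1 quantile is interpolated.
quantile(x, probs = 0.1)
```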

A measure of dispersion

A measure of dispersion/spread is a number representing the spread of data around a measure of central tendency.

E.g. range, interquartile range (IQR), variance, standard deviation.

Measures of dispersion

Sample deviation: the signed difference between an observation and the sample mean, \(x_i-\bar{x}\)
Sample variance: \[s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2.\]
Sample standard deviation: the square root of the sample variance, \(s = \sqrt{s^2}\)
- Conveys similar information to the variance, but is expressed in the same units as the data
The range is the difference between the maximum and minimum values in the dataset.
The interquartile range (IQR) is the difference between the third quartile and the first quartile (\(Q_3 - Q_1\)).

Population variance: \[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2.\]
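The measures above can be computed by hand and checked against R's built-in functions. This sketch reuses the sample 54, 71, 57, 70, 53; the last line divides by \(n\) instead of \(n-1\) only to show how the two divisors differ, treating the five values as if they were a whole population.

```r
x <- c(54, 71, 57, 70, 53)
n <- length(x)

deviations <- x - mean(x)                  # x_i - x_bar
s2_manual <- sum(deviations^2) / (n - 1)   # sample variance by the formula

s2 <- var(x)                    # built-in sample variance (divides by n - 1)
s <- sd(x)                      # sample standard deviation, sqrt(s2)
data_range <- max(x) - min(x)   # range as a single number
iqr <- IQR(x)                   # Q3 - Q1 under R's default quantile definition

# Dividing by n instead, as in the population variance formula.
pop_style <- sum(deviations^2) / n

c(s2_manual = s2_manual, s2 = s2, s = s,
  range = data_range, IQR = iqr, pop_style = pop_style)
```

Note that `range(x)` in R returns the minimum and maximum rather than their difference, hence `max(x) - min(x)` here.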

Boxplots

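Lower fence: L = \(Q_1 - 1.5 \times IQR\)
Upper fence: U = \(Q_3 + 1.5 \times IQR\)
Observations below L or above U are drawn as individual points (potential outliers).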

Boxplots do not work well for small datasets and certainly not for \(n < 5\).
Boxplots are poor at showing multimodal distributions.
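The limits L and U can be computed directly. In this sketch the value 120 is a made-up extreme observation added to the earlier sample, and the quartiles come from `quantile()` with its default definition; base R's `boxplot()` uses hinges from `fivenum()`, which can differ slightly.

```r
# Earlier sample plus one hypothetical extreme value (120).
x <- c(53, 54, 57, 70, 71, 120)

q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

lower <- q[[1]] - 1.5 * iqr  # L = Q1 - 1.5 * IQR
upper <- q[[2]] + 1.5 * iqr  # U = Q3 + 1.5 * IQR

# Observations outside [L, U] would be drawn as individual points.
outliers <- x[x < lower | x > upper]

c(lower = lower, upper = upper)
outliers
```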

Case study STAT1003 mark distribution

How hard is STAT1003 at ANU for a typical undergraduate student?

Here is a sample of assignment and quiz marks:

Five number summary: (55, 80, 88, 93, 100)

Mode: 6

Note: the five-number summary is (minimum, \(Q_1\), median, \(Q_3\), maximum)
What do you think based on the distribution of marks for assignment and quiz?
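A five-number summary can be obtained in R with `fivenum()`. The case-study marks are not listed here, so this sketch uses the small sample from the mean and median example instead. `fivenum()` reports Tukey's hinges for the quartile positions, which may differ slightly from `quantile()` on other data.

```r
x <- c(54, 71, 57, 70, 53)

# (minimum, lower hinge, median, upper hinge, maximum)
five <- fivenum(x)
five

summary(x)  # similar output, with the mean added
```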

Summary

Summary statistics describe main characteristics of the data

Frequency table
Mode
Barplot

Skewness
Modality
Quantiles
A measure of central tendency: mean and median
A measure of dispersion: range, IQR, variance and standard deviation
Histogram
Boxplot
