Summary statistics for univariate data

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University

R package datasets: tips data

  • R comes with many built-in datasets that are helpful for learning and practicing data analysis.
  • Use the data() function to see available datasets and to load them.
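
As a rough sketch of the two uses of `data()` mentioned above, the snippet below lists the datasets bundled with R's built-in datasets package and then loads one of them. The choice of `mtcars` is only for illustration, and the commented-out last line assumes the GGally package is installed.

```r
# List the names of the datasets shipped with the built-in 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load one built-in dataset into the workspace (mtcars chosen only as an example)
data("mtcars")
dim(mtcars)  # number of rows and number of columns

# Assumption: GGally is installed; this would load the tips data used in these slides
# data(tips, package = "GGally")
```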

The tips data from the GGally package records the tips that one waiter received in one restaurant.

Initial data analysis

  • When given a dataset, start with exploring the data.
  • The tidyverse package is useful for this purpose (we will discuss this more later).
  • What is the sample size?
  • What is the observational unit?
  • Which of the variables are categorical data? Which ones are numerical data?
  • Classify the categorical variables as ordinal or nominal.
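
The questions above can be answered with a few base R calls. The sketch below uses `tips_mini`, a hypothetical stand-in with made-up rows (not the real tips data), so that it runs without any extra packages. Labelling each column from its storage type is only a first guess and should be checked against what the variable means.

```r
# Hypothetical stand-in for the tips data: made-up rows, not the real dataset
tips_mini <- data.frame(
  total_bill = c(18.50, 9.75, 22.10, 30.00, 14.20, 27.80),
  tip        = c(2.50, 1.50, 3.00, 5.00, 2.00, 4.00),
  day        = factor(c("Thur", "Fri", "Sat", "Sat", "Sun", "Sun")),
  size       = c(2L, 1L, 3L, 4L, 2L, 4L)
)

# Sample size: the number of rows, one per observational unit
n <- nrow(tips_mini)

# First guess at each variable's type, based on how the column is stored
var_types <- sapply(
  tips_mini,
  function(col) if (is.numeric(col)) "numerical" else "categorical"
)

n
var_types
str(tips_mini)  # compact overview of every column
```

A variable stored as a number can still be categorical (for example, a numeric code for a group), so the output of `var_types` is a starting point rather than a final classification.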

Statistical summary for univariate data

A statistical summary (or descriptive statistics) provides key numerical and graphical measures that concisely describe the main characteristics of a dataset.

Measures of Central Tendency

  • Mean (average)
  • Median
  • Mode

Measures of Dispersion (Spread)

  • Range
  • Variance
  • Standard deviation
  • Interquartile range (IQR)

Tabular Summaries

  • Frequency tables
  • Contingency tables (cross-tabulations)

Graphical Summaries

  • Histograms
  • Boxplots
  • Bar charts
  • Scatterplots
  • Etc

Categorical variables

There are two types of categorical data (or variables), also referred to as qualitative data:

  • Nominal
    • no ordering or relationship
    • e.g. marital status, eye color, job, degree, race
  • Ordinal
    • have a distinct ordering
    • e.g.,
      • ranking teacher as “poor/fair/good”,
      • survey answer “strongly disagree/disagree/agree/strongly agree”

[Diagram: Categorical branches into Nominal and Ordinal]

Numerical variables can be transformed to or captured as ordinal variables, e.g.

  • income brackets: [0, 1000), [1000, 2000), [2000, 3000), 3000+,
  • age ranges: [0, 18], (18, 30], (30, 50), [50, 75), 75+.
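
One way to turn a numerical variable into an ordinal one in R is `cut()`. The sketch below reproduces the income brackets listed above. The `income` values are hypothetical and were chosen so that one of them falls exactly on a bracket boundary.

```r
# Hypothetical incomes, including one value (1000) on a bracket boundary
income <- c(500, 1000, 2500, 4000)

# right = FALSE makes each interval closed on the left: [lower, upper)
bracket <- cut(
  income,
  breaks = c(0, 1000, 2000, 3000, Inf),
  labels = c("[0, 1000)", "[1000, 2000)", "[2000, 3000)", "3000+"),
  right = FALSE,
  ordered_result = TRUE
)

bracket
table(bracket)  # counts per bracket, shown in bracket order
```

Because `ordered_result = TRUE`, the result is an ordered factor, so the brackets keep their natural order in tables and plots.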

Numerical summary for a categorical variable

Useful numerical summaries include:

  • Frequency (count) of each category
  • Relative frequency (proportion or percentage) of each category
  • Mode: the most frequently occurring category
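
These three summaries can be sketched in base R as follows. The `day` vector is hypothetical and only illustrates the calls.

```r
# Hypothetical categorical observations (day of the week for eight bills)
day <- c("Sat", "Sun", "Sat", "Thur", "Fri", "Sat", "Sun", "Thur")

freq     <- table(day)        # frequency (count) of each category
rel_freq <- prop.table(freq)  # relative frequency (proportion) of each category

# Mode: the category with the largest count.
# Note: which.max() returns only the first category if several are tied.
mode_day <- names(freq)[which.max(freq)]

freq
round(100 * rel_freq, 1)  # relative frequencies as percentages
mode_day
```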

Graphical summary for a categorical variable

  • We use the ggplot2 package for all data visualisation (taught in more detail later).

Bar charts / bar plots

Pie charts (avoid using)

Nominal vs. Ordinal variables

  • We can use exactly the same statistics for ordinal data we used for nominal data, e.g., frequency tables, bar charts, pie charts, etc.
  • For ordinal data, preserve the order of the categories.
  • For nominal data, reorder the categories based on another variable (if appropriate).

Plot 1

Code
# Bars follow the existing level order of day
ggplot(tips) +
  geom_bar(aes(y = day))

Plot 2

Code
# Reorder the levels of day by the number of observations on each day
tips |> 
  mutate(day = reorder(day, day, length)) |> 
  ggplot() +
    geom_bar(aes(y = day))

Plot 3

Code
# Calendar order, reversed because the y axis lists levels from bottom to top
day_order <- rev(c("Thur", "Fri", "Sat", "Sun"))
tips |> 
  mutate(day = factor(day, levels = day_order)) |> 
  ggplot() +
    geom_bar(aes(y = day))

How do these plots differ?

Numerical variables

There are two main types of numerical data:

  • Continuous
    • measured in infinitely small increments
    • e.g. height, weight, portfolio returns, and stock prices
  • Discrete
    • measured in fixed increments
    • e.g. number of cars you own, and number of heads in three coin flips

[Diagram: Numerical branches into Continuous and Discrete]

Some variables are continuous, but measured in a discrete manner, e.g. age (in years).

Graphical summary for numerical variable

For discrete data, we can use a barplot to visualise the distribution.

For continuous data, we can use a histogram to visualise the distribution.

Histogram

The number of bins affects the appearance of a histogram, so try different values to see how the plot changes.
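
To see the effect of binning without relying on any plotting package, the sketch below uses base R's `hist()` with `plot = FALSE`, which returns the bin counts instead of drawing. The built-in `faithful` dataset and the two bin widths are choices made for this illustration, not part of the slides.

```r
# Built-in dataset: eruption durations (in minutes) of the Old Faithful geyser
x <- faithful$eruptions

# Same data, two bin widths; plot = FALSE returns the bins without drawing
h_coarse <- hist(x, breaks = seq(1.5, 5.5, by = 0.5),  plot = FALSE)
h_fine   <- hist(x, breaks = seq(1.5, 5.5, by = 0.25), plot = FALSE)

length(h_coarse$counts)  # number of bins with width 0.5
length(h_fine$counts)    # number of bins with width 0.25

h_coarse$counts  # observations per bin, coarse version
h_fine$counts    # observations per bin, fine version
```

Dropping `plot = FALSE` draws the histogram. In ggplot2, the corresponding controls are the `bins` and `binwidth` arguments of `geom_histogram()`.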

A measure of central tendency

A measure of central tendency describes the location of the “middle”, “center”, or “expected value” of the distribution of your data.

  • Sample mean (or average) and median are examples of measures of central tendency

  • What is the average customer tip?

Sample mean and median

The sample mean or average is:

\[\bar{x} = \frac{1}{n}(x_1+x_2 +\dots + x_n) = \frac{1}{n}\sum_{i=1}^nx_i.\]

The sample median is:

  • the middle value of the sorted observations when \(n\) is odd, and
  • average of the two middle sorted observations when \(n\) is even.

Sample data: \[54, 71, 57, 70, 53\]

The (sample) mean is \[(54 + 71 + 57 + 70 + 53)/5 = 61.\]

Sorted sample data: \[53, 54, 57, 70, 71\]

So (sample) median is \(57\).

  • The mean is commonly used
  • But the median is more robust to extreme observations (outliers).
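
The worked example above can be checked in R, and a small experiment shows what robustness means in practice. The second vector is a hypothetical version of the same data in which 71 has been mistyped as 710.

```r
x <- c(54, 71, 57, 70, 53)  # sample data from the slide

mean(x)    # (54 + 71 + 57 + 70 + 53) / 5 = 61
median(x)  # middle of the sorted values 53, 54, 57, 70, 71, which is 57

# Hypothetical data-entry error: 71 recorded as 710
x_out <- c(54, 710, 57, 70, 53)

mean(x_out)    # pulled up to 188.8 by the single extreme value
median(x_out)  # unchanged at 57
```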

Skewness

  • Skewness is a measure of asymmetry in a given distribution


Symmetric


Mean \(\approx\) Median

Positively skewed or
Right skewed

Mean > Median

Negatively skewed or
Left skewed

Mean < Median
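
A quick numerical illustration of the mean and median comparison, using two small hypothetical samples that are mirror images of each other. The comparison is a rule of thumb rather than a guarantee, so it is worth checking against a histogram.

```r
# Hypothetical right-skewed sample: most values are small, one is large
x_right <- c(1, 2, 2, 3, 3, 3, 4, 20)
mean(x_right)    # 4.75
median(x_right)  # 3, so mean > median

# Mirror image of the same sample, giving a left-skewed sample
x_left <- 21 - x_right
mean(x_left)    # 16.25
median(x_left)  # 18, so mean < median
```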

Modality

The sample mode is the value with the highest frequency.

  • Mode is useful for categorical data.
  • For numerical data, mode is less useful as there may be no repeated values.
  • However, we can look at the modality of a distribution: number of peaks in the distribution.

Unimodal distribution

Bimodal distribution

Multimodal distribution

Quantiles

A \(p\)-quantile (where \(0 < p < 1\)) is the value below which a proportion \(p\) of your data lie.

  • Note: quantiles do not need to be data values.
  • Quartiles are special quantiles that divide the data into four equal parts:
    • First quartile (\(Q_1\)) or lower quartile is the 0.25 quantile
    • Second quartile (\(Q_2\)) or median is the 0.50 quantile
    • Third quartile (\(Q_3\)) or upper quartile is the 0.75 quantile
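
In R, quantiles can be computed with `quantile()`. The sketch below uses the five values from the mean and median example. Note that several conventions exist for interpolating between data values; the results shown are for R's default (`type = 7`), and other software may give slightly different numbers.

```r
x <- c(54, 71, 57, 70, 53)

# Quartiles: the 0.25, 0.50 and 0.75 quantiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q  # Q1 = 54, Q2 (median) = 57, Q3 = 70

# A quantile need not be one of the data values
q10 <- quantile(x, probs = 0.10)
q10  # 53.4, interpolated between the two smallest values 53 and 54
```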

A measure of dispersion

A measure of dispersion/spread is a number representing the spread of data around a measure of central tendency.

  • E.g. range, interquartile range (IQR), variance, standard deviation.

Measures of dispersion

  • Sample deviation: the signed difference between an observation and the sample mean, \(x_i-\bar{x}\)
  • Sample variance: \[s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2.\]
  • Sample standard deviation: the square root of the sample variance, \(s = \sqrt{s^2}\)
    • Conveys similar information to the variance, but is measured in the same units as the data
  • The range is the difference between the maximum and minimum values in the dataset.
  • The interquartile range (IQR) is the difference between the third quartile and the first quartile (\(Q_3 - Q_1\)).
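
As a sketch, each of these measures can be computed step by step for the five values used in the mean and median example and compared with R's built-in functions. `IQR()` relies on R's default quantile convention, so other software may report a slightly different value.

```r
x <- c(54, 71, 57, 70, 53)
n <- length(x)

deviations <- x - mean(x)          # -7, 10, -4, 9, -8 (they sum to zero)
s2 <- sum(deviations^2) / (n - 1)  # 310 / 4 = 77.5
s  <- sqrt(s2)                     # about 8.80, in the same units as x

all.equal(s2, var(x))  # TRUE: matches the built-in sample variance
all.equal(s, sd(x))    # TRUE: matches the built-in standard deviation

x_range <- diff(range(x))  # max - min = 71 - 53 = 18
x_iqr   <- IQR(x)          # Q3 - Q1 = 70 - 54 = 16
x_range
x_iqr
```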

Population variance: \[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2.\]
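
R's `var()` divides by \(n - 1\), so the population formula has to be written out by hand. The sketch below assumes, purely for illustration, that the five values from the mean and median example form an entire population.

```r
# Assumption for illustration only: these five values are the whole population
x  <- c(54, 71, 57, 70, 53)
N  <- length(x)
mu <- mean(x)  # population mean, 61

sigma2 <- sum((x - mu)^2) / N  # 310 / 5 = 62
sigma2

# Equivalent: rescale the sample variance, which divides by N - 1 instead of N
sigma2_rescaled <- var(x) * (N - 1) / N
sigma2_rescaled
```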

Boxplots

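Lower fence: L = \(Q_1 - 1.5 \times \text{IQR}\)
Upper fence: U = \(Q_3 + 1.5 \times \text{IQR}\)

Observations below L or above U are plotted individually as potential outliers.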

  • Boxplots do not work well for small datasets and certainly not for \(n < 5\).
  • Boxplots are poor at showing multimodal distributions.
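
The limits L and U can be computed directly. The sketch below uses a small hypothetical sample with one unusually large value, and R's default quantile convention for the quartiles.

```r
# Hypothetical sample with one unusually large value
x <- c(12, 15, 16, 18, 19, 20, 22, 23, 45)

q1  <- unname(quantile(x, 0.25))  # 16
q3  <- unname(quantile(x, 0.75))  # 22
iqr <- q3 - q1                    # 6

lower <- q1 - 1.5 * iqr  # L = 16 - 9 = 7
upper <- q3 + 1.5 * iqr  # U = 22 + 9 = 31

# Observations outside [L, U] are candidates for outliers
outliers <- x[x < lower | x > upper]
outliers  # 45
```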

Case study STAT1003 mark distribution

How hard is STAT1003 at ANU for a typical undergraduate student?

Here is a sample of assignment and quiz marks:

Five number summary: (55, 80, 88, 93, 100)

Mode: 6

  • Note: five number summary is (minimum, \(Q_1\), median, \(Q_3\), maximum)
  • What do you think based on the distribution of marks for assignment and quiz?
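
A five-number summary can be obtained in R with `fivenum()`. The marks behind the summary above are not reproduced here, so this sketch reuses the five values from the mean and median example instead.

```r
x <- c(54, 71, 57, 70, 53)

# (minimum, lower hinge, median, upper hinge, maximum)
five <- fivenum(x)
five  # 53 54 57 70 71

# summary() reports the same kind of information plus the mean
summary(x)
```

`fivenum()` uses Tukey's hinges for the two middle quartiles, which can differ slightly from the quartiles returned by `quantile()` for some sample sizes.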

Summary

  • Summary statistics describe main characteristics of the data
  • Frequency table
  • Mode
  • Barplot
  • Skewness
  • Modality
  • Quantiles
  • A measure of central tendency: mean and median
  • A measure of dispersion: range, IQR, variance and standard deviation
  • Histogram
  • Boxplot