STAT1003 – Statistical Techniques
Dr. Emi Tanaka
Australian National University
tips data
Use the data() function to see the available datasets and to load them.
The tips data from the GGally package records the tips a waiter received in one restaurant.
The tidyverse package is useful for this purpose (we will discuss this more later).
A statistical summary (or descriptive statistics) provides key numerical and graphical measures that concisely describe the main characteristics of a dataset.
Measures of Central Tendency
Measures of Dispersion (Spread)
Tabular Summaries
Graphical Summaries
There are two types of categorical data (or variables), also referred to as qualitative data: nominal (categories with no natural order) and ordinal (categories with a natural order).
Numerical variables can be transformed to or captured as ordinal variables, e.g.
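For instance, a numerical variable can be binned into ordered categories with the base R function cut(). This is only a minimal sketch: the bill amounts, the cut points, and the labels below are made-up illustrative values, not taken from the tips data.

```r
# Hypothetical bill amounts (illustrative values only)
total_bill <- c(8.5, 12.0, 19.9, 25.3, 40.1)

# Bin the numerical variable into ordered categories (an ordinal variable).
# Intervals are (0, 10], (10, 20], (20, Inf) with assumed cut points.
bill_size <- cut(
  total_bill,
  breaks = c(0, 10, 20, Inf),
  labels = c("small", "medium", "large"),
  ordered_result = TRUE
)

bill_size
table(bill_size)
```

Setting ordered_result = TRUE keeps the ordering small < medium < large, which is what makes the result ordinal rather than nominal.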
Some useful numerical summaries include:
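Counts and proportions are two common numerical summaries for a categorical variable (an assumption here about which summaries are meant). A minimal base R sketch, using made-up day values rather than the tips data:

```r
# Hypothetical categorical data (illustrative values only)
day <- c("Sat", "Sun", "Sat", "Thur", "Fri", "Sat", "Sun")

counts <- table(day)               # frequency of each category
proportions <- prop.table(counts)  # relative frequency of each category

counts
round(proportions, 2)
```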
The ggplot2 package is used for all data visualisation (taught in more detail later).
Bar charts / bar plots
Pie charts (avoid using)
How do these plots differ?
There are two main types of numerical data: discrete and continuous.
Some variables are continuous, but measured in a discrete manner, e.g. age (in years).
For discrete data, we can use a barplot to visualise the distribution.
For continuous data, we can use a histogram to visualise the distribution.
The number of bins affects the appearance of the histogram, so explore different values to see how the plot changes.
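One way to explore the effect of the bin choice is sketched below in base R. The built-in faithful dataset is used purely as a stand-in for a continuous variable, and the bin edges are arbitrary choices for illustration.

```r
# Built-in dataset used as a stand-in for a continuous variable
x <- faithful$eruptions

# Compute histograms without drawing them, using explicit bin edges
h_few  <- hist(x, breaks = seq(1.5, 5.5, by = 1),    plot = FALSE)  # 4 bins
h_many <- hist(x, breaks = seq(1.5, 5.5, by = 0.25), plot = FALSE)  # 16 bins

length(h_few$counts)
length(h_many$counts)

# Draw them side by side to compare the shapes
op <- par(mfrow = c(1, 2))
plot(h_few,  main = "4 bins")
plot(h_many, main = "16 bins")
par(op)
```

With few bins the two clusters in the data are blurred together, while more bins show them clearly. In ggplot2, the bins argument of geom_histogram() plays a similar role.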
A measure of central tendency is a location of the “middle”, “center”, or “expected value” of the distribution of your data.

The sample mean (or average) and the sample median are examples of measures of central tendency.
What is the average customer tip?
The sample mean or average is:
\[\bar{x} = \frac{1}{n}(x_1+x_2 +\dots + x_n) = \frac{1}{n}\sum_{i=1}^nx_i.\]
The sample median is the middle value of the sorted data (or the average of the two middle values when \(n\) is even).
Sample data: \[54, 71, 57, 70, 53\]
The (sample) mean is \[(54 + 71 + 57 + 70 + 53)/5 = 61.\]
Sorted sample data: \[53, 54, 57, 70, 71\]
So (sample) median is \(57\).
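The same calculation can be checked in R with the base functions mean() and median(). This is a small sketch using the sample data above; the variable names are arbitrary.

```r
# Sample data from the worked example
x <- c(54, 71, 57, 70, 53)

sort(x)                     # 53 54 57 70 71
sample_mean <- mean(x)      # (54 + 71 + 57 + 70 + 53) / 5
sample_median <- median(x)  # middle value of the sorted data

sample_mean
sample_median
```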
Symmetric

Positively skewed (or right skewed)

Negatively skewed (or left skewed)

The sample mode is the value with the highest frequency.
Unimodal distribution

Bimodal distribution

Multimodal distribution
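Base R has no built-in function for the sample mode (the function mode() reports the storage type of an object, not the most frequent value). One possible sketch uses table(); the party sizes below are made-up illustrative values.

```r
# Hypothetical discrete data (illustrative values only)
party_size <- c(2, 3, 2, 4, 2, 3, 6, 2)

freq <- table(party_size)  # frequency of each distinct value

# Value(s) with the highest frequency; returns several values if there is a tie
sample_mode <- as.numeric(names(freq)[as.vector(freq) == max(freq)])

sample_mode
```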

A \(p\)-quantile is the value below which a proportion \(p\) (where \(0 < p < 1\)) of your data lies.
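In R, quantiles can be computed with quantile(). The sketch below reuses the five-value sample from the mean and median example. Note that several conventions exist for sample quantiles; quantile() uses its default (type = 7) here, and other software or textbooks may give slightly different values.

```r
# Sample data from the mean/median example
x <- c(54, 71, 57, 70, 53)

# 0.25-, 0.5- and 0.75-quantiles (first quartile, median, third quartile)
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

q
```

The 0.5-quantile agrees with the sample median of 57.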
A measure of dispersion/spread is a number representing the spread of data around a measure of central tendency.


Population variance: \[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{i} - \mu)^2.\]
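As a sketch of the formula in R: the base function var() divides by \(n - 1\) (the sample variance), so the population formula is computed by hand here. Treating the five values from the mean and median example as a whole population is an assumption made only for illustration.

```r
# Five values, treated here as an entire population for illustration
x <- c(54, 71, 57, 70, 53)

N <- length(x)
mu <- mean(x)

pop_var <- sum((x - mu)^2) / N  # population variance: divide by N
samp_var <- var(x)              # sample variance: divides by N - 1

pop_var
samp_var
```

The squared deviations sum to 310, giving 310 / 5 = 62 for the population formula and 310 / 4 = 77.5 from var().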

L = \(Q_1 - 1.5 \times IQR\)
U = \(Q_3 + 1.5 \times IQR\)
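Assuming L and U are the lower and upper bounds used to flag potential outliers (the usual boxplot convention), they could be computed in R as sketched below. The data are the five values from the mean and median example, and the quartiles use the default convention of quantile().

```r
# Sample data from the mean/median example
x <- c(54, 71, 57, 70, 53)

q1 <- quantile(x, 0.25, names = FALSE)  # first quartile
q3 <- quantile(x, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                          # interquartile range

lower <- q1 - 1.5 * iqr  # L
upper <- q3 + 1.5 * iqr  # U

# Observations outside [L, U] would be flagged as potential outliers
outliers <- x[x < lower | x > upper]

c(L = lower, U = upper)
outliers
```

For this sample, no observation falls outside the bounds.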
How hard is STAT1003 at ANU for a typical undergraduate student?
Here is a sample of assignment and quiz marks:

Five number summary: (55, 80, 88, 93, 100)

Mode: 6
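The marks behind these summaries are not reproduced here, so as a sketch only, the base R function fivenum() is shown on the five-value sample from the mean and median example. It returns the minimum, lower hinge, median, upper hinge, and maximum; the hinges can differ slightly from the quartiles given by quantile().

```r
# Sample data from the mean/median example (not the marks data)
x <- c(54, 71, 57, 70, 53)

five <- fivenum(x)  # minimum, lower hinge, median, upper hinge, maximum

five
```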
