STAT1003 – Statistical Techniques
Dr. Emi Tanaka
Australian National University
Statistics is defined as the science and technology of obtaining useful information from data, taking its variability into account.

The best thing about being a statistician is that you get to play in everyone’s backyard.
Starting point
❓ I have a question
I have a dataset
🤔 I have this question
What does the data tell me about my question?
Technical proficiency (understanding statistical methods and being skilled with statistical software for extracting and analysing data) alone isn’t enough for practice. Think holistically.

How hard is a first year statistics course?
Types of variables include:
Data / variable may be captured as:
Subset of marks for STAT1003 students in 2025
- quiz = quiz score out of 6
- assignment = assignment score out of 100
- exam = exam score out of 100
- week2, week3, …, week12 = tutorial attendance for weeks (1 = attended, 0 = absent)

| quiz | assignment | exam | week2 | week3 | week4 | week5 | week6 | week7 | week8 | week9 | week10 | week11 | week12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6.0 | 60 | 14 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5.0 | 75 | 79 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 5.5 | 90 | 97 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Always get to know the data first
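As a minimal sketch of getting to know the data, the three rows of marks shown above can be typed into R by hand and inspected. The object names `marks`, `attendance` and `n_attended` are illustrative assumptions, not names from the course materials.

```r
# Enter the three displayed rows by hand (illustrative object names).
marks <- data.frame(
  quiz       = c(6.0, 5.0, 5.5),
  assignment = c(60, 75, 90),
  exam       = c(14, 79, 97)
)

# Tutorial attendance for week2 to week12 (1 = attended, 0 = absent).
attendance <- rbind(
  c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0),
  c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)
colnames(attendance) <- paste0("week", 2:12)
marks <- cbind(marks, attendance)

# Look at the structure and a quick summary before any analysis.
str(marks)
summary(marks[c("quiz", "assignment", "exam")])

# Number of tutorials attended by each student.
marks$n_attended <- rowSums(marks[paste0("week", 2:12)])
marks$n_attended
```

Even with three rows, this kind of check shows at a glance that the student with the lowest exam score attended only one tutorial.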
Populations have parameters: a descriptive measure of a population that is usually unobservable and unknown.
Sample statistics are estimated from sample data and used to make inferences about population parameters.
How hard is STAT1003 at ANU for a typical undergraduate student as measured by the average final grade earned by students in STAT1003?
But we only observe data from a sample of \(n\) students.
If \(x_i\) denotes the final grade of the \(i\)-th sampled student, then the sample consists of the values: \[x_1, x_2, \dots, x_n.\]
Sample size is usually much smaller than population size: \(n \ll N\)
Let \(\mu\) denote the population mean (average) final grade of all STAT1003 students. \[\begin{align*} \mu &= \frac{1}{N}(x_{1'} + x_{2'} + \dots + x_{N'}) = \frac{1}{N}\sum_{i=1}^{N} x_{i'}\\ &= {\tiny \frac{1}{14}(73 + 60 + 54 + 62 + 71 + 68 + 57 + 60 + 72 + 57 + 35 + 53 + 58 + 70)} \approx 60.7\\ \end{align*}\]
Let \(\bar{x}\) denote the sample mean (average) final grade of the sampled STAT1003 students. \[\begin{align*} \bar{x} &= \frac{1}{n}(x_1 + x_2 + \dots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i\\ &= {\tiny \frac{1}{5}(54 + 71 + 57 + 70 + 53)} = 61\\ \end{align*}\]
\(\bar{x}\) is used to estimate \(\mu\).
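The two calculations above can be reproduced in R. This is a sketch that uses the grades from the formulas; the names `population` and `sampled` are illustrative assumptions.

```r
# Final grades of all N = 14 students (the population in this toy example).
population <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)

# Final grades of the n = 5 sampled students.
sampled <- c(54, 71, 57, 70, 53)

N <- length(population)
n <- length(sampled)

# Population mean (a parameter, usually unknown in practice).
mu <- sum(population) / N

# Sample mean (a statistic used to estimate mu).
xbar <- sum(sampled) / n

round(mu, 1)  # 60.7
xbar          # 61
```

In practice only `sampled` would be observed, so `xbar` is the value that can actually be computed.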
Population parameters are typically denoted by Greek letters, e.g. \(\mu\) for the population mean and \(\sigma^2\) for the population variance.
Population size is often denoted by \(N\).
Garbage in, garbage out (GIGO): the quality of the output is determined by the quality of the input.
Data collection methods include:
Suppose a study tracked sunscreen use and skin cancer, and it was found that the more sunscreen someone used, the more likely the person was to have skin cancer. Does this mean sunscreen causes skin cancer?



\(\sqrt{3}\)
\(|-3|\)
\(e^1 = e\)
\(\log_e (4) = \ln (4)\)
\(1 + 2 + 3 = \displaystyle\sum_{i = 1}^3 i\)
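Each of the expressions above has a direct counterpart in base R. The following sketch simply evaluates them; the variable names are illustrative.

```r
root_three <- sqrt(3)    # square root of 3
abs_value  <- abs(-3)    # absolute value of -3
euler      <- exp(1)     # e^1 = e
log_four   <- log(4)     # natural logarithm (base e) of 4
total      <- sum(1:3)   # 1 + 2 + 3

root_three
abs_value
euler
log_four
total
```

Note that `log()` in R is the natural logarithm by default, matching \(\ln\).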

R has 7 packages (base, datasets, graphics, grDevices, utils, stats, and methods), collectively referred to as “Base R”, that are loaded automatically when you launch it.

Photo by Sara Kurfeß on Unsplash
If a package (e.g. praise) is on CRAN, you can install it by: install.packages("praise")
You only need to run install.packages() once!
Use package::function() to use a function without loading the package.
RStudio Desktop (or RStudio IDE)

Console or Source
- ?function or help(function) to look at the function documentation.
- install.packages() to install a package (only once).
- library() to load a package.
- package::function() to use a function from a package without loading it.
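A short sketch of these four commands in use follows. The help and install lines are commented out because they open a help page or need an internet connection. The choice of `median` and `praise` as examples is an assumption for illustration.

```r
# Look at the documentation (uncomment in an interactive session).
# ?median
# help("median")

# Install a package from CRAN (needs internet; only needed once).
# install.packages("praise")

# Load a package for the current session.
# stats ships with R, so this call works without installing anything.
library(stats)

# Use a function from a package without loading it, via package::function().
middle <- stats::median(c(53, 54, 57, 70, 71))
middle  # 57
```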

tips data
Use the data() function to see available datasets and to load them.
The tips data from the GGally package records the tips a waiter received in one restaurant.
The tidyverse package is useful for this purpose (we will discuss this more later).
A statistical summary (or descriptive statistics) provides key numerical and graphical measures that concisely describe the main characteristics of a dataset.
Measures of Central Tendency
Measures of Dispersion (Spread)
Tabular Summaries
Graphical Summaries
There are two types of categorical data (or variables), also referred to as qualitative data:
Numerical variables can be transformed to or captured as ordinal variables, e.g.
Some useful numerical summaries include:
The ggplot2 package is used for all data visualisation (taught in more detail later).
Bar charts / bar plots
Pie charts (avoid using)
How do these plots differ?
There are two main types of numerical data:
Some variables are continuous, but measured in a discrete manner, e.g. age (in years).
For discrete data, we can use a barplot to visualise the distribution.
For continuous data, we can use a histogram to visualise the distribution.
The number of bins does affect the histogram appearance, so explore different values to see how it changes the plot.
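To see how the choice of bins changes a histogram, here is a minimal sketch using base R's `hist()` rather than ggplot2 (a swapped-in tool, chosen so the example runs without extra packages). The data are the 14 final grades from the population mean example; the names `wide` and `narrow` are illustrative.

```r
grades <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)

# Same data, two different sets of bin edges.
wide   <- hist(grades, breaks = seq(30, 80, by = 10), plot = FALSE)
narrow <- hist(grades, breaks = seq(30, 80, by = 5),  plot = FALSE)

length(wide$counts)    # 5 bins of width 10
length(narrow$counts)  # 10 bins of width 5

# Counts per bin; both sets of counts add up to the 14 observations.
wide$counts
narrow$counts

# Draw the two histograms side by side to compare their shapes.
op <- par(mfrow = c(1, 2))
plot(wide, main = "Bin width 10")
plot(narrow, main = "Bin width 5")
par(op)
```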
A measure of central tendency is a location of the “middle”, “center”, or “expected value” of the distribution of your data.

Sample mean (or average) and median are examples of measures of central tendency
What is the average customer tip?
The sample mean or average is:
\[\bar{x} = \frac{1}{n}(x_1+x_2 +\dots + x_n) = \frac{1}{n}\sum_{i=1}^nx_i.\]
The sample median is:
Sample data: \[54, 71, 57, 70, 53\]
The (sample) mean is \[(54 + 71 + 57 + 70 + 53)/5 = 61.\]
Sorted sample data: \[53, 54, 57, 70, 71\]
So (sample) median is \(57\).
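The median calculation above can be sketched in R by sorting and picking the middle position, then checked against the built-in `median()`. The name `middle_position` is illustrative, and the manual step assumes an odd sample size, as in this example.

```r
x <- c(54, 71, 57, 70, 53)

# Sort the data, then take the middle value (n is odd here).
sorted <- sort(x)
n <- length(x)
middle_position <- (n + 1) / 2

sorted                    # 53 54 57 70 71
sorted[middle_position]   # 57

# The built-in function gives the same answer.
median(x)                 # 57
```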
Symmetric

Positively skewed or right skewed

Negatively skewed or left skewed

The sample mode is the value with the highest frequency.
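Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead), so one possible sketch is to tabulate the frequencies and keep the most frequent value or values. The data are the 14 final grades from the population mean example; `freq` and `modes` are illustrative names.

```r
grades <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)

# Frequency of each distinct value.
freq <- table(grades)

# Keep every value that attains the highest frequency (ties are possible).
modes <- as.numeric(names(freq)[freq == max(freq)])
modes  # 57 60
```

Here two values tie for the highest frequency, which is a reminder that the mode need not be unique.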
Unimodal distribution

Bimodal distribution

Multimodal distribution

A \(p\)-quantile is the value below which a proportion \(p\) (where \(0 < p < 1\)) of your data lie.
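As a sketch, quantiles of the small sample used earlier can be computed with `quantile()`. Several conventions exist for interpolating between data points (the `type` argument); the default is used here, so a hand calculation under a different convention may give slightly different values.

```r
x <- c(54, 71, 57, 70, 53)

# 0.25-, 0.5- and 0.75-quantiles using R's default method.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q  # 54, 57 and 70

# The 0.5-quantile coincides with the sample median.
median(x)  # 57
```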
A measure of dispersion/spread is a number representing the spread of data around a measure of central tendency.


Population variance: \[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{i'} - \mu)^2.\]
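The population variance formula can be applied directly in R. This sketch uses the 14 final grades from the population mean example; `sigma2` is an illustrative name. Note that the built-in `var()` divides by \(N - 1\) rather than \(N\), so it does not return the population variance.

```r
population <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)

N  <- length(population)
mu <- mean(population)

# Population variance: average squared deviation from mu (divide by N).
sigma2 <- sum((population - mu)^2) / N
sigma2            # about 94.78

# var() uses the N - 1 divisor, so it is slightly larger.
var(population)   # about 102.07
```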

L = \(Q_1 - 1.5 \times IQR\)
U = \(Q_3 + 1.5 \times IQR\)
How hard is STAT1003 at ANU for a typical undergraduate student?
Here are sample assignment and quiz marks:

Five number summary: (55, 80, 88, 93, 100)
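As a worked example, the limits \(L\) and \(U\) defined earlier can be computed from this summary. This assumes the five numbers are ordered as (minimum, \(Q_1\), median, \(Q_3\), maximum) and that \(IQR = Q_3 - Q_1\). \[\begin{align*} IQR &= Q_3 - Q_1 = 93 - 80 = 13\\ L &= Q_1 - 1.5 \times IQR = 80 - 19.5 = 60.5\\ U &= Q_3 + 1.5 \times IQR = 93 + 19.5 = 112.5\\ \end{align*}\]
Under these assumptions, the minimum of 55 lies below \(L\), so at least one mark falls outside the limits, while the maximum of 100 lies below \(U\).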

Mode: 6
