Basics of Statistics

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University

What is statistics?

Statistics is defined as the science and technology of obtaining useful information from data, taking its variability into account.

– The Rousseeuw Prize for statistics

Statistics involves:
- designing the collection of data,
- organizing data,
- analyzing data,
- developing methods,
- interpreting the results, and
- communicating results.

Why study statistics?

The best thing about being a statistician is that you get to play in everyone’s backyard.

– John Tukey

In a data-rich world, statistical literacy is essential for everyone.
Statistics is essential for making sense of information in biology, medicine, physics, social sciences, finance, business, and numerous other fields.
Statistical literacy enables us to think critically and make evidence-based decisions.

Newspaper articles from The Australian in January 2026

Exploratory vs Confirmatory data analysis

Starting point

❓ I have a question

I have a dataset

🤔 I have this question

What does the data tell me about my question?

🕵️🕵️‍♀️

Become a data detective

Technical proficiency (understanding statistical methods and being skilled with statistical software for extracting and analyzing data) alone isn’t enough in practice. Think holistically.

Curiosity: Naturally inquisitive and eager to explore the “why” behind data anomalies or trends.
Problem-solving skills: Resourceful and persistent in finding solutions and overcoming data challenges.
Attention to detail: Notices subtle patterns, inconsistencies, or outliers others might miss.
Critical thinking: Evaluates information objectively, questions assumptions and sources, and maintains a healthy dose of skepticism.
Communication abilities: Clearly conveys insights and explanations to technical and non-technical audiences.
Ethical judgment: Handles data responsibly and respects privacy and security considerations.
Collaboration: Works well with colleagues from different domains.
Project management: Organizes work efficiently, sets goals, and meets deadlines during investigations.

Specify the population and scope

How hard is a first year statistics course?

Common pitfalls in developing (research) questions are:
- Questions are too broad
- Variables are not measurable
- Data collection to answer the question is not feasible
Good (research) questions are specific
Clarify the population and scope of interest: Who or What? Where? When?

Refine your question

How hard is STAT1003 at ANU for a typical undergraduate student as measured by, say:
- the average final grade earned by students in STAT1003,
- the percentage of students who fail or withdraw from STAT1003, and/or
- the grade distribution in STAT1003 compared to other elective courses?

Identify the key variables

The ability to answer a question depends on the variables measured.
First step: identify the key variables in your question or data.

Types of variables include:

Outcome / response / dependent variable:
- what you want to explain, predict or understand
- often denoted mathematically as \(y\)
Explanatory / predictor / covariate / independent variable:
- the variable thought to influence or explain the outcome
- often denoted mathematically as \(x\)
Confounding variable:
- related to both the explanatory and outcome variables
- often denoted mathematically as \(z\)

Measurement types

Second step: consider how the variables are measured.

Data / variable may be captured as:

Categorical (or qualitative) variables:
- Ordinal variables: ordered categories (e.g. satisfaction ratings)
- Nominal variables: categories with no clear ordering (e.g. hair color)
Numerical (or quantitative) variables:
- Discrete variables: a countable number of values (e.g. number of students)
- Continuous variables: any value within a range (e.g. height, weight)

Complex types, like images or video, are out of scope for this course.

Case study STAT1003 student marks

Subset of marks¹ for STAT1003 students in 2025

quiz = quiz score out of 6
assignment = assignment score out of 100
exam = exam score out of 100
week2, week3, …, week12 = tutorial attendance for weeks 2 to 12 (1 = attended, 0 = absent)

quiz	assignment	exam	week2	week3	week4	week5	week6	week7	week8	week9	week10	week11	week12
6.0	60	14	0	1	0	0	0	0	0	0	0	0	0
5.0	75	79	1	1	1	1	1	1	1	1	1	0	0
5.5	90	97	1	1	1	1	1	1	1	1	1	1	1

Always get to know the data first

What does each row correspond to?
Which variables are outcome variables?
Which ones are independent variables or other types of variables?
What is the measurement type of each variable?
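A first look at the data can be sketched in Python (chosen here for illustration; the slides do not prescribe a language). The sketch types in the three rows shown above by hand, treats each row as one student's record, and derives the number of tutorials attended. The names `students` and `attended` are made up for this example.

```python
# The three rows from the table above, typed in by hand.
weeks = [f"week{w}" for w in range(2, 13)]  # week2, ..., week12
columns = ["quiz", "assignment", "exam"] + weeks
rows = [
    [6.0, 60, 14, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [5.0, 75, 79, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [5.5, 90, 97, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
]

# One dictionary per row (assumption: each row is one student's record).
students = [dict(zip(columns, row)) for row in rows]

# Derived variable: number of tutorials attended, a count out of 11.
attended = [sum(s[w] for w in weeks) for s in students]

for s, a in zip(students, attended):
    print(
        f"quiz={s['quiz']}, assignment={s['assignment']}, "
        f"exam={s['exam']}, tutorials attended={a}/{len(weeks)}"
    )
```

Even with three rows, a summary like this prompts questions worth checking in the full data, such as whether attendance and exam score move together.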

Population vs. Sample

Populations have parameters. A parameter is a descriptive measure of a population and is usually unobservable and unknown.

Sample statistics are computed from sample data and used to make inferences about population parameters.

Ideally, we would measure every single unit of interest (e.g. marks of every STAT1003 student).
But this is often impractical, or the data are unavailable (we only have the 2025 data).
Instead, a (representative) sample from the population is used to make inferences about the population.

Mathematical setup

How hard is STAT1003 at ANU for a typical undergraduate student as measured by the average final grade earned by students in STAT1003?

Suppose there are \(N\) STAT1003 students since its inception.
If \(x_{i'}\) denotes the final grade of the \(i'\)-th student, then the population consists of the values: \[x_{1'}, x_{2'}, \dots, x_{N'}.\]

But we only observe data from a sample of \(n\) students.
If \(x_i\) denotes the final grade of the \(i\)-th sampled student, then the sample consists of the values: \[x_1, x_2, \dots, x_n.\]
Sample size is usually much smaller than population size: \(n \ll N\).

Population vs sample mean

Let \(\mu\) denote the population mean (average) final grade of all STAT1003 students. \[\begin{align*} \mu &= \frac{1}{N}(x_{1'} + x_{2'} + \dots + x_{N'}) = \frac{1}{N}\sum_{i'=1}^{N} x_{i'}\\ &= {\tiny \frac{1}{14}(73 + 60 + 54 + 62 + 71 + 68 + 57 + 60 + 72 + 57 + 35 + 53 + 58 + 70)} \approx 60.7\\ \end{align*}\]
Let \(\bar{x}\) denote the sample mean (average) final grade of the sampled STAT1003 students. \[\begin{align*} \bar{x} &= \frac{1}{n}(x_1 + x_2 + \dots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i\\ &= {\tiny \frac{1}{5}(54 + 71 + 57 + 70 + 53)} = 61\\ \end{align*}\]

\(\bar{x}\) is used to estimate \(\mu\).
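The two calculations above can be reproduced with a short Python sketch, using the 14 population values and the 5 sampled values from the worked example. The variable names are illustrative only.

```python
from statistics import mean

# Final grades of the N = 14 students in the worked example above.
population = [73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70]
# Final grades of the n = 5 sampled students.
sample = [54, 71, 57, 70, 53]

mu = mean(population)  # population mean, usually unknown in practice
x_bar = mean(sample)   # sample mean, used as the estimate of mu

print(f"mu    = {mu:.1f}")     # mu    = 60.7
print(f"x_bar = {x_bar:.1f}")  # x_bar = 61.0
```

In this toy example the whole population is known, so the estimate can be compared with the true value; in practice only `x_bar` would be available.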

Mathematical notations and conventions

Population parameters are typically denoted by Greek letters, e.g.
- Population mean/average: \(\mu\)
- Population variance: \(\sigma^2\)
Population size is often denoted by \(N\).

For a very large population, \(N\) is often treated as infinite.
Recall, we hardly ever know the values of population parameters.

Observed sample statistics are typically denoted by lower case Roman letters, e.g.
- Sample mean/average: \(\bar{x}\)
- Sample variance: \(s^2\)
We often use \(n\) for sample size.
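The variance notation can be illustrated in Python with the grades from the earlier worked example. The slides do not state the formulas, so this sketch assumes the usual conventions: \(\sigma^2\) divides the sum of squared deviations by \(N\), while \(s^2\) divides by \(n - 1\). These are what `statistics.pvariance` and `statistics.variance` compute.

```python
from statistics import pvariance, variance

# Grades from the worked example: N = 14 in the population, n = 5 sampled.
population = [73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70]
sample = [54, 71, 57, 70, 53]

sigma2 = pvariance(population)  # population variance: divides by N
s2 = variance(sample)           # sample variance: divides by n - 1

print(f"sigma^2 = {sigma2:.2f}")
print(f"s^2     = {s2:.2f}")  # s^2     = 77.50
```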

In this course, we will use:
- lower case Roman letters for observed sample statistics (estimates) and
- upper case Roman letters for yet to be observed sample statistics (estimators).
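The distinction between an estimator and an estimate can be sketched in Python. The function `sample_mean` below plays the role of the estimator (a rule applied to whatever sample turns up), and each number it returns is an estimate. The population is the 14 grades from the earlier worked example; the seed and the number of repeated samples are arbitrary choices for illustration.

```python
import random
from statistics import mean

population = [73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70]


def sample_mean(values):
    """The estimator: a rule that turns any sample into a number."""
    return mean(values)


rng = random.Random(1003)  # arbitrary seed so the run is repeatable

estimates = []
for _ in range(5):
    # Simple random sample of n = 5 students, without replacement.
    sample = rng.sample(population, k=5)
    # The observed value for this particular sample is an estimate.
    estimates.append(sample_mean(sample))

print(estimates)
```

The same rule applied to different samples generally gives different values, which is why the yet-to-be-observed sample mean is treated as a random quantity.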

Data collection methods

High-quality data collection is the foundation of good statistical analysis.

Garbage in, garbage out (GIGO): the quality of the output is determined by the quality of the input.

Data collection methods include:

Experiments: Manipulating variables (often referred to as treatments) to observe effects, e.g. in a clinical study, different types of blood pressure medication tablets can be assigned to patients.
Observational studies: Recording information without intervention, which includes surveys (questionnaires or interviews).

Causal inference

Comparative experiments provide stronger evidence of causality.
Data in observational studies are generally only sufficient to show association but not causation.

Suppose a study tracked sunscreen use and skin cancer, and it was found that the more sunscreen someone used, the more likely the person was to have skin cancer. Does this mean sunscreen causes skin cancer?

There may be a confounding variable (e.g. sun exposure) that is correlated with both the explanatory and the response variable.
To make causal conclusions, one has to account for all confounding variables.
There is no guarantee that all confounding variables can be examined or measured.
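A small simulation can show how a confounder produces an association without causation. This is only a sketch under made-up assumptions: sun exposure is taken as the confounder, all probabilities are invented for illustration, and sunscreen is given no effect on skin cancer at all in the simulated world.

```python
import random

rng = random.Random(2026)  # arbitrary seed for repeatability
n = 50_000

records = []
for _ in range(n):
    high_sun = rng.random() < 0.5                          # confounder z
    sunscreen = rng.random() < (0.8 if high_sun else 0.2)  # explanatory x
    # Outcome y depends on sun exposure only, never on sunscreen.
    cancer = rng.random() < (0.30 if high_sun else 0.05)
    records.append((high_sun, sunscreen, cancer))


def cancer_rate(rows):
    """Proportion of records in `rows` with skin cancer."""
    rows = list(rows)
    return sum(c for _, _, c in rows) / len(rows)


# Ignoring the confounder: compare sunscreen users with non-users.
naive_gap = cancer_rate(r for r in records if r[1]) - cancer_rate(
    r for r in records if not r[1]
)

# Accounting for the confounder: compare within each sun-exposure level.
gaps_by_sun = {}
for level in (True, False):
    stratum = [r for r in records if r[0] == level]
    gaps_by_sun[level] = cancer_rate(r for r in stratum if r[1]) - cancer_rate(
        r for r in stratum if not r[1]
    )

print(f"gap ignoring sun exposure: {naive_gap:.3f}")
for level, gap in gaps_by_sun.items():
    print(f"gap when high_sun={level}: {gap:.3f}")
```

Under these assumptions, sunscreen users show a clearly higher cancer rate overall, yet within each level of sun exposure the gap should be close to zero. Real data are harder: the confounder may be unmeasured or only partly measured.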

Summary

Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data.
Two paradigms: confirmatory vs exploratory data analysis.
Holistic thinking: Statistics requires more than technical proficiency.
Statistics provides methodologies for making inferences about populations based on sample data.
GIGO: good study design (i.e. data collection method) is important for making inferences.
Before making inferences, identify the key variables and their measurement type.
Association \(\neq\) Causation: You cannot infer causality from observational studies alone.