Basics of Statistics

What is statistics?


Statistics is defined as the science and technology of obtaining useful information from data, taking its variability into account.

– The Rousseeuw Prize for statistics


  • Statistics involves:
    • designing the collection of data,
    • organizing data,
    • analyzing data,
    • developing methods,
    • interpreting the results, and
    • communicating results.

Why study statistics?


The best thing about being a statistician is that you get to play in everyone’s backyard.

John Tukey


  • In a data-rich world, statistical literacy is essential for everyone.
  • Statistics is essential for making sense of information in biology, medicine, physics, social sciences, finance, business, and numerous other fields.
  • Statistical literacy enables us to think critically and make evidence-based decisions.

Newspaper articles from The Australian in January 2026

Exploratory vs Confirmatory data analysis


Starting point

❓ I have a question

I have a dataset

🤔 I have this question

What does the data tell me about my question?

🕵️🕵️‍♀️

Become a data detective

Technical proficiency (understanding statistical methods and being skilled with statistical software for extracting and analyzing data) alone isn’t enough in practice. Think holistically.

  • Curiosity: Naturally inquisitive and eager to explore the “why” behind data anomalies or trends.
  • Problem-solving skills: Resourceful and persistent in finding solutions and overcoming data challenges.
  • Attention to detail: Notices subtle patterns, inconsistencies, or outliers others might miss.
  • Critical thinking: Evaluates information objectively, questioning assumptions and sources with a healthy dose of skepticism.
  • Communication abilities: Clearly conveys insights and explanations to technical and non-technical audiences.
  • Ethical judgment: Handles data responsibly and respects privacy and security considerations.
  • Collaboration: Works well with colleagues from different domains.
  • Project management: Organizes work efficiently, sets goals, and meets deadlines during investigations.

Specify the population and scope

How hard is a first year statistics course?

  • Common pitfalls in developing (research) questions are:
    • Questions are too broad
    • Variables are not measurable
    • Data collection to answer the question is not feasible
  • Good (research) questions are specific
  • Clarify the population and scope of interest: Who or What? Where? When?
  • Refine your question
  • How hard is STAT1003 at ANU for a typical undergraduate student as measured by, say:
    • the average final grade earned by students in STAT1003,
    • the percentage of students who fail or withdraw from STAT1003, and/or
    • the grade distribution in STAT1003 compared to other elective courses?

Identify the key variables

  • The ability to answer a question depends on the variables measured.
  • First step: identify the key variables in your question or data.

Types of variables include:

  • Outcome / response / dependent variable:
    • what you want to explain, predict or understand
    • often denoted mathematically as \(y\)
  • Explanatory / predictor / covariate / independent variable:
    • the variable thought to influence or explain the outcome
    • often denoted mathematically as \(x\)
  • Confounding variable:
    • related to both the explanatory and outcome variables
    • often denoted mathematically as \(z\)



Diagram: directed graph over \(x\), \(y\) and \(z\) with edges x → y, x → z, y → z, z → x, z → y.

Measurement types

  • Second step: consider how the variables are measured.

A variable may be captured as:

  • Categorical (or qualitative) variables:
    • Ordinal variables: ordered categories (e.g. satisfaction ratings)
    • Nominal variables: categories with no clear ordering (e.g. hair color)
  • Numerical (or quantitative) variables:
    • Discrete variables: a countable number of values (e.g. number of students)
    • Continuous variables: any value within a range (e.g. height, weight)

Diagram: Data / Variable splits into Categorical / Qualitative (Nominal, Ordinal) and Numerical / Quantitative (Discrete, Continuous).

  • Complex types, like images or video, are out of scope for this course.

Case study STAT1003 student marks

Subset of marks for STAT1003 students in 2025

  • quiz = quiz score out of 6
  • assignment = assignment score out of 100
  • exam = exam score out of 100
  • week2, week3, …, week12 = tutorial attendance in weeks 2 to 12 (1 = attended, 0 = absent)
quiz  assignment  exam  week2  week3  week4  week5  week6  week7  week8  week9  week10  week11  week12
6.0   60          14    0      1      0      0      0      0      0      0      0       0       0
5.0   75          79    1      1      1      1      1      1      1      1      1       0       0
5.5   90          97    1      1      1      1      1      1      1      1      1       1       1

Always get to know the data first

  • What does each row correspond to?
  • Which variables are outcome variables?
  • Which ones are independent variables or other types of variables?
  • What is the measurement type of each variable?
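As a minimal sketch of getting to know a dataset, the three rows shown in the case study can be typed into R and inspected. The object name `marks` is an assumption made for illustration, and only the first three attendance columns are entered to keep the example short.

```r
# Hypothetical object name; values copied from the three rows shown above
marks <- data.frame(
  quiz       = c(6.0, 5.0, 5.5),
  assignment = c(60, 75, 90),
  exam       = c(14, 79, 97),
  week2      = c(0, 1, 1),
  week3      = c(1, 1, 1),
  week4      = c(0, 1, 1)   # weeks 5 to 12 omitted for brevity
)

nrow(marks)           # number of rows in this extract
names(marks)          # variable names
str(marks)            # how R stores each variable
summary(marks$exam)   # quick numerical summary of one variable
```

The output of `str()` is a starting point for classifying each variable as categorical or numerical; note that the attendance columns are stored as numbers even though they record a two-category outcome.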

Population vs. Sample

Populations have parameters: a parameter is a descriptive measure of a population that is usually unobservable and unknown.


Sample statistics are estimated from sample data and used to make inferences about population parameters.

  • Ideally, we would measure every single unit of interest (e.g. marks of every STAT1003 student).
  • But this is often impractical or unavailable (we only have the 2025 data).
  • Instead, a (representative) sample from the population is used to make inferences about the population.

Mathematical setup

How hard is STAT1003 at ANU for a typical undergraduate student as measured by the average final grade earned by students in STAT1003?

  • Suppose there are \(N\) STAT1003 students since its inception.
  • If \(x_{i'}\) denotes the final grade of the \(i'\)-th student, then the population consists of the values: \[x_{1'}, x_{2'}, \dots, x_{N'}.\]
  • But we only observe data from a sample of \(n\) students.

  • If \(x_i\) denotes the final grade of the \(i\)-th sampled student, then the sample consists of the values: \[x_1, x_2, \dots, x_n.\]

  • Sample size is usually much smaller than population size: \(n \ll N\)

Population vs sample mean

  • Let \(\mu\) denote the population mean (average) final grade of all STAT1003 students. \[\begin{align*} \mu &= \frac{1}{N}(x_{1'} + x_{2'} + \dots + x_{N'}) = \frac{1}{N}\sum_{i'=1}^{N} x_{i'}\\ &= {\tiny \frac{1}{14}(73 + 60 + 54 + 62 + 71 + 68 + 57 + 60 + 72 + 57 + 35 + 53 + 58 + 70)} \approx 60.7\\ \end{align*}\]

  • Let \(\bar{x}\) denote the sample mean (average) final grade of the sampled STAT1003 students. \[\begin{align*} \bar{x} &= \frac{1}{n}(x_1 + x_2 + \dots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i\\ &= {\tiny \frac{1}{5}(54 + 71 + 57 + 70 + 53)} = 61\\ \end{align*}\]

\(\bar{x}\) is used to estimate \(\mu\).
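The two calculations above can be checked in R. This is a small sketch using the 14 grades treated as the population and the 5 sampled grades; the object names `population` and `grades_sample` are assumptions made for illustration.

```r
# The 14 final grades treated as the population in the example above
population <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)
# The 5 sampled final grades
grades_sample <- c(54, 71, 57, 70, 53)

N <- length(population)      # population size
n <- length(grades_sample)   # sample size

mu   <- sum(population) / N      # population mean, same as mean(population)
xbar <- sum(grades_sample) / n   # sample mean, same as mean(grades_sample)

round(mu, 1)   # 60.7
xbar           # 61
```

In this toy example the estimate \(\bar{x} = 61\) is close to \(\mu \approx 60.7\), but a different sample of 5 students would generally give a different value of \(\bar{x}\).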

Mathematical notations and conventions

  • Population parameters are typically denoted by Greek letters, e.g. 

    • Population mean/average: \(\mu\)
    • Population variance: \(\sigma^2\)
  • Population size is often denoted by \(N\).

  • For a very large population, \(N\) is treated as infinite.
  • Recall, we hardly ever know the values of population parameters.
  • Observed sample statistics are typically denoted by lower case Roman letters, e.g.
    • Sample mean/average: \(\bar{x}\)
    • Sample variance: \(s^2\)
  • We often use \(n\) for sample size.
  • In this course, we will use:
    • lower case Roman letters for observed sample statistics (estimates) and
    • upper case Roman letters for yet to be observed sample statistics (estimators).

Data collection methods

  • High-quality data collection is the foundation of good statistical analysis.

Garbage in, garbage out (GIGO): the quality of the output is determined by the quality of the input.

Data collection methods include:

  • Experiments: Manipulating variables (often referred to as treatments) to observe effects, e.g. in a clinical study, different types of blood pressure medication can be assigned to patients.
  • Observational studies: Recording information without intervention, which includes surveys (questionnaires or interviews).

Causal inference

  • Comparative experiments provide stronger evidence of causality.
  • Data in observational studies are generally only sufficient to show association but not causation.

Suppose a study tracked sunscreen use and skin cancer, and it was found that the more sunscreen someone used, the more likely the person was to have skin cancer. Does this mean sunscreen causes skin cancer?

  • There is a confounding variable, sun exposure, that is correlated with both the explanatory variable (sunscreen use) and the response variable (skin cancer).
  • To make causal conclusions, one has to account for all confounding variables.
  • There is no guarantee that all confounding variables can be examined or measured.

Diagram: exposure → sunscreen, exposure → cancer.

Summary

  • Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data.
  • Two paradigms: confirmatory vs exploratory data analysis.
  • Holistic thinking: Statistics requires more than technical proficiency.
  • Statistics provides methods for making inferences about populations based on sample data.
  • GIGO: good study design (i.e. data collection method) is important for making inferences.
  • Before making inferences, identify the key variables and their measurement type.
  • Association \(\neq\) Causation: You cannot infer causality from observational studies alone.

Getting Started with R

What is R?

  • R is a programming language used predominantly for data analysis.
  • RStudio Desktop is an integrated development environment (IDE) that helps you to use R.

  • Visual Studio Code and Positron are other popular IDEs.

Interactively working with R

  • You can use R like a calculator:
    1. \(1 + 1\)
    2. \(\dfrac{6}{2}+ 0.5\)
    3. \((1 - 4) \times 3 - 6^2\)
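The three expressions above can be typed directly into the R console, for example:

```r
1 + 1               # 2
6 / 2 + 0.5         # 3.5
(1 - 4) * 3 - 6^2   # -45
```

R follows the usual order of operations, so the power `6^2` is evaluated before the multiplication and subtraction.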

How do you use R?

  • RStudio Desktop (or RStudio IDE) is the most common way to use R.

Customise Global Options

  • Go to RStudio > Tools > Global Options…
  • Under the General tab, make sure “Restore .RData into workspace at startup” is unticked.
  • This avoids unexpectedly loading (old) data into your workspace, which can make your code work only in your workspace and not for others (bad practice for reproducibility).

Arithmetic

  1. \(\sqrt{3}\)

  2. \(|-3|\)

  3. \(e^1 = e\)

  4. \(\log_e (4) = \ln (4)\)

  5. \(1 + 2 + 3 = \displaystyle\sum_{i = 1}^3 i\)
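A sketch of how each of the five expressions above could be computed with functions that ship with R:

```r
sqrt(3)    # square root of 3
abs(-3)    # absolute value: 3
exp(1)     # Euler's number e
log(4)     # natural logarithm (base e) of 4
sum(1:3)   # 1 + 2 + 3 = 6
```

Note that `log()` uses base \(e\) by default; other bases are available through `log10()`, `log2()`, or the `base` argument.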

Functions

  • There are many functions in R.
  • You can look up a function's documentation to see how to use it:
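For example, the documentation for `mean()` could be opened as follows (the choice of `mean()` is only for illustration):

```r
?mean          # open the help page for mean()
help("mean")   # equivalent to the line above
args(mean)     # quick look at the function's arguments
```

The help page typically lists the arguments, describes the returned value, and ends with runnable examples.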

Finding functions

  • To find indexed functions for a package:
  • Google it with a good set of keywords.
  • The recent trend is to ask a large language model.
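Within R itself, one possible way to browse what a package offers is sketched below, using the stats package (which is attached by default) as an example:

```r
help(package = "stats")     # index of documented topics in the stats package
head(ls("package:stats"))   # first few objects exported by stats
apropos("median")           # objects on the search path whose names contain "median"
```

`apropos()` only searches packages that are currently attached, so a function in an installed but unloaded package will not appear in its results.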

Why learn R?

  • R is one of the top programming languages for statistics or data science.
    • Python is also a good alternative language for data science.
    • Better to have a mastery of at least one language rather than none.
  • R was initially developed by statisticians for statisticians.
    • State-of-the-art statistical methods are more readily available in R.
  • R has a very active and friendly community.
  • R is free and open source software (FOSS) and is a cross-platform language:
    • free = money is not a barrier to use it,
    • open source software = transparency,
    • cross-platform = can be used on Windows, Mac, and Linux.

Base R

R attaches 7 packages automatically when you launch it:

  • base,
  • datasets,
  • graphics,
  • grDevices,
  • utils,
  • stats,
  • methods,

collectively referred to as “Base R”.

  • The functions in the base packages are generally well-tested and trustworthy.
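To see which packages are attached in your own session, something like the following can be run in a fresh R session. The exact output depends on your setup, so none is shown here.

```r
getOption("defaultPackages")   # packages attached at startup (base is always attached)
search()                       # everything currently on the search path
```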

Contributed R Packages

  • R packages are community developed extensions to R (much like apps on your mobile).
  • The Comprehensive R Archive Network (CRAN) is a volunteer maintained repository that hosts submitted R packages that are approved (much like an app store).
  • There are close to 20,000 packages available on CRAN, but the quality of R packages varies.
  • There are other repositories that host R packages, e.g. Bioconductor for bioinformatics, R Universe, R-Forge, GitHub (we won’t cover these).

Photo by Sara Kurfeß on Unsplash

Using packages on CRAN

  • If the package (say praise) is on CRAN, you can install it by:
install.packages("praise")
  • You only need to install.packages() once!
  • Loading exported functions from a package:
  • Use package::function() to call a function without loading the package:
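A minimal sketch of the install, load, and `::` workflow with the praise package. It assumes praise may not be installed on your machine, so the calls are guarded; the last line shows the same `::` pattern with a package that ships with R.

```r
# install.packages("praise")   # run once; needs internet access

if (requireNamespace("praise", quietly = TRUE)) {
  library(praise)    # attach the package so its functions can be called by name
  praise()           # call a function from the attached package
  praise::praise()   # same function, called without relying on library()
}

# The same package::function() pattern with a package that ships with R
stats::median(c(3, 1, 2))   # 2
```

Writing `package::function()` also makes it explicit where a function comes from, which helps when two packages export functions with the same name.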

Summary

RStudio Desktop (or RStudio IDE)

Console or Source


  • Use ?function or help(function) to look at the function documentation
  • Use install.packages() to install a package (only once).
  • Use library() to load a package.
  • Use package::function() to use a function from a package without loading it.

RStudio Desktop Cheatsheet

Descriptive and Summary Statistics

R package datasets tips data

  • R comes with many built-in datasets that are helpful for learning and practicing data analysis.
  • Use the data() function to see available datasets and to load them.

The tips data from the GGally package records the tips a waiter received in one restaurant.
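A sketch of how `data()` might be used, first to list the datasets in the built-in datasets package and then to load `tips`. It assumes the GGally package is installed, so that part is guarded; the object name `available` is only for illustration.

```r
# Names of the datasets that ship with the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load the tips data (assumes GGally is installed)
if (requireNamespace("GGally", quietly = TRUE)) {
  data(tips, package = "GGally")
  str(tips)   # variables and how R stores them
}
```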

Initial data analysis

  • When given a dataset, start with exploring the data.
  • The tidyverse package is useful for this purpose (we will discuss this more later).
  • What is the sample size?
  • What is the observational unit?
  • Which of the variables are categorical data? Which ones are numerical data?
  • Classify the categorical variables as ordinal or nominal.

Statistical summary for univariate data

A statistical summary (or descriptive statistics) provides key numerical and graphical measures that concisely describe the main characteristics of a dataset.

Measures of Central Tendency

  • Mean (average)
  • Median
  • Mode

Measures of Dispersion (Spread)

  • Range
  • Variance
  • Standard deviation
  • Interquartile range (IQR)

Tabular Summaries

  • Frequency tables
  • Contingency tables (cross-tabulations)

Graphical Summaries

  • Histograms
  • Boxplots
  • Bar charts
  • Scatterplots
  • Etc

Categorical variables

There are two types of categorical (also called qualitative) data or variables:

  • Nominal
    • no ordering or relationship
    • e.g. marital status, eye color, job, degree, race
  • Ordinal
    • have a distinct ordering
    • e.g.,
      • ranking teacher as “poor/fair/good”,
      • survey answer “strongly disagree/disagree/agree/strongly agree”

Diagram: Categorical splits into Nominal and Ordinal.

Numerical variables can be transformed to or captured as ordinal variables, e.g.

  • income brackets: [0, 1000), [1000, 2000), [2000, 3000), 3000+,
  • age ranges: [0, 18], (18, 30], (30, 50), [50, 75), 75+.
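One way to turn a numerical variable into an ordinal one in R is `cut()`. The sketch below uses made-up incomes and the income brackets listed above; the object names are assumptions made for illustration.

```r
# Hypothetical incomes, for illustration only
income <- c(250, 1000, 1999, 2500, 4200)

income_bracket <- cut(
  income,
  breaks = c(0, 1000, 2000, 3000, Inf),
  right = FALSE,           # intervals are closed on the left: [a, b)
  labels = c("[0, 1000)", "[1000, 2000)", "[2000, 3000)", "3000+"),
  ordered_result = TRUE    # record that the brackets have an order
)

income_bracket
table(income_bracket)      # counts per bracket: 1, 2, 1, 1
```

Setting `right = FALSE` matters at the boundaries: an income of exactly 1000 falls in [1000, 2000) rather than in the bracket below.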

Numerical summary for a categorical variable

Useful numerical summaries include:

  • Frequency (counts) of each category
  • Relative frequency (proportion or percentage) of each category
  • Mode: the most frequently occurring category
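These three summaries can be sketched in base R with `table()` and `prop.table()`. The vector below is made up for illustration and is not the tips data.

```r
# Hypothetical categorical observations, for illustration only
day <- c("Sat", "Sun", "Sat", "Thur", "Fri", "Sat", "Sun")

counts   <- table(day)                            # frequency of each category
props    <- prop.table(counts)                    # relative frequency (proportions)
mode_day <- names(counts)[counts == max(counts)]  # most frequent category (or categories)

counts
round(props, 2)
mode_day   # "Sat"
```

Base R has no built-in function for the statistical mode (`mode()` reports how an object is stored), which is why it is computed from the frequency table here.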

Graphical summary for a categorical variable

  • We use the ggplot2 package for all data visualisation (taught in more detail later).

Bar charts / bar plots

Pie charts (avoid using)

Nominal vs. Ordinal variables

  • We can use exactly the same statistics for ordinal data we used for nominal data, e.g., frequency tables, bar charts, pie charts, etc.
  • For ordinal data, preserve the order of the categories.
  • For nominal data, reorder the categories based on another variable (if appropriate).

Plot 1

Code
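library(tidyverse)  # provides ggplot() and mutate(); tips is assumed to be loaded
# Horizontal bar chart of the number of tips recorded on each day
ggplot(tips) +
  geom_bar(aes(y = day))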

Plot 2

Code
# Reorder the days by how often each one occurs before plotting
tips |> 
  mutate(day = reorder(day, day, length)) |> 
  ggplot() +
    geom_bar(aes(y = day))

Plot 3

Code
# Reversed so that Thur, the last factor level, is drawn at the top of the y-axis
day_order <- rev(c("Thur", "Fri", "Sat", "Sun"))
tips |> 
  mutate(day = factor(day, levels = day_order)) |> 
  ggplot() +
    geom_bar(aes(y = day))

How do these plots differ?

Numerical variables

There are two main types of numerical data:

  • Continuous
    • measured in infinitely small increments
    • e.g. height, weight, portfolio returns, and stock prices
  • Discrete
    • measured in fixed increments
    • e.g. number of cars you own, and number of heads in three coin flips

Diagram: Numerical splits into Continuous and Discrete.

Some variables are continuous, but measured in a discrete manner, e.g. age (in years).

Graphical summary for a numerical variable

For discrete data, we can use a barplot to visualise the distribution.

For continuous data, we can use a histogram to visualise the distribution.

Histogram

The number of bins affects the appearance of a histogram, so try different values to see how the plot changes.

A measure of central tendency

A measure of central tendency describes the location of the “middle”, “center”, or “expected value” of the distribution of your data.

  • The sample mean (or average) and the sample median are examples of measures of central tendency.

  • What is the average customer tip?

Sample mean and median

The sample mean or average is:

\[\bar{x} = \frac{1}{n}(x_1+x_2 +\dots + x_n) = \frac{1}{n}\sum_{i=1}^nx_i.\]

The sample median is:

  • the middle value of the sorted observations when \(n\) is odd, and
  • average of the two middle sorted observations when \(n\) is even.

Sample data: \[54, 71, 57, 70, 53\]

The (sample) mean is \[(54 + 71 + 57 + 70 + 53)/5 = 61.\]

Sorted sample data: \[53, 54, 57, 70, 71\]

So (sample) median is \(57\).

  • The mean is commonly used.
  • The median, however, is more robust to extreme observations (outliers).
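The worked example above, and the robustness of the median, can be checked in R. The extreme value 710 in the second vector is made up purely to illustrate the effect of an outlier.

```r
x <- c(54, 71, 57, 70, 53)   # sample data from above

mean(x)     # 61
median(x)   # 57

# Replace 71 with a made-up extreme value
x_outlier <- c(54, 710, 57, 70, 53)

mean(x_outlier)     # 188.8, pulled far upwards by the single extreme value
median(x_outlier)   # 57, unchanged
```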

Skewness

  • Skewness is a measure of asymmetry in a given distribution.


Symmetric


Mean \(\approx\) Median

Positively skewed or
Right skewed

Mean > Median

Negatively skewed or
Left skewed

Mean < Median

Modality

The sample mode is the value with the highest frequency.

  • Mode is useful for categorical data.
  • For numerical data, mode is less useful as there may be no repeated values.
  • However, we can look at the modality of a distribution: number of peaks in the distribution.

Unimodal distribution

Bimodal distribution

Multimodal distribution

Quantiles

A \(p\)-quantile (where \(0 < p < 1\)) is the value below which a proportion \(p\) of your data lie.

  • Note: quantiles do not need to be data values.
  • Quartiles are special quantiles that divide the data into four equal parts:
    • First quartile (\(Q_1\)) or lower quartile is the 0.25 quantile
    • Second quartile (\(Q_2\)) or median is the 0.50 quantile
    • Third quartile (\(Q_3\)) or upper quartile is the 0.75 quantile
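As a sketch, the quartiles of the five sampled grades used elsewhere in these notes can be computed with `quantile()`. R supports several quantile definitions through its `type` argument; the values below use the default, so other software may report slightly different numbers.

```r
x <- c(54, 71, 57, 70, 53)

quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1 = 54, median = 57, Q3 = 70
quantile(x, probs = 0.9)                   # 70.6, not one of the observed values
```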

A measure of dispersion

A measure of dispersion/spread is a number representing the spread of data around a measure of central tendency.

  • E.g. range, interquartile range (IQR), variance, standard deviation.

Measures of dispersion

  • Sample deviation: the difference between an observation and the sample mean, \(x_i-\bar{x}\)
  • Sample variance: \[s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2.\]
  • Sample standard deviation: \(s = \sqrt{s^2}\), the square root of the sample variance
    • Conveys similar information to the variance, but in the same units as the data
  • The range is the difference between the maximum and minimum values in the dataset.
  • The interquartile range (IQR) is the difference between the third quartile and the first quartile (\(Q_3 - Q_1\)).

Population variance: \[\sigma^2 = \frac{1}{N}\sum_{i'=1}^{N} (x_{i'} - \mu)^2.\]
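A sketch of these measures in R, using the five sampled grades and the 14 grades treated as the population earlier in these notes. The object names are assumptions made for illustration.

```r
x <- c(54, 71, 57, 70, 53)   # sampled grades

x - mean(x)      # deviations: -7 10 -4 9 -8
var(x)           # sample variance (divides by n - 1): 77.5
sd(x)            # sample standard deviation: sqrt(77.5), about 8.8
diff(range(x))   # range: 71 - 53 = 18
IQR(x)           # interquartile range: 70 - 54 = 16

# var() always divides by n - 1, so the population variance
# (which divides by N) is computed by hand
population <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)
sigma2 <- mean((population - mean(population))^2)
sigma2
```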

Boxplots

L = \(Q_1 - 1.5 \times IQR\)
U = \(Q_3 + 1.5 \times IQR\)

Observations below L or above U are usually drawn as individual points (potential outliers).

  • Boxplots do not work well for small datasets and certainly not for \(n < 5\).
  • Boxplots are poor at showing multimodal distributions.
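The limits L and U can be computed by hand. The sketch below uses the 14 final grades listed in the population mean example and R's default quantile definition; plotting functions may compute the quartiles slightly differently, so the limits drawn on a boxplot can differ a little.

```r
grades <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)

q1  <- quantile(grades, 0.25, names = FALSE)   # 57
q3  <- quantile(grades, 0.75, names = FALSE)   # 69.5
iqr <- q3 - q1                                 # 12.5

lower <- q1 - 1.5 * iqr   # L = 38.25
upper <- q3 + 1.5 * iqr   # U = 88.25

# Observations outside [L, U]
grades[grades < lower | grades > upper]   # 35
```

Here the grade of 35 falls below L, so it would be flagged as a potential outlier.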

Case study STAT1003 mark distribution

How hard is STAT1003 at ANU for a typical undergraduate student?

Here is a sample of assignment and quiz marks:

Five-number summary: (55, 80, 88, 93, 100)

Mode: 6

  • Note: the five-number summary is (minimum, \(Q_1\), median, \(Q_3\), maximum).
  • What do you think based on the distribution of marks for assignment and quiz?

Summary

  • Summary statistics describe main characteristics of the data
  • Frequency table
  • Mode
  • Barplot
  • Skewness
  • Modality
  • Quantiles
  • A measure of central tendency: mean and median
  • A measure of dispersion: range, IQR, variance and standard deviation
  • Histogram
  • Boxplot