Basic Statistical Concepts and Programming I

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University

Basics of Statistics

What is statistics?

Statistics is defined as the science and technology of obtaining useful information from data, taking its variability into account.

– The Rousseeuw Prize for statistics

Statistics involves:
- designing the collection of data,
- organizing data,
- analyzing data,
- developing methods,
- interpreting the results, and
- communicating results.

Why study statistics?

The best thing about being a statistician is that you get to play in everyone’s backyard.

– John Tukey

In a data rich world, statistical literacy is essential for everyone.
Statistics is essential for making sense of information in biology, medicine, physics, the social sciences, finance, business, and numerous other fields.
Statistical literacy enables us to think critically and make evidence-based decisions.

Newspaper articles from The Australian in January 2026

Exploratory vs Confirmatory data analysis

Starting point

❓ I have a question

I have a dataset

🤔 I have this question

What does the data tell me about my question?

🕵️🕵️‍♀️

Become a data detective

Technical proficiency (understanding statistical methods and being skilled with statistical software for extracting and analyzing data) alone isn’t enough in practice. Think holistically.

Curiosity: Naturally inquisitive and eager to explore the “why” behind data anomalies or trends.
Problem-solving skills: Resourceful and persistent in finding solutions and overcoming data challenges.
Attention to detail: Notices subtle patterns, inconsistencies, or outliers others might miss.
Critical thinking: Evaluates information objectively, questions assumptions and sources, and maintains a healthy dose of skepticism.
Communication abilities: Clearly conveys insights and explanations to technical and non-technical audiences.
Ethical judgment: Handles data responsibly and respects privacy and security considerations.
Collaboration: Works well with colleagues from different domains.
Project management: Organizes work efficiently, sets goals, and meets deadlines during investigations.

Specify the population and scope

How hard is a first year statistics course?

Common pitfalls in developing (research) questions are:
- Questions are too broad
- Variables are not measurable
- Data collection to answer the question is not feasible
Good (research) questions are specific.
Clarify the population and scope of interest: Who or What? Where? When?

Refine your question

How hard is STAT1003 at ANU for a typical undergraduate student as measured by, say:
- the average final grade earned by students in STAT1003,
- the percentage of students who fail or withdraw from STAT1003, and/or
- the grade distribution in STAT1003 compared to other elective courses?

Identify the key variables

The ability to answer a question depends on the variables measured.
First step: identify the key variables in your question or data.

Types of variables include:

Outcome / response / dependent variable:
- what you want to explain, predict or understand
- often denoted mathematically as \(y\)
Explanatory / predictor / covariate / independent variable:
- the variable thought to influence or explain the outcome
- often denoted mathematically as \(x\)
Confounding variable:
- related to both the explanatory and outcome variables
- often denoted mathematically as \(z\)

Measurement types

Second step: consider how the variables are measured.

Data (or variables) may be captured as:

Categorical (or qualitative) variables:
- Ordinal variables: ordered categories (e.g. satisfaction ratings)
- Nominal variables: categories with no clear ordering (e.g. hair color)
Numerical (or quantitative) variables:
- Discrete variables: a countable number of values (e.g. number of students)
- Continuous variables: any value within a range (e.g. height, weight)
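
As a rough sketch, the measurement types above could be stored in R as follows. All values are made up for illustration.

```r
# Nominal: categories with no ordering (hypothetical hair colours)
hair <- factor(c("brown", "black", "blonde", "brown"))

# Ordinal: ordered categories (hypothetical satisfaction ratings)
satisfaction <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

# Discrete: counts stored as integers (hypothetical class sizes)
n_students <- c(120L, 85L, 240L)

# Continuous: measurements stored as doubles (hypothetical heights in cm)
height <- c(172.4, 165.0, 180.2)

is.ordered(hair)          # FALSE
is.ordered(satisfaction)  # TRUE
satisfaction < "high"     # comparisons are meaningful for ordered factors
```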

Complex types, like images or video, are out of scope for this course.

Case study STAT1003 student marks

Subset of marks for STAT1003 students in 2025

quiz = quiz score out of 6
assignment = assignment score out of 100
exam = exam score out of 100
week2, week3, …, week12 = tutorial attendance in weeks 2 to 12 (1 = attended, 0 = absent)

quiz	assignment	exam	week2	week3	week4	week5	week6	week7	week8	week9	week10	week11	week12
6.0	60	14	0	1	0	0	0	0	0	0	0	0	0
5.0	75	79	1	1	1	1	1	1	1	1	1	0	0
5.5	90	97	1	1	1	1	1	1	1	1	1	1	1

Always get to know the data first

What does each row correspond to?
Which variables are outcome variables?
Which ones are independent variables or other types of variables?
What is the measurement type of each variable?
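
A minimal sketch of a first look at the data, assuming the three rows shown in the table are typed in by hand (the name `marks` is only illustrative):

```r
# The three rows shown in the table
marks <- data.frame(
  quiz = c(6.0, 5.0, 5.5),
  assignment = c(60, 75, 90),
  exam = c(14, 79, 97),
  week2 = c(0, 1, 1), week3 = c(1, 1, 1), week4 = c(0, 1, 1),
  week5 = c(0, 1, 1), week6 = c(0, 1, 1), week7 = c(0, 1, 1),
  week8 = c(0, 1, 1), week9 = c(0, 1, 1), week10 = c(0, 1, 1),
  week11 = c(0, 0, 1), week12 = c(0, 0, 1)
)

dim(marks)   # number of rows and columns
str(marks)   # variable names and storage types

# Number of tutorials attended in each row
attendance <- rowSums(marks[, paste0("week", 2:12)])
attendance
```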

Population vs. Sample

Populations have parameters: a parameter is a descriptive measure of a population that is usually unobservable and unknown.

Sample statistics are computed from sample data and used to make inferences about population parameters.

Ideally, we would measure every single unit of interest (e.g. marks of every STAT1003 student).
But this is often impractical or unavailable (we only have the 2025 data).
Instead, a (representative) sample from the population is used to make inferences about the population.

Mathematical setup

How hard is STAT1003 at ANU for a typical undergraduate student as measured by the average final grade earned by students in STAT1003?

Suppose there are \(N\) STAT1003 students since its inception.
If \(x_{i'}\) denotes the final grade of the \(i'\)-th student, then the population consists of the values: \[x_{1'}, x_{2'}, \dots, x_{N'}.\]

But we only observe data from a sample of \(n\) students.
If \(x_i\) denotes the final grade of the \(i\)-th sampled student, then the sample consists of the values: \[x_1, x_2, \dots, x_n.\]
The sample size is usually much smaller than the population size: \(n \ll N\).

Population vs sample mean

Let \(\mu\) denote the population mean (average) final grade of all STAT1003 students. \[\begin{align*} \mu &= \frac{1}{N}(x_{1'} + x_{2'} + \dots + x_{N'}) = \frac{1}{N}\sum_{i'=1}^{N} x_{i'}\\ &= {\tiny \frac{1}{14}(73 + 60 + 54 + 62 + 71 + 68 + 57 + 60 + 72 + 57 + 35 + 53 + 58 + 70)} \approx 60.7\\ \end{align*}\]
Let \(\bar{x}\) denote the sample mean (average) final grade of the sampled STAT1003 students. \[\begin{align*} \bar{x} &= \frac{1}{n}(x_1 + x_2 + \dots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i\\ &= {\tiny \frac{1}{5}(54 + 71 + 57 + 70 + 53)} = 61\\ \end{align*}\]

\(\bar{x}\) is used to estimate \(\mu\).
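
The two means can be checked in R. This is a small sketch using the grades shown in the formulas; the variable names and the seed are arbitrary choices.

```r
# Final grades of the 14 students treated as the population
population <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)
mu <- mean(population)
round(mu, 1)   # 60.7

# The 5 sampled grades
x <- c(54, 71, 57, 70, 53)
x_bar <- mean(x)
x_bar          # 61

# A different random sample of size 5 generally gives a different estimate
set.seed(1003)  # arbitrary seed, only for reproducibility
mean(sample(population, size = 5))
```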

Mathematical notations and conventions

Population parameters are typically denoted by Greek letters, e.g.
- Population mean/average: \(\mu\)
- Population variance: \(\sigma^2\)
Population size is often denoted by \(N\).

For a very large population, \(N\) is treated as infinite.
Recall, we hardly ever know the values of population parameters.

Observed sample statistics are typically denoted by lower case Roman letters, e.g.
- Sample mean/average: \(\bar{x}\)
- Sample variance: \(s^2\)
We often use \(n\) for sample size.

In this course, we will use:
- lower case Roman letters for observed sample statistics (estimates) and
- upper case Roman letters for yet to be observed sample statistics (estimators).

Data collection methods

High-quality data collection is the foundation of good statistical analysis.

Garbage in, garbage out (GIGO): the quality of the output is determined by the quality of the input.

Data collection methods include:

Experiments: Manipulating variables (often referred to as treatments) to observe effects, e.g. in a clinical study, different types of blood pressure medication can be assigned to patients.
Observational studies: Recording information without intervention, which includes surveys (questionnaires or interviews).

Causal inference

Comparative experiments allow stronger evidence to demonstrate causality.
Data in observational studies are generally only sufficient to show association but not causation.

Suppose a study tracked sunscreen use and skin cancer, and it was found that the more sunscreen someone used, the more likely the person was to have skin cancer. Does this mean sunscreen causes skin cancer?

Not necessarily. There may exist a confounding variable (e.g. time spent in the sun) that is correlated with both the explanatory and response variables.
To make causal conclusions, one has to account for all confounding variables.
There is no guarantee that all confounding variables can be examined or measured.
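
A simulated sketch of confounding, not real data. It assumes a hypothetical confounder `sun` that drives both `sunscreen` and `damage`, while sunscreen itself has no effect at all. The linear model `lm()` is used here only as one way to adjust for the confounder.

```r
set.seed(2026)  # arbitrary seed for reproducibility
n <- 1000

# Hypothetical confounder: sun exposure (standardised units)
sun <- rnorm(n)

# Sunscreen use increases with sun exposure
sunscreen <- sun + rnorm(n)

# Skin damage depends on sun exposure only, not on sunscreen
damage <- sun + rnorm(n)

# Marginal association: positive, even though sunscreen has no effect
cor_marginal <- cor(sunscreen, damage)

# After adjusting for the confounder, the sunscreen coefficient is near 0
fit <- lm(damage ~ sunscreen + sun)
coef_sunscreen <- unname(coef(fit)["sunscreen"])

cor_marginal
coef_sunscreen
```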

Summary

Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data.
Two paradigms: confirmatory vs exploratory data analysis.
Holistic thinking: Statistics requires more than technical proficiency.
It provides methodologies for making inferences about populations based on sample data.
GIGO: good study design (i.e. data collection method) is important for making inferences.
Before making inferences, identify the key variables and their measurement type.
Association \(\neq\) Causation: You cannot infer causality from observational studies alone.

Getting Started with R

What is R?

R is a programming language predominantly for data analysis.

RStudio Desktop is an integrated development environment (IDE) that helps you to use R.

Visual Studio Code and Positron are other popular IDEs.

Interactively working with R

You can use R like a calculator:
1. \(1 + 1\)
2. \(\dfrac{6}{2}+ 0.5\)
3. \((1 - 4) \times 3 - 6^2\)
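
Typed into the R console, the three calculations above would look like this:

```r
1 + 1               # 2
6 / 2 + 0.5         # 3.5
(1 - 4) * 3 - 6^2   # -45
```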

How do you use R?

RStudio Desktop (or RStudio IDE) is the most common way to use R.

Customise Global Options

Go to RStudio > Tools > Global Options…
Under the General tab, make sure the “Restore .RData into workspace at startup” is unticked.
This avoids unexpectedly loading (old) data into your workspace, which can make your code work only on your machine and not for others (poor practice for reproducibility).

Arithmetics

\(\sqrt{3}\)
\(|-3|\)
\(e^1 = e\)
\(\log_e (4) = \ln (4)\)
\(1 + 2 + 3 = \displaystyle\sum_{i = 1}^3 i\)
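
A short sketch of how these expressions can be written with base R functions:

```r
sqrt(3)    # square root of 3
abs(-3)    # absolute value, 3
exp(1)     # Euler's number e
log(4)     # natural logarithm (base e) of 4
sum(1:3)   # 1 + 2 + 3 = 6
```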

Functions

There are many functions in R.
You can look at a function's documentation to see how to use it:
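
For example, using `mean()` and `sd()` as arbitrary illustrations:

```r
# Open the help page for mean() (two equivalent ways)
?mean
help("mean")

# Show the arguments of a function without opening its help page
args(sd)
```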

Finding functions

To find indexed functions for a package:
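
One option is the package help index. A small sketch using the stats package, chosen only because it ships with R:

```r
# Show the help index of a package
help(package = "stats")

# List the objects exported by an attached package
stats_objects <- ls("package:stats")
head(stats_objects)
```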

Google it with a good set of keywords.
A recent trend is to ask a large language model.

Why learn R?

R is one of the top programming languages for statistics or data science.
- Python is also a good alternative language for data science.
- Better to have a mastery of at least one language rather than none.
R was initially developed by statisticians for statisticians.
- State-of-the-art statistical methods are more readily available in R.
R has a very active and friendly community.
R is a free and open source software (FOSS) and is a cross-platform language:
- free = money is not a barrier to use it,
- open source software = transparency,
- cross-platform = can be used on Windows, Mac, and Linux.

Base R

R has 7 packages:

- base,
- datasets,
- graphics,
- grDevices,
- utils,
- stats, and
- methods,

collectively referred to as “Base R”, that are loaded automatically when you launch R.
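
To see which packages are currently attached in your own session, one option is `search()`:

```r
# Packages (and environments) currently on the search path
attached <- search()
attached

# Check that the stats package is attached
"package:stats" %in% attached   # TRUE in a default R session
```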

The functions in the base packages are generally well-tested and trustworthy.

Contributed R Packages

R packages are community developed extensions to R (much like apps on your mobile).
The Comprehensive R Archive Network (CRAN) is a volunteer-maintained repository that hosts submitted R packages that are approved (much like an app store).
There are close to 20,000 packages available on CRAN, but the quality of R packages varies.
There are other repositories that host R packages, e.g. Bioconductor for bioinformatics, R Universe, R-Forge, GitHub (we won’t cover these).

Using packages on CRAN

If the package (say praise) is on CRAN, you can install it by:

install.packages("praise")

You only need to run install.packages() once!

Loading exported functions from a package:

Use package::function() to call a function without loading the package:
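
A sketch of both approaches. It uses the tools package as a stand-in because it ships with R; with the praise package installed, the same pattern would be `library(praise)` followed by `praise()`, or `praise::praise()`.

```r
# Option 1: load the package, then call its functions directly
library(tools)
file_ext("marks.csv")            # "csv"

# Option 2: call one function without loading the package
tools::file_ext("slides.html")   # "html"
```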

Summary

RStudio Desktop (or RStudio IDE)

Console or Source

Use ?function or help(function) to look at the function documentation.
Use install.packages() to install a package (only once).
Use library() to load a package.
Use package::function() to use a function from a package without loading it.

RStudio Desktop Cheatsheet

Descriptive and Summary Statistics

R package datasets `tips` data

R comes with many built-in datasets that are helpful for learning and practicing data analysis.
Use the data() function to see available datasets and to load them.
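
For instance, with the built-in `faithful` data (an arbitrary choice). A dataset from a contributed package can be loaded in the same way, e.g. `data(tips, package = "GGally")`, assuming that package is installed.

```r
data(package = "datasets")   # list the datasets that ship with R
data("faithful")             # load one of them into the workspace
head(faithful)               # first six rows
nrow(faithful)               # 272 observations
```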

The tips data from the GGally package records the tips a waiter received at one restaurant.

Initial data analysis

When given a dataset, start with exploring the data.
The tidyverse package is useful for this purpose (we will discuss this more later).

What is the sample size?
What is the observational unit?
Which of the variables are categorical data? Which ones are numerical data?
Classify the categorical variables as ordinal or nominal.
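
A base R sketch of these checks. It uses the built-in `warpbreaks` data as a stand-in, so that it runs without extra packages; the same calls apply to `tips`.

```r
data("warpbreaks")           # built-in stand-in dataset

nrow(warpbreaks)             # sample size: 54
str(warpbreaks)              # storage type of each variable
sapply(warpbreaks, class)    # numeric vs factor (categorical) variables
levels(warpbreaks$tension)   # "L" "M" "H": low, medium, high tension
```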

Statistical summary for univariate data

A statistical summary (or descriptive statistics) provides key numerical and graphical measures that concisely describe the main characteristics of a dataset.

Measures of Central Tendency

Mean (average)
Median
Mode

Measures of Dispersion (Spread)

Range
Variance
Standard deviation
Interquartile range (IQR)

Tabular Summaries

Frequency tables
Contingency tables (cross-tabulations)

Graphical Summaries

Histograms
Boxplots
Bar charts
Scatterplots
Etc

Categorical variables

There are two types of categorical data (or variables), also referred to as qualitative data:

Nominal
- no ordering or relationship
- e.g. marital status, eye color, job, degree, race
Ordinal
- have a distinct ordering
- e.g.,
  - rating a teacher as “poor/fair/good”,
  - survey answer “strongly disagree/disagree/agree/strongly agree”

Numerical variables can be transformed to or captured as ordinal variables, e.g.

income brackets: [0, 1000), [1000, 2000), [2000, 3000), 3000+,
age ranges: [0, 18], (18, 30], (30, 50), [50, 75), 75+.
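
One way to do this in R is `cut()`. The sketch below bins some made-up incomes into the brackets above; the labels are set by hand to match the notation used here.

```r
income <- c(500, 1000, 2500, 4000)   # hypothetical incomes

bracket <- cut(
  income,
  breaks = c(0, 1000, 2000, 3000, Inf),
  labels = c("[0, 1000)", "[1000, 2000)", "[2000, 3000)", "3000+"),
  right = FALSE,            # intervals are closed on the left: [a, b)
  ordered_result = TRUE     # keep the natural ordering of the brackets
)

bracket
is.ordered(bracket)   # TRUE
```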

Numerical summary for a categorical variable

Some useful numerical summaries include:

Frequency (counts) of each category
Relative frequency (proportion or percentage) of each category
Mode: the most frequently occurring category
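
These summaries can be sketched in base R with `table()` and `prop.table()`. The `day` values below are made up for illustration.

```r
day <- c("Sat", "Sun", "Sat", "Thur", "Fri", "Sat", "Sun")  # hypothetical

freq <- table(day)                         # frequency of each category
rel_freq <- prop.table(freq)               # relative frequency (proportions)
mode_day <- names(freq)[which.max(freq)]   # most frequent category

freq
round(rel_freq, 2)
mode_day                                   # "Sat"
```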

Graphical summary for a categorical variable

We use the ggplot2 package for all data visualisation (taught in more detail later).

Bar charts / bar plots

Pie charts (avoid using)

Nominal vs. Ordinal variables

We can use exactly the same statistics for ordinal data we used for nominal data, e.g., frequency tables, bar charts, pie charts, etc.

For ordinal data, preserve the order of the categories.
For nominal data, reorder the categories based on another variable (if appropriate).

Plot 1

Code

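library(tidyverse)              # loads ggplot2 and dplyr
data(tips, package = "GGally")  # assumes the GGally package is installed
ggplot(tips) +
  geom_bar(aes(y = day))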

Plot 2

Code

# order the days by how often they occur (assumes dplyr and ggplot2 are loaded)
tips |> 
  mutate(day = reorder(day, day, length)) |> 
  ggplot() +
    geom_bar(aes(y = day))

Plot 3

Code

# weekday order, reversed because the first factor level is drawn at the bottom of the y-axis
day_order <- rev(c("Thur", "Fri", "Sat", "Sun"))
tips |> 
  mutate(day = factor(day, levels = day_order)) |> 
  ggplot() +
    geom_bar(aes(y = day))

How do these plots differ?

Numerical variables

There are two main types of numerical data:

Continuous
- measured in infinitely small increments
- e.g. height, weight, portfolio returns, and stock prices
Discrete
- measured in fixed increments
- e.g. number of cars you own, and number of heads in three coin flips

Some variables are continuous, but measured in a discrete manner, e.g. age (in years).

Graphical summary for numerical variable

For discrete data, we can use a barplot to visualise the distribution.

For continuous data, we can use a histogram to visualise the distribution.

Histogram

The bin width (and hence the number of bins) affects the appearance of a histogram, so explore different values to see how the plot changes.
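
A base R sketch of this effect, using the built-in `faithful` data rather than the data shown on the slide. In ggplot2, the `binwidth` argument of `geom_histogram()` plays the same role.

```r
x <- faithful$eruptions   # built-in data: eruption durations in minutes

# Same data, two bin widths (set through the break points)
h_wide   <- hist(x, breaks = seq(1.5, 5.5, by = 0.5), main = "Bin width 0.5")
h_narrow <- hist(x, breaks = seq(1.5, 5.5, by = 0.1), main = "Bin width 0.1")

length(h_wide$counts)     # 8 bins
length(h_narrow$counts)   # 40 bins
```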

A measure of central tendency

A measure of central tendency describes the location of the “middle”, “center”, or “expected value” of the distribution of your data.

Sample mean (or average) and median are examples of measures of central tendency.
What is the average customer tip?

Sample mean and median

The sample mean or average is:

\[\bar{x} = \frac{1}{n}(x_1+x_2 +\dots + x_n) = \frac{1}{n}\sum_{i=1}^nx_i.\]

The sample median is:

the middle value of the sorted observations when \(n\) is odd, and
average of the two middle sorted observations when \(n\) is even.

Sample data: \[54, 71, 57, 70, 53\]

The (sample) mean is \[(54 + 71 + 57 + 70 + 53)/5 = 61.\]

Sorted sample data: \[53, 54, 57, 70, 71\]

So (sample) median is \(57\).

The mean is commonly used, but the median is more robust to extreme observations (outliers).
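
The calculation above, and the robustness of the median, can be sketched in R. The value 171 is a made-up extreme observation.

```r
x <- c(54, 71, 57, 70, 53)
mean(x)     # 61
median(x)   # 57

# Replace 71 with an extreme value
x_outlier <- c(54, 171, 57, 70, 53)
mean(x_outlier)     # 81: pulled up by the outlier
median(x_outlier)   # 57: unchanged
```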

Skewness

Skewness is a measure of asymmetry in a given distribution.

Symmetric

Mean \(\approx\) Median

Positively skewed or
Right skewed

Mean > Median

Negatively skewed or
Left skewed

Mean < Median
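
A tiny illustration with made-up right-skewed values, where one large value pulls the mean above the median:

```r
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)  # hypothetical values
mean(right_skewed)     # 4.75
median(right_skewed)   # 3
```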

Modality

The sample mode is the value with the highest frequency.

Mode is useful for categorical data.
For numerical data, mode is less useful as there may be no repeated values.
However, we can look at the modality of a distribution: number of peaks in the distribution.

Unimodal distribution

Bimodal distribution

Multimodal distribution

Quantiles

A \(p\)-quantile (where \(0 < p < 1\)) is the value below which a proportion \(p\) of your data lie.

Note: quantiles do not need to be data values.
Quartiles are special quantiles that divide the data into four equal parts:
- First quartile (\(Q_1\)) or lower quartile is the 0.25 quantile
- Second quartile (\(Q_2\)) or median is the 0.50 quantile
- Third quartile (\(Q_3\)) or upper quartile is the 0.75 quantile
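
A sketch with `quantile()` on the five sampled grades used earlier. Note that several definitions of sample quantiles exist; the values below are from R's default method, which interpolates between data values.

```r
x <- c(54, 71, 57, 70, 53)

quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1 = 54, median = 57, Q3 = 70
quantile(x, probs = 0.10)                  # 53.4, not one of the data values
```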

A measure of dispersion

A measure of dispersion/spread is a number representing the spread of data around a measure of central tendency.

E.g. range, interquartile range (IQR), variance, standard deviation.

Measures of dispersion

Sample deviation: the signed difference between an observation and the sample mean, \(x_i-\bar{x}\)
Sample variance: \[s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2.\]
Sample standard deviation: the square root of the sample variance, \(s = \sqrt{s^2}\)
- Conveys similar information to the variance, but its units are the same as those of the data
The range is the difference between the maximum and minimum values in the dataset.
The interquartile range (IQR) is the difference between the third quartile and the first quartile (\(Q_3 - Q_1\)).

Population variance: \[\sigma^2 = \frac{1}{N}\sum_{i'=1}^{N} (x_{i'} - \mu)^2.\]
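
These measures can be sketched in R using the grades from the earlier mean example. Note that `var()` and `sd()` use the \(n - 1\) divisor, so the population variance is computed by hand here.

```r
x <- c(54, 71, 57, 70, 53)   # sampled grades

x - mean(x)                  # deviations: -7 10 -4 9 -8
var(x)                       # sample variance (divides by n - 1): 77.5
sd(x)                        # sample standard deviation: sqrt(77.5)
diff(range(x))               # range: 71 - 53 = 18
IQR(x)                       # interquartile range: 70 - 54 = 16

# Population variance divides by N instead of N - 1
population <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)
sigma2 <- mean((population - mean(population))^2)
sigma2
```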

Boxplots

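Lower fence: L = \(Q_1 - 1.5 \times IQR\)
Upper fence: U = \(Q_3 + 1.5 \times IQR\)
Observations below L or above U are drawn as individual points (potential outliers).

Boxplots do not work well for small datasets and certainly not for \(n < 5\).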
Boxplots are poor at showing multimodal distributions.
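
A sketch of the fence calculation, using the 14 grades from the population mean example. The quartiles here come from `quantile()` with its default method; `boxplot()` itself uses a slightly different quartile definition (hinges), so its box edges can differ a little.

```r
grades <- c(73, 60, 54, 62, 71, 68, 57, 60, 72, 57, 35, 53, 58, 70)

q1 <- unname(quantile(grades, 0.25))   # 57
q3 <- unname(quantile(grades, 0.75))   # 69.5
iqr <- q3 - q1                         # 12.5

lower <- q1 - 1.5 * iqr                # 38.25
upper <- q3 + 1.5 * iqr                # 88.25

# Observations outside the fences are potential outliers
outliers <- grades[grades < lower | grades > upper]
outliers                               # 35

boxplot(grades, horizontal = TRUE)
```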

Case study STAT1003 mark distribution

How hard is STAT1003 at ANU for a typical undergraduate student?

Here is a sample of assignment and quiz marks:

Five number summary: (55, 80, 88, 93, 100)

Mode: 6

Note: the five-number summary is (minimum, \(Q_1\), median, \(Q_3\), maximum).
What do you think based on the distribution of marks for assignment and quiz?

Summary

Summary statistics describe the main characteristics of the data.

Frequency table
Mode
Barplot

Skewness
Modality
Quantiles
A measure of central tendency: mean and median
A measure of dispersion: range, IQR, variance and standard deviation
Histogram
Boxplot
