Acknowledgement

This lecture was partially adapted from the previous STAT1003 lecturers. Thank you folks!

Sampling Distribution

Recall: statistical inference

Statistical inference: infer the parameters of a population from a random sample drawn from that population.

  • Population parameters describe populations, e.g. population mean \(\mu\) and population variance \(\sigma^2\).
  • Population parameters are usually unknown.
  • To find out more about them, we take a random sample of data.
  • Calculate statistics from the sample, e.g. sample mean \(\bar{x}\) and sample variance \(s^2\).
  • Use \(\bar{x}\) and \(s^2\) to estimate \(\mu\) and \(\sigma^2\).

Sampling distribution

  • Sample statistics, like the sample mean (\(\bar{X}\)) and sample variance (\(S^2\)), are random variables.

Sampling distribution refers to the distribution of a statistic that would arise if we repeatedly took random samples from a population.

Sampling distribution of a sample mean

Expected value of a sample mean

  • Suppose \(X_1, X_2, \ldots, X_n\) are independent random variables drawn from a population with mean \(\mu\) and finite variance \(\sigma^2\) (we refer to this as independent and identically distributed or i.i.d.).
  • Then the sample mean \(\bar{X}\) is given by:

\[ \bar{X} = \frac{1}{n}(X_1 + X_2 + \ldots + X_n) \]

  • The expected value of the sample mean \(\bar{X}\) is equal to the population mean \(\mu\):

\[ \begin{align*} \text{E}(\bar{X}) &= \text{E}\left(\frac{1}{n}(X_1 + X_2 + \ldots + X_n)\right) \\ &= \frac{1}{n}(\text{E}(X_1) + \text{E}(X_2) + \ldots + \text{E}(X_n)) \\ &= \frac{1}{n}(n\mu) = \mu \end{align*} \]

Standard error

The standard error (SE) of a statistic is the standard deviation of its sampling distribution.

For the sample mean \(\bar{X}\):

\[ \begin{align*} \left[\text{SE}(\bar{X})\right]^2 &= \text{Var}\left(\frac{1}{n}(X_1 + X_2 + \ldots + X_n)\right) \\ &= \frac{1}{n^2}\text{Var}(X_1 + X_2 + \ldots + X_n) \\ &= \frac{\sigma^2}{n} \end{align*} \]

But since \(\sigma\) is usually unknown, we use the sample standard deviation \(s\) to estimate it. Thus, we have \(\widehat{\text{SE}}(\bar{X}) = \dfrac{s}{\sqrt{n}}\).
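A quick simulation illustrates that the spread of the sampling distribution shrinks like \(\sigma/\sqrt{n}\). This is only a sketch; the population parameters below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed for reproducibility
mu, sigma, n = 10.0, 4.0, 25     # illustrative population parameters
reps = 100_000                   # number of repeated samples

# Draw `reps` samples of size n and compute each sample's mean
sample_means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# The standard deviation of the sample means approximates SE = sigma/sqrt(n)
empirical_se = sample_means.std(ddof=1)
theoretical_se = sigma / np.sqrt(n)   # 4 / 5 = 0.8
```

In practice \(\sigma\) is unknown, which is why each sample's own standard deviation \(s\) is plugged in to estimate the standard error.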

Exact sampling distribution 1

  • For simple, well-defined random experiments, we can derive the exact sampling distribution theoretically.
  • Suppose we roll a fair four-sided die and let \(X\) be the number that comes up.
  • The population can be thought of as an infinite number of rolls of the die.
  • If we repeatedly draw samples of size 2, i.e., roll the die twice, find the sampling distribution of the sample mean \(\bar{X}\).
  • The population distribution of \(X\) is:
| \(x\) | \(1\) | \(2\) | \(3\) | \(4\) |
|---|---|---|---|---|
| \(P(X = x)\) | \(\frac{1}{4}\) | \(\frac{1}{4}\) | \(\frac{1}{4}\) | \(\frac{1}{4}\) |
  • \(E(X) = 2.5\) and \(\text{Var}(X) = 1.25\).

Exact sampling distribution 2

  • How about the sampling distribution for sample mean \(\bar{X}\) when the sample size \(n = 2\)?
  • The possible values of \(\bar{X}\) are:
| First roll \ Second roll | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 1.0 | 1.5 | 2.0 | 2.5 |
| 2 | 1.5 | 2.0 | 2.5 | 3.0 |
| 3 | 2.0 | 2.5 | 3.0 | 3.5 |
| 4 | 2.5 | 3.0 | 3.5 | 4.0 |
  • The (exact) sampling distribution of \(\bar{X}\) is:
| \(\bar{x}\) | \(1\) | \(1.5\) | \(2\) | \(2.5\) | \(3\) | \(3.5\) | \(4\) |
|---|---|---|---|---|---|---|---|
| \(P(\bar{X} = \bar{x})\) | \(\frac{1}{16}\) | \(\frac{2}{16}\) | \(\frac{3}{16}\) | \(\frac{4}{16}\) | \(\frac{3}{16}\) | \(\frac{2}{16}\) | \(\frac{1}{16}\) |
  • \(E(\bar{X}) = 2.5\) and \(\text{Var}(\bar{X}) = 0.625\).
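The exact sampling distribution above can be verified by enumerating all 16 equally likely pairs of rolls; a small sketch using exact fractions:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# Enumerate all 16 equally likely outcomes of two rolls of a fair 4-sided die
outcomes = Counter(Fraction(a + b, 2) for a, b in product([1, 2, 3, 4], repeat=2))
dist = {xbar: Fraction(count, 16) for xbar, count in outcomes.items()}

# Mean and variance of the exact sampling distribution of X-bar
mean = sum(xbar * p for xbar, p in dist.items())              # 5/2 = 2.5
var = sum((xbar - mean) ** 2 * p for xbar, p in dist.items()) # 5/8 = 0.625
```

Note that \(\text{Var}(\bar{X}) = 0.625 = 1.25 / 2\), i.e. the population variance divided by the sample size, as derived earlier.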

Distribution of a Sample Mean

Distribution of a sample mean

  • Suppose \(X_1, X_2, \ldots, X_n\) are i.i.d. with mean \(\mu\) and variance \(\sigma^2\).
  • Then \(\text{E}(\bar{X}) = \mu\) and \(\text{Var}(\bar{X}) = \dfrac{\sigma^2}{n}\).
  • Let’s assume that the population is:

    1. \(N(\mu, \sigma^2)\)
    2. \(U(a, b)\) so \(\mu = \dfrac{a + b}{2}\) and \(\sigma^2 = \dfrac{(b - a)^2}{12}\).
    3. \(B(m, p)\) so \(\mu = m p\) and \(\sigma^2 = m p (1 - p)\).
    4. \(\text{Poisson}(\lambda)\) so \(\mu = \lambda\) and \(\sigma^2 = \lambda\).


  • We then examine the empirical distribution of the sample mean.
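A sketch of how these empirical sampling distributions can be generated by simulation (all parameter values below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 30, 50_000  # sample size and number of repeated samples

# Four example populations with their theoretical means
populations = {
    "normal":   (lambda size: rng.normal(5, 2, size),       5.0),
    "uniform":  (lambda size: rng.uniform(0, 6, size),      3.0),
    "binomial": (lambda size: rng.binomial(10, 0.3, size),  3.0),
    "poisson":  (lambda size: rng.poisson(4, size),         4.0),
}

# For each population, draw `reps` samples of size n and record the
# sample means; a histogram of `xbar` would show the sampling distribution
means = {}
for name, (draw, mu) in populations.items():
    xbar = draw((reps, n)).mean(axis=1)
    means[name] = (xbar.mean(), mu)
```

In each case the average of the sample means is close to the population mean, and plotting `xbar` shows an increasingly bell-shaped histogram as \(n\) grows.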

Case 1: \(N(\mu, \sigma^2)\)

Case 2: \(U(a, b)\)

Case 3: \(B(m, p)\)

Case 4: \(\text{Poisson}(\lambda)\)

Sampling distribution of the sample mean



What do you notice about the sampling distributions of the sample means?

Central Limit Theorem

Central limit theorem

If a random variable is the mean of many independent random values, then its distribution will be approximately normal, regardless of how the individual terms are distributed.


Or mathematically, let \(X_1, \ldots, X_n\) be a random sample of \(n\) independent observations from a population with mean \(\mu\) and variance \(\sigma^2\). Then, for large \(n\),

\[\bar{X} \overset{\text{approx.}}{\sim} N\!\left(\mu,\; \frac{\sigma^2}{n}\right)\]

  • Many distributions in nature appear to conform to a normal distribution (if you ignore the fact that some values can never be negative).
  • This could be because each observation is the result of the sum of many independent random variables.

When can we approximate with Normal distribution?

  • Independence: observations must be independent.
  • Sample size \(n\):
    • If \(X\) is normally distributed, \(\bar{X}\) is normal for all \(n\).
    • If \(X\) is not normally distributed, \(\bar{X}\) is approximately normal for large \(n\):
      • \(n < 30\): data should be nearly normal with no clear outliers.
      • \(n \ge 30\): sampling distribution of \(\bar{X}\) is approximately normal unless there are extreme outliers.

Case 1: \(N(\mu, \sigma^2)\)

Case 2: \(U(a, b)\)

Case 3: \(B(m, p)\)

Case 4: \(\text{Poisson}(\lambda)\)

Example: Physician Part 1

The time a family physician spends seeing a patient follows a right-skewed distribution with mean 15 minutes and standard deviation 11.6 minutes.

What is the probability that the physician spends less than 12 minutes seeing a patient?

  • Let \(X\) be consultation time.
  • We want \(P(X < 12)\), but the distribution of \(X\) is unknown.
  • It is not possible to calculate \(P(X < 12)\).

Example: Physician Part 2

The time a family physician spends seeing a patient follows a right-skewed distribution with mean 15 minutes and standard deviation 11.6 minutes.

What is the probability that the doctor spends an average time of less than 12 minutes with her 30 patients?

  • Let \(\bar{X}\) be the mean consultation time.
  • By CLT, \[\bar{X} \overset{\text{approx.}}{\sim} N\!\left(15, \frac{11.6^2}{30}\right)\]
  • We want \(P(\bar{X} < 12)\).
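This probability can be evaluated with the normal approximation; a sketch using Python's standard library:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 15, 11.6, 30
se = sigma / sqrt(n)             # standard error of the sample mean

# P(X-bar < 12) under the CLT normal approximation
p = NormalDist(mu, se).cdf(12)   # approx 0.078
```

So even though \(P(X < 12)\) for a single patient cannot be computed, the CLT lets us compute the probability for the average of 30 patients.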

Example: Physician Part 3

The time a family physician spends seeing a patient follows a right-skewed distribution with mean 15 minutes and standard deviation 11.6 minutes.

One day, 35 patients have appointments. What is the probability the doctor works overtime beyond 8 hours?

  • Let \(X_i\) be the consultation time for patient \(i\).
  • \(P\!\left(\sum_{i=1}^{35} X_i > 480\right) = P\left(\bar{X} > \frac{480}{35}\right)\)
  • Again by CLT, \[\bar{X} \overset{\text{approx.}}{\sim} N\!\left(15, \frac{11.6^2}{35}\right)\]
  • We want \(P\left(\bar{X} > \frac{480}{35}\right)\).
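The overtime probability under this normal approximation can be computed as a sketch:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 15, 11.6, 35
se = sigma / sqrt(n)

# Working overtime means the total exceeds 480 min, i.e. X-bar > 480/35
p_overtime = 1 - NormalDist(mu, se).cdf(480 / n)   # approx 0.744
```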

Example: Passengers

The number of passengers passing through a large South East Asian airport is normally distributed with a mean of 110,000 persons per day and a standard deviation of 20,200 persons. If you select a random sample of 16 days:

  1. What is the probability that \(\bar{X}\) is between 102,000 and 104,500 passengers per day?
  2. The probability is 60% that \(\bar{X}\) will be between which two values symmetrically distributed around the population mean?
  • \(\bar{X} \overset{\text{approx.}}{\sim} N\!\left(110000, \frac{20200^2}{16}\right)\).
  • \(P(102000 < \bar{X} < 104500) = P(\bar{X} < 104500) - P(\bar{X} < 102000)\)
  • \(P(|\bar{X} - 110000| < q) = 0.6\) so \(P\left(|Z| < \frac{q}{20200 / \sqrt{16}}\right) = 0.6 \rightarrow P(Z < \frac{q}{20200 / \sqrt{16}}) = 0.8\).
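The two steps above can be carried out numerically; a sketch using Python's standard library:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 110_000, 20_200, 16
se = sigma / sqrt(n)          # 20200 / 4 = 5050
xbar = NormalDist(mu, se)

# 1. P(102000 < X-bar < 104500)
p1 = xbar.cdf(104_500) - xbar.cdf(102_000)   # approx 0.081

# 2. Symmetric interval around the mean with probability 0.6:
#    P(|Z| < z) = 0.6  =>  P(Z < z) = 0.8
q = NormalDist().inv_cdf(0.8) * se
lower, upper = mu - q, mu + q                # approx (105750, 114250)
```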

Distribution of a Sample Proportion

Sampling distribution of the sample proportion

  • If \(X \sim B(n, p)\), then \(X\) is the number of successes.
  • We estimate \(p\) using the sample proportion \[\hat{p} = \frac{X}{n}.\]
  • \(\hat{p}\) is a random variable.

The true (population) prime minister support rate \(p\) is unknown. We could take a poll and find the sample approval rate \(\hat{p}\) as an estimate.

The true (population) proportion of people who prefer Coke to Pepsi is unknown. We could randomly select a sample of people and calculate the sample proportion \(\hat{p}\).

CLT for the sample proportion

\[ X = X_1 + \cdots + X_n, \quad X_i = \begin{cases} 1 & \text{success} \\ 0 & \text{failure} \end{cases} \]

\[ \hat{p} = \frac{X_1 + \cdots + X_n}{n} \]

  • \(\hat{p}\) is the sample mean of Bernoulli variables.
  • CLT applies.

For \(X \sim B(n, p)\),

\(X \overset{\text{approx.}}{\sim} N\left(np, np(1-p)\right)\qquad\) and \(\qquad\hat{p} \overset{\text{approx.}}{\sim} N\!\left(p, \dfrac{p(1-p)}{n}\right)\)

provided sample size is sufficiently large.

Success-failure condition: \(np \ge 10\) and \(n(1-p) \ge 10\).

Example: Smoker

  • Approximately 15% of the US population smokes cigarettes.
  • A local government believed their community had a lower smoker rate and commissioned a survey of 400 randomly selected individuals.
  • The survey found that only 42 of the 400 participants smoke cigarettes.
  • If the true proportion of smokers in the community was really 15%, what is the probability of observing 42 or fewer smokers in a sample of 400 people?
  • The true proportion of smokers is \(p = 0.15\).
  • The sample size is \(n = 400\).
  • We want to calculate \(P(\hat{p} \le \frac{42}{400} = 0.105)\).
  • As \(n\) is large, \(\hat{p} \overset{\text{approx.}}{\sim} N\!\left(0.15, \dfrac{0.15 \times 0.85}{400}\right)\).
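A sketch of this calculation with the normal approximation:

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.15, 400
se = sqrt(p * (1 - p) / n)

# P(p-hat <= 42/400) under the normal approximation
prob = NormalDist(p, se).cdf(42 / n)   # approx 0.006
```

Such a small probability suggests the community's smoking rate really is below 15%.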

Example: Women on Corporate Boards

Over the past few years there has been increased monitoring of the representation of women on corporate boards. Suppose that the true percentage of women on ASX 200 boards is now 24.6% and that a random sample of 220 board members is chosen.

  1. What is the probability that in the sample less than 24% of board members will be women?
  2. If a sample of 100 is taken, how does this change your answer to above?
  • The sample proportion \(\hat{p} \overset{\text{approx.}}{\sim} N\!\left(0.246, \dfrac{0.246 \times 0.754}{220}\right)\).
  • \(P(\hat{p} < 0.24)\)
  • If \(n = 100\), then \(\hat{p} \overset{\text{approx.}}{\sim} N\!\left(0.246, \dfrac{0.246 \times 0.754}{100}\right)\); the larger standard error makes \(P(\hat{p} < 0.24)\) larger (closer to 0.5).
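A sketch comparing the two sample sizes under the normal approximation:

```python
from math import sqrt
from statistics import NormalDist

p = 0.246   # true population proportion

def prob_below(threshold, n):
    """P(p-hat < threshold) under the CLT normal approximation."""
    se = sqrt(p * (1 - p) / n)
    return NormalDist(p, se).cdf(threshold)

p220 = prob_below(0.24, 220)   # approx 0.418
p100 = prob_below(0.24, 100)   # approx 0.445
# Smaller sample -> larger standard error -> p-hat more likely below 0.24
```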

Summary

  • Sampling distribution is the distribution of a statistic that would arise if we repeatedly took random samples from a population.
  • If \(X\) is a random variable with mean \(\mu\) and variance \(\sigma^2\), then the sample mean of \(n\) observations, \(\bar{X}\), is also a random variable with \(\text{E}(\bar{X}) = \mu\) and \(\text{Var}(\bar{X}) = \sigma^2 / n\).
  • The central limit theorem states that the sampling distribution of \(\bar{X}\) is approximately \(N\!\left(\mu, \dfrac{\sigma^2}{n}\right)\) for large \(n\). Rule of thumb: \(n \geq 30\).
  • If \(X \sim B(n, p)\), the sampling distribution of the sample proportion is \(\hat{p} = \dfrac{X}{n} \overset{\text{approx.}}{\sim} N\!\left(p, \dfrac{p(1-p)}{n}\right).\) Success-failure condition: \(np \geq 10\) and \(n(1-p) \geq 10\).