Paired data

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University

Independent vs. paired samples

Are men earning more than women?
A random sample of men and a random sample of women are selected, and their incomes are recorded.
This is an example of independent samples.

Are husbands earning more than their wives?
A random sample of married couples is selected, and the incomes of each husband and wife are recorded.
This is an example of paired samples.

Whether you should collect independent samples or paired samples will be determined by your research question.

Example: textbook prices

Are the prices of textbooks on Amazon close to those in the UCLA bookstore?

73 textbooks were identified as required for UCLA courses.

The price of each textbook on Amazon and at the university bookstore was recorded.

Course	ISBN	UCLA price (USD)	Amazon price (USD)	Difference (USD)
C139	978-0520224759	18.75	20.21	-1.46
104	978-0470419977	174.00	143.75	30.25
C170	978-0803272620	27.67	27.95	-0.28
M422	978-0761918479	88.42	97.95	-9.53
127B	978-0324828641	214.50	173.56	40.94
10	978-0195181234	24.70	16.47	8.23
M104D	978-1412969666	89.71	85.45	4.26
M132B	978-0205753383	55.13	42.68	12.45
M104C	978-0582276024	16.00	11.67	4.33
124	978-0078136634	183.75	145.40	38.35

Paired data

The prices of the same textbook on Amazon and at the university bookstore are paired data.

Let:
- $X_{1i}$ be the price of the $i$-th textbook at the university bookstore,
- $X_{2i}$ be the price of the $i$-th textbook on Amazon, and
- $X_{di} = X_{1i} - X_{2i}$ be the price difference for the $i$-th textbook.

For paired data, we are often interested in the difference between the two responses, $X_{di}$, effectively reducing the problem to a one-sample problem.
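As a minimal sketch of this reduction, the snippet below computes $X_{di} = X_{1i} - X_{2i}$ for the ten textbooks shown in the table above. Only these ten rows of the 73 are available here, so the summary of the ten differences is illustrative and is not the full-sample result. The variable names are assumptions made for this example.

```python
from statistics import mean, stdev

# Prices (USD) for the ten textbooks listed in the table: (UCLA, Amazon).
# These are only 10 of the 73 pairs, so the summaries are illustrative.
prices = [
    (18.75, 20.21), (174.00, 143.75), (27.67, 27.95), (88.42, 97.95),
    (214.50, 173.56), (24.70, 16.47), (89.71, 85.45), (55.13, 42.68),
    (16.00, 11.67), (183.75, 145.40),
]

# One difference per pair: bookstore price minus Amazon price.
differences = [round(ucla - amazon, 2) for ucla, amazon in prices]

# From here on the problem is a one-sample problem on the differences.
mean_of_ten = mean(differences)
sd_of_ten = stdev(differences)  # sample standard deviation (divisor n - 1)

print(differences)
print(f"mean = {mean_of_ten:.3f}, sd = {sd_of_ten:.3f}")
```

Every later step (confidence interval, test statistic) uses only the list of differences, never the two price columns separately.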

Inference for paired data

Suppose that
- $\mu_1$ is the mean price of the textbooks at the university bookstore,
- $\mu_2$ is the mean price of the textbooks on Amazon, and
- $\mu_d = \mu_1 - \mu_2$ is the mean price difference between the two sources.
A $100(1-\alpha)\%$ confidence interval for $\mu_d$ is given by: \[\class{highlight mark-yellow}{\bar{X}_d \pm t_{n-1, \alpha / 2} \frac{s_d}{\sqrt{n}}}\] where
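- $\bar{X}_d$ is the sample mean of the differences,
- $s_d$ is the sample standard deviation of the differences,
- $n$ is the number of pairs, and
- $t_{n-1, \alpha / 2}$ is the upper $\alpha / 2$ critical value of the $t$ distribution with $n - 1$ degrees of freedom.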

Source	n	Mean price (USD)	Standard deviation (USD)
UCLA	73	72.22	59.66
Amazon	73	59.46	49.00
Difference	73	12.76	14.26

Assume that the price differences are independent and normally distributed.
A 95% confidence interval for $\mu_d$ is given by:

$12.76 \pm t_{72, 0.025} \times \dfrac{14.26}{\sqrt{73}}$ $\approx$ $(9.44, 16.09).$

Therefore we are 95% confident that the mean price difference (university bookstore minus Amazon) is between USD 9.44 and USD 16.09. Since the whole interval is above zero, the bookstore is estimated to be more expensive on average.
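The interval can be reproduced approximately from the summary table. This is a sketch, assuming the critical value $t_{72, 0.025} \approx 1.9935$ is read from a $t$ table or software. Because the mean and standard deviation are rounded to two decimals, the limits may differ in the last decimal place from those quoted above.

```python
import math

# Summary statistics of the differences, taken from the table above.
n = 73
mean_diff = 12.76
sd_diff = 14.26

# Assumption: upper 2.5% critical value of the t distribution with
# 72 degrees of freedom, looked up rather than computed here.
t_crit = 1.9935

standard_error = sd_diff / math.sqrt(n)   # s_d / sqrt(n)
margin = t_crit * standard_error          # half-width of the interval

ci_lower = mean_diff - margin
ci_upper = mean_diff + margin
print(f"95% CI for mu_d: ({ci_lower:.2f}, {ci_upper:.2f})")
```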

Hypothesis testing for paired data

$H_0: \mu_d = 0 \text{ vs. } H_A: \mu_d \neq 0$

\[\class{highlight mark-yellow}{T = \dfrac{\bar{X}_d}{s_d / \sqrt{n}}} \sim t_{n-1} \text{ under } H_0\]

P-value $= 2P(T \geq |t^*|)$ where $t^*$ is the observed value of the test statistic.

$t^* = \dfrac{12.76}{14.26/\sqrt{73}} \approx 7.65$
P-value $= 2P(T \geq 7.65) < 0.0001$.
Since the P-value is less than $0.05$, we reject $H_0$ at the 5% significance level: there is strong evidence that the mean price of textbooks on Amazon differs from the mean price in the UCLA bookstore.
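The test statistic and P-value can be checked numerically. The sketch below uses only the Python standard library and approximates the tail probability of the $t_{72}$ distribution by Simpson's rule. The helper names `t_pdf` and `t_upper_tail` are hypothetical, and truncating the integral at a finite upper limit is an assumption that is reasonable for moderately large degrees of freedom such as 72. In practice a statistics library would normally supply this tail probability directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t_obs, df, width=60.0, steps=20000):
    """Approximate P(T >= t_obs) by Simpson's rule on [t_obs, t_obs + width].

    Assumes the density beyond t_obs + width is negligible and that
    steps is even.
    """
    h = width / steps
    total = t_pdf(t_obs, df) + t_pdf(t_obs + width, df)
    for k in range(1, steps):
        weight = 4 if k % 2 else 2
        total += weight * t_pdf(t_obs + k * h, df)
    return total * h / 3

# Summary statistics of the differences, taken from the table above.
n, mean_diff, sd_diff = 73, 12.76, 14.26

t_obs = mean_diff / (sd_diff / math.sqrt(n))       # observed test statistic
p_value = 2 * t_upper_tail(abs(t_obs), df=n - 1)   # two-sided alternative

print(f"t* = {t_obs:.2f}, P-value = {p_value:.2e}")
```

The printed P-value is far below $0.0001$, which is why it displays as zero when rounded to four decimals.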

Summary

$H_0: \mu_d = 0$

For paired data, we are often interested in the difference between the two responses, effectively reducing the problem to a one-sample problem.
Suppose we observe $n$ pairs, with sample mean difference $\bar{X}_d$ and sample standard deviation of the differences $s_d$, from a population with mean difference $\mu_d$ and standard deviation of the differences $\sigma_d$.

Confidence interval for $\mu_d$

Variance known

\[\bar{X}_d \pm z_{\alpha / 2} \frac{\sigma_d}{\sqrt{n}}\]

Variance unknown

\[\bar{X}_d \pm t_{n - 1, \alpha / 2} \frac{s_d}{\sqrt{n}}\]

P-value

Variance known

$Z = \dfrac{\bar{X}_d}{\sigma_d / \sqrt{n}} \sim N(0, 1)$ under $H_0$.

$H_A$	P-value
$H_A: \mu_d \neq 0$	$P(|Z| \geq |z^*|)$
$H_A: \mu_d > 0$	$P(Z \geq z^*)$
$H_A: \mu_d < 0$	$P(Z \leq z^*)$
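The three rows of this table can be sketched as one function using `statistics.NormalDist` from the Python standard library. The function name `paired_z_p_value`, the `alternative` labels, and the numbers in the usage lines are hypothetical values chosen for illustration; they are not from the textbook data, where $\sigma_d$ is unknown.

```python
import math
from statistics import NormalDist

def paired_z_p_value(xbar_d, sigma_d, n, alternative="two-sided"):
    """Return (z*, P-value) for H0: mu_d = 0 when sigma_d is known."""
    z_obs = xbar_d / (sigma_d / math.sqrt(n))
    std_normal = NormalDist()  # N(0, 1)
    if alternative == "two-sided":    # H_A: mu_d != 0
        p = 2 * (1 - std_normal.cdf(abs(z_obs)))
    elif alternative == "greater":    # H_A: mu_d > 0
        p = 1 - std_normal.cdf(z_obs)
    elif alternative == "less":       # H_A: mu_d < 0
        p = std_normal.cdf(z_obs)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z_obs, p

# Hypothetical inputs: mean difference 1.2, known sigma_d 4.0, 25 pairs.
z_obs, p_two_sided = paired_z_p_value(1.2, 4.0, 25, "two-sided")
_, p_greater = paired_z_p_value(1.2, 4.0, 25, "greater")
_, p_less = paired_z_p_value(1.2, 4.0, 25, "less")
print(z_obs, p_two_sided, p_greater, p_less)
```

The unknown-variance case has the same structure, with $s_d$ in place of $\sigma_d$ and the $t_{n-1}$ distribution in place of $N(0, 1)$.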

Variance unknown

$T = \dfrac{\bar{X}_d}{s_d / \sqrt{n}} \sim t_{n - 1}$ under $H_0$.

$H_A$	P-value
$H_A: \mu_d \neq 0$	$P(|T| \geq |t^*|)$
$H_A: \mu_d > 0$	$P(T \geq t^*)$
$H_A: \mu_d < 0$	$P(T \leq t^*)$
