Paired data

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University

Independent vs. paired samples

  • Are men earning more than women?
  • A random sample of men and women are selected and their incomes are recorded.
  • This is an example of independent samples.
  • Are husbands earning more than their wives?
  • A random sample of husbands and their wives are selected and their incomes are recorded.
  • This is an example of paired samples.
  • Whether you should collect independent samples or paired samples will be determined by your research question.

Example: textbook prices

Are the prices of textbooks on Amazon close to those in the UCLA bookstore?

73 textbooks were identified as required for UCLA courses.

The price of each textbook on Amazon and at the university bookstore was recorded.

Course ISBN UCLA price (USD) Amazon price (USD) Difference (USD)
C139 978-0520224759 18.75 20.21 -1.46
104 978-0470419977 174.00 143.75 30.25
C170 978-0803272620 27.67 27.95 -0.28
M422 978-0761918479 88.42 97.95 -9.53
127B 978-0324828641 214.50 173.56 40.94
10 978-0195181234 24.70 16.47 8.23
M104D 978-1412969666 89.71 85.45 4.26
M132B 978-0205753383 55.13 42.68 12.45
M104C 978-0582276024 16.00 11.67 4.33
124 978-0078136634 183.75 145.40 38.35

Paired data

  • The prices of the same textbook on Amazon and at the university bookstore are paired data.
  • Let:
    • \(X_{1i}\) be the price of the \(i\)-th textbook at the university bookstore,
    • \(X_{2i}\) be the price of the \(i\)-th textbook on Amazon, and
    • \(X_{di} = X_{1i} - X_{2i}\) be the price difference for the \(i\)-th textbook.
  • For paired data, we are often interested in the difference between the two responses, \(X_{di}\), effectively reducing the problem to a one-sample problem.
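
The reduction to a single sample of differences can be sketched in Python. This is a minimal illustration that uses only the ten rows displayed in the table, not all 73 textbooks, so its mean and standard deviation are not the full-sample statistics. The variable names (`prices`, `differences`, `d_bar`, `s_d`) are chosen here for illustration.

```python
from statistics import mean, stdev

# (UCLA bookstore price, Amazon price) in USD for the ten textbooks
# displayed in the table; the full data set has 73 textbooks.
prices = [
    (18.75, 20.21),
    (174.00, 143.75),
    (27.67, 27.95),
    (88.42, 97.95),
    (214.50, 173.56),
    (24.70, 16.47),
    (89.71, 85.45),
    (55.13, 42.68),
    (16.00, 11.67),
    (183.75, 145.40),
]

# X_di = X_1i - X_2i: one difference per textbook (rounded to cents).
differences = [round(ucla - amazon, 2) for ucla, amazon in prices]

# From here on the problem is a one-sample problem on the differences.
d_bar = mean(differences)   # sample mean of the differences
s_d = stdev(differences)    # sample standard deviation (n - 1 divisor)

print(differences)
print(f"n = {len(differences)}, mean = {d_bar:.2f}, sd = {s_d:.2f}")
```

The computed differences reproduce the Difference column of the table, which is a quick check that the subtraction order is bookstore minus Amazon.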

Inference for paired data

  • Suppose that
    • \(\mu_1\) is the mean price of the textbooks at the university bookstore,
    • \(\mu_2\) is the mean price of the textbooks on Amazon, and
    • \(\mu_d = \mu_1 - \mu_2\) is the mean price difference between the two sources.
  • A \(100(1-\alpha)\%\) confidence interval for \(\mu_d\) is given by: \[\class{highlight mark-yellow}{\bar{X}_d \pm t_{n-1, \alpha / 2} \frac{s_d}{\sqrt{n}}}\] where
    • \(\bar{X}_d\) is the sample mean of the differences,
    • \(s_d\) is the sample standard deviation of the differences, and
    • \(n\) is the number of pairs.
Source n Mean price (USD) Standard deviation (USD)
UCLA 73 72.22 59.66
Amazon 73 59.46 49.00
Difference 73 12.76 14.26
  • Assume that the price differences are independent and normally distributed.
  • A 95% confidence interval for \(\mu_d\) is given by:

\(12.76 \pm t_{72, 0.025} \times \dfrac{14.26}{\sqrt{73}}\) \(\approx\) \((9.44, 16.09).\)

  • Therefore we are 95% confident that the mean price difference between the university bookstore and Amazon is between $9.44 and $16.09.
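
The interval can be checked numerically. The sketch below uses only the Python standard library, so the \(t\) critical value is approximated by integrating the \(t\) density with Simpson's rule and bisecting; the helper names (`t_pdf`, `t_upper_tail`, `t_critical`) are hypothetical and written for this illustration. Because it starts from the rounded summary statistics in the table, its lower limit can differ from the value above in the second decimal place.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, width=60.0, steps=6000):
    """Approximate P(T >= x) by Simpson's rule on [x, x + width].

    Assumption: the tail beyond x + width is negligible, which holds for
    moderate or large df (such as 72) but not for very small df.
    """
    h = width / steps  # steps must be even for Simpson's rule
    total = t_pdf(x, df) + t_pdf(x + width, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(x + k * h, df)
    return total * h / 3

def t_critical(upper_prob, df):
    """Find q >= 0 with P(T >= q) = upper_prob (< 0.5) by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > upper_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Summary statistics of the differences, as rounded in the table.
n, d_bar, s_d = 73, 12.76, 14.26
alpha = 0.05

t_crit = t_critical(alpha / 2, n - 1)      # t_{72, 0.025}
margin = t_crit * s_d / math.sqrt(n)
lower, upper = d_bar - margin, d_bar + margin

print(f"t critical value: {t_crit:.4f}")
print(f"95% CI for mu_d: ({lower:.2f}, {upper:.2f})")
```

With a statistics package available, a library quantile function would normally replace the hand-rolled integration; the arithmetic for the margin of error is unchanged.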

Hypothesis testing for paired data

\(H_0: \mu_d = 0 \text{ vs. } H_A: \mu_d \neq 0\)

\[\class{highlight mark-yellow}{T = \dfrac{\bar{X}_d}{s_d / \sqrt{n}}} \sim t_{n-1} \text{ under } H_0\]

  • P-value \(= 2P(T \geq |t^*|)\) where \(t^*\) is the observed value of the test statistic.
  • \(t^* = \dfrac{12.76}{14.26/\sqrt{73}} \approx 7.65\)

  • P-value \(= 2P(T \geq 7.65) < 0.0001\).

  • Since the P-value is \(< 0.05\), we reject \(H_0\) at the 5% significance level: there is strong evidence that the mean price of textbooks on Amazon differs from the mean price at the UCLA bookstore.
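
The test statistic and two-sided P-value can be reproduced from the summary statistics. This is a sketch using only the Python standard library: the tail probability of the \(t_{72}\) distribution is approximated by Simpson's rule, and the helper names (`t_pdf`, `t_upper_tail`) are hypothetical, introduced only for this example.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, width=60.0, steps=6000):
    """Approximate P(T >= x) by Simpson's rule on [x, x + width].

    Assumption: the tail beyond x + width is negligible, which is
    reasonable for df = 72 but not for very small df.
    """
    h = width / steps  # steps must be even for Simpson's rule
    total = t_pdf(x, df) + t_pdf(x + width, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(x + k * h, df)
    return total * h / 3

# Summary statistics of the differences, as rounded in the table.
n, d_bar, s_d = 73, 12.76, 14.26

# Observed test statistic under H_0: mu_d = 0.
t_star = d_bar / (s_d / math.sqrt(n))

# Two-sided P-value: 2 * P(T >= |t*|) with T ~ t_{n-1}.
p_value = 2 * t_upper_tail(abs(t_star), n - 1)

print(f"t* = {t_star:.2f}")
print(f"P-value = {p_value:.2e}")
print("Reject H_0 at the 5% level" if p_value < 0.05 else "Do not reject H_0")
```

Integrating the tail directly, rather than computing one minus the cumulative probability, avoids losing precision when the P-value is extremely small.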

Summary

\(H_0: \mu_d = 0\)

  • For paired data, we are often interested in the difference between the two responses, effectively reducing the problem to a one-sample problem.
  • Observe \(n\) pairs of observations with sample mean difference \(\bar{X}_d\) and sample standard deviation of the differences \(s_d\), drawn from a population with mean difference \(\mu_d\) and standard deviation of the differences \(\sigma_d\).

Confidence interval for \(\mu_d\)

Variance known

\[\bar{X}_d \pm z_{\alpha / 2} \frac{\sigma_d}{\sqrt{n}}\]

Variance unknown

\[\bar{X}_d \pm t_{n - 1, \alpha / 2} \frac{s_d}{\sqrt{n}}\]
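
As a sketch of the variance-known case, the \(z\)-based interval can be computed with Python's standard library. The function name `z_interval` is hypothetical, and treating 14.26 as a known \(\sigma_d\) is an assumption made purely for illustration; in the textbook example the standard deviation is estimated, so the \(t\)-based interval is the appropriate one there.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(d_bar, sigma_d, n, alpha=0.05):
    """100(1 - alpha)% confidence interval for mu_d when sigma_d is known."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    margin = z_crit * sigma_d / sqrt(n)
    return d_bar - margin, d_bar + margin

# Illustration only: pretend the standard deviation of the differences
# is a known population value rather than a sample estimate.
lower, upper = z_interval(d_bar=12.76, sigma_d=14.26, n=73)
print(f"95% z-interval for mu_d: ({lower:.2f}, {upper:.2f})")
```

The \(z\) critical value is slightly smaller than \(t_{72, 0.025}\), so this interval is a little narrower than the \(t\)-based one for the same inputs.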

P-value

Variance known

\(Z = \dfrac{\bar{X}_d}{\sigma_d / \sqrt{n}} \sim N(0, 1)\) under \(H_0\).

\(H_A\) P-value
\(H_A: \mu_d \neq 0\) \(P(|Z| \geq |z^*|)\)
\(H_A: \mu_d > 0\) \(P(Z \geq z^*)\)
\(H_A: \mu_d < 0\) \(P(Z \leq z^*)\)
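
The three P-value rules in the table can be written as one small function. This is a minimal sketch with the standard library's `NormalDist`; the function name `z_p_value`, the `alternative` labels, and the value `z_star = 1.5` are hypothetical choices for illustration and do not come from the textbook data.

```python
from statistics import NormalDist

def z_p_value(z_star, alternative="two-sided"):
    """P-value for H_0: mu_d = 0 given the observed statistic z_star."""
    std_normal = NormalDist()  # N(0, 1)
    if alternative == "two-sided":      # H_A: mu_d != 0
        return 2 * (1 - std_normal.cdf(abs(z_star)))
    if alternative == "greater":        # H_A: mu_d > 0
        return 1 - std_normal.cdf(z_star)
    if alternative == "less":           # H_A: mu_d < 0
        return std_normal.cdf(z_star)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

z_star = 1.5  # hypothetical observed value, for illustration only
for alt in ("two-sided", "greater", "less"):
    print(f"{alt:>9}: {z_p_value(z_star, alt):.4f}")
```

The same structure applies in the variance-unknown case, with the \(t_{n-1}\) distribution in place of the standard normal.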

Variance unknown

\(T = \dfrac{\bar{X}_d}{s_d / \sqrt{n}} \sim t_{n - 1}\) under \(H_0\).

\(H_A\) P-value
\(H_A: \mu_d \neq 0\) \(P(|T| \geq |t^*|)\)
\(H_A: \mu_d > 0\) \(P(T \geq t^*)\)
\(H_A: \mu_d < 0\) \(P(T \leq t^*)\)