Chi-squared Tests

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University

These slides are best viewed on a modern browser like Google Chrome on a desktop or laptop. Some interactive components may require some time to fully load.

Acknowledgement

This lecture was partially adapted from the previous STAT1003 lecturers. Thank you folks!

Chi-squared Tests for Goodness-of-Fit

Example: blood type distribution

  • The global distribution of blood types (A, B, AB, O) is A: 40%, B: 25%, AB: 10%, O: 25%.
  • A scientist wants to know whether migration, population history, or selective factors have influenced the local blood type distribution.
  • The observed blood type in 200 randomly selected individuals from a local population is as follows: A: 85, B: 40, AB: 15, O: 60.
  • Does the observed distribution differ from the global distribution?
    • \(H_0\): The observed distribution is the same as the global distribution vs. 
    • \(H_A\): The observed distribution is different from the global distribution.
  • How should we test these hypotheses?

Chi-squared test for goodness-of-fit

  • Let \(O_i\) be the observed count for category \(i\) and \(E_i = n \times p_i\) be the expected count for category \(i\) under \(H_0\) where \(n\) is the total sample size and \(p_i\) is the expected proportion for category \(i\) for \(i = 1, 2, \ldots, k\).
  • Note that \(\sum_{i=1}^k O_i = \sum_{i=1}^k E_i = n\), where \(n\) is the total sample size.

The chi-squared test statistic is defined as:

\[X^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^k \frac{O_i^2}{E_i} - n.\]

  • Recall that \(Y \sim \text{Poisson}(\lambda)\) then \(E(Y) = \text{Var}(Y) = \lambda\).
  • We assume that \(\frac{O_i - E_i}{\sqrt{E_i}} \overset{\text{approx.}}{\sim} N(0, 1)\) for each \(i\) provided \(E_i \geq 5\) and cases are independent.
  • Thus, \(X^2 \overset{\text{approx.}}{\sim} \chi^2_{k-1}\) under \(H_0\).

Chi-squared test for goodness-of-fit in R

P-value is calculated as: \(P(\chi^2_{k-1} > X^2)\).

  • The global distribution of blood types (A, B, AB, O) is
    A: 40%, B: 25%, AB: 10%, O: 25%.
  • The observed blood type in 200 randomly selected individuals from a local population is as follows:
    A: 85, B: 40, AB: 15, O: 60.


Category A B AB O
\(O_i\) 85 40 15 60
\(E_i\) 80 50 20 50

Chi-squared Tests for Independence

Case study: iPod

  • Researchers recruited 219 participants in a study where they would sell a used iPod that was known to have frozen twice in the past.
  • The participants were incentivized to get as much money as they could for the iPod.
  • The researchers wanted to understand what types of questions would elicit the seller to disclose the freezing issue.
  • Unbeknownst to the participants who were the sellers in the study, the buyers were collaborating with the researchers to evaluate the influence of different questions on the likelihood of getting the sellers to disclose the past issues with the iPod.
  • The scripted buyers asked one of three questions:

    • General: What can you tell me about it?
    • Positive Assumption: It doesn’t have any problems, does it?
    • Negative Assumption: What problems does it have?
General Positive Negative Total
Disclose Problem 2 23 36 61
Hide Problem 71 50 37 158
Total 73 73 73 219

Hypotheses

  • \(H_0\): The buyer’s question and the seller’s behavior are independent.
  • \(H_A\): The buyer’s question and the seller’s behavior are not independent.
  • Recall that events \(A\) and \(B\) are independent if: \(P(A \cap B) = P(A)P(B)\).
  • Alternatively, we can write the hypotheses in terms of the probabilities:
  • \(H_0\):
    • \(P(\text{Disclose} \cap \text{General}) = P(\text{Disclose})P(\text{General})\)
    • \(P(\text{Disclose} \cap \text{Positive}) = P(\text{Disclose})P(\text{Positive})\)
    • \(P(\text{Disclose} \cap \text{Negative}) = P(\text{Disclose})P(\text{Negative})\)
    • \(P(\text{Hide} \cap \text{General}) = P(\text{Hide})P(\text{General})\)
    • \(P(\text{Hide} \cap \text{Positive}) = P(\text{Hide})P(\text{Positive})\)
    • \(P(\text{Hide} \cap \text{Negative}) = P(\text{Hide})P(\text{Negative})\)
  • \(H_A\): At least one of the probabilities are not equal.

Expected counts under \(H_0\)

  • Under \(H_0\) we have \(E_{ij} = n \times P(\text{row } i) \times P(\text{column } j)\) for \(i = 1, 2\) and \(j = 1, 2, 3\).
  • We estimate the probabilities by the sample proportions: \(P(\text{row } i) = \frac{\text{row } i \text{ total}}{n}\) and \(P(\text{column } j) = \dfrac{\text{column } j \text{ total}}{n}\).
  • E.g. \(E_{11} = E_{12} = E_{13} = 219 \times \dfrac{61}{219} \times \dfrac{73}{219} = 20.33\) and \(E_{21} = E_{22} = E_{23} = 219 \times \dfrac{158}{219} \times \dfrac{73}{219} = 52.67\).
  • We can summarize the observed counts and the expected counts in a table:
General Positive Negative Total
Disclose Problem 2 (20.33) 23 (20.33) 36 (20.33) 61
Hide Problem 71 (52.67) 50 (52.67) 37 (52.67) 158
Total 73 73 73 219

P-value for the chi-squared test for independence

  • For independence tests, we calculate the test statistic using \[X^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\] where \(R\) and \(C\) are the number of levels in the row variable and column variable.
  • Under \(H_0\), \(X^2 \overset{\text{approx.}}{\sim} \chi^2_{(R-1)(C-1)}\) provided that \(E_{ij} \geq 5\) for all \(i\) and \(j\) and cases are independent.

\(\chi^2 = \frac{(2 - 20.33)^2}{20.33} + \frac{(23 - 20.33)^2}{20.33} + \frac{(36 - 20.33)^2}{20.33} + \frac{(71 - 52.67)^2}{52.67} + \frac{(50 - 52.67)^2}{52.67} + \frac{(37 - 52.67)^2}{52.67} = 35.86\)

  • The p-value is \(P(\chi^2_2 > 35.86)\) which is very small (less than 0.001).
  • We reject the null and conclude that the data provide convincing evidence that the question asked did affect a seller’s likelihood to tell the truth about problems with the iPod.

Chi-square test for independence in R

Summary

  • The chi-squared test for goodness-of-fit is used to test whether the observed distribution of a categorical variable (with \(k\) levels) differs from an expected distribution. \[X^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} \overset{\text{approx.}}{\sim} \chi^2_{k-1}\text{ under } H_0\] where \(E_i = n \times p_i\) is the expected count for category \(i\) under \(H_0\).

  • The chi-squared test for independence is used to test whether there is an association between two categorical variables. \[X^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \overset{\text{approx.}}{\sim} \chi^2_{(R-1)(C-1)}\text{ under } H_0\] where \(E_{ij} = n \times P(\text{row } i) \times P(\text{column } j)\) is the expected count for cell \((i, j)\) under \(H_0\).

  • It is important to check the assumptions of the chi-squared tests, such as having a sufficiently large sample size and expected counts in each category.