
STAT1003 – Statistical Techniques
Dr. Emi Tanaka
Australian National University
Bivariate data involves two different variables (\(x\) and \(y\)) for each observation.
| | Categorical | Numerical |
|---|---|---|
| Categorical | | |
| Numerical | | |
A contingency table (also known as a cross-tabulation or crosstab) displays the frequency distribution of two or more categorical variables.
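As a minimal sketch of how such a table is tallied, the Python snippet below counts each pair of categories using only the standard library. The `records` data and the variables (smoker status and exercise level) are hypothetical, invented purely for illustration.

```python
from collections import Counter

# Hypothetical observations: (smoker, exercise) recorded for each person.
records = [
    ("yes", "low"), ("no", "high"), ("no", "low"),
    ("yes", "low"), ("no", "high"), ("no", "medium"),
]

# Count how often each (row category, column category) pair occurs.
counts = Counter(records)

rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Contingency table as a nested dict: table[row][col] = frequency.
# Counter returns 0 for combinations that never occur.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

print("smoker", cols)
for r in rows:
    print(r, [table[r][c] for c in cols])
```

Each cell holds the number of observations that fall in that combination of categories, so the cells sum to the total number of observations.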
What do you notice between these two approaches?
A stacked barplot is used to compare the composition of different groups in a dataset, especially the contribution of sub-categories to the total within each main category.
A percent stacked barplot is ideal for comparing the relative frequencies of subgroups within categories, rather than their absolute counts.
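The segment heights in a percent stacked barplot are the within-category percentages. A minimal sketch of that calculation, assuming a hypothetical table of counts (`counts`, with made-up categories `A` and `B` and subgroups `low` and `high`):

```python
# Hypothetical counts: counts[category][subgroup] = frequency.
counts = {
    "A": {"low": 2, "high": 6},
    "B": {"low": 30, "high": 10},
}

# Divide each count by its category total so every bar sums to 100%.
percent = {}
for category, subgroups in counts.items():
    total = sum(subgroups.values())
    percent[category] = {k: 100 * v / total for k, v in subgroups.items()}

for category, shares in percent.items():
    print(category, shares)
```

In this made-up example category `B` has five times as many observations as `A`, yet after rescaling the two bars have the same height and only their composition differs.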
A side-by-side barplot (also called a grouped barplot) is used to visually compare the values of different subgroups across categories.
For bivariate data where one variable is numerical and the other is categorical, you can compute univariate summary statistics separately for each group.
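One way this might look in practice is sketched below with Python's standard `statistics` module. The `data` values and the group labels `control` and `treated` are hypothetical, chosen only to illustrate the idea.

```python
from statistics import mean, median, stdev

# Hypothetical data: (group label, numerical value) for each observation.
data = [
    ("control", 4.0), ("control", 6.0), ("control", 8.0),
    ("treated", 7.0), ("treated", 9.0), ("treated", 14.0),
]

# Split the numerical values by the level of the categorical variable.
groups = {}
for label, value in data:
    groups.setdefault(label, []).append(value)

# Compute the usual univariate summaries separately for each group.
summary = {
    label: {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),  # sample standard deviation (n - 1 denominator)
    }
    for label, values in groups.items()
}

for label, stats in summary.items():
    print(label, stats)
```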
A beeswarm plot is a type of scatterplot that shows the distribution of data points while avoiding overlap, making it easier to visualize the density and spread of the data.
A dataset was collected to investigate morphological characteristics associated with seed weight in a line of diploid wheat (Triticum monococcum).
- DSeed: identifier for each seed
- Weight: weight of seed (mg)
- Length: length of seed (mm)
- Diameter: diameter of seed (mm)
- Moisture: moisture content of seed (as a percentage)
- Hardness: endosperm hardness

A scatterplot is a graphical representation that displays the relationship between two numerical variables by plotting individual data points on a two-dimensional graph.
Sample covariance is a measure of how much two numerical variables change together.
\[ s_{xy}=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right) \]
Consider the following dataset with two variables, \(x\) and \(y\):
| \(i\) | \(x\) | \(y\) |
|---|---|---|
| 1 | 1 | 10 |
| 2 | 2 | 70 |
| 3 | 3 | 100 |
\[s_{xy} = \frac{1}{2}\left[(1-2)(10-60)+(2-2)(70-60)+(3-2)(100-60)\right]=45\]
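The same calculation can be checked with a few lines of Python. This is only a sketch: the helper name `sample_covariance` is hypothetical and simply transcribes the formula for \(s_{xy}\), applied to the three observations in the table above.

```python
def sample_covariance(x, y):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of products of deviations from the means, divided by n - 1.
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


x = [1, 2, 3]
y = [10, 70, 100]
print(sample_covariance(x, y))  # prints 45.0
```

The positive value reflects that \(y\) tends to increase as \(x\) increases in this dataset.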
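The sample correlation coefficient \(r\) rescales the covariance by the spread of each variable, so it has no units and always lies between \(-1\) and \(1\):
\[r = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2\sum_{i=1}^n(y_i - \bar{y})^2}}\]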
| \(|r|\) | Interpretation |
|---|---|
| 0.8 - 1.0 | Very strong association |
| 0.6 - 0.8 | Strong association |
| 0.4 - 0.6 | Moderate association |
| 0.2 - 0.4 | Weak association |
| 0.0 - 0.2 | Very weak association |
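As a rough sketch, the snippet below computes \(r\) for the three-point dataset used in the covariance example and looks up its label in the table above. The function names `correlation` and `strength` are hypothetical. Because the table's ranges share their endpoints, the code assumes a boundary value such as 0.8 belongs to the stronger category; the table itself does not say which way these should go.

```python
from math import sqrt


def correlation(x, y):
    """Sample correlation coefficient r of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def strength(r):
    """Label |r| using the table; boundary handling is an assumption."""
    a = abs(r)
    if a >= 0.8:
        return "Very strong association"
    if a >= 0.6:
        return "Strong association"
    if a >= 0.4:
        return "Moderate association"
    if a >= 0.2:
        return "Weak association"
    return "Very weak association"


x = [1, 2, 3]
y = [10, 70, 100]
r = correlation(x, y)
print(round(r, 3), strength(r))  # prints 0.982 Very strong association
```

Note that the label depends only on \(|r|\), so a strong negative association is classified the same way as a strong positive one.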

Source: xkcd
Just because \(x\) and \(y\) are highly correlated, it does not mean that \(x\) causes \(y\) or vice versa – correlation is not causation!

\[r = 0.0414779\]


