These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See here for the PDF .

Press the right arrow to progress to the next slide!

1/23

ETC5521: Exploratory Data Analysis

Making comparisons between groups and strata

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 6 - Session 1

1/23

Case study 1 Pest resistance maize Part 1/2

Pests, like thrips and spiders, damage maize crops. Note: Maize = Corn
One strategy to protect crops against pests is to cultivate genetically modified (GM) maize that expresses a toxic protein.

Thrips	Spiders
16.6	0.8
16.4	0.8
11.0	0.6
16.8	0.4
10.6	0.6
18.4	0.8
14.2	0.0
10.2	0.6

The species abundance on 8 Bt GM maize is shown.

Kevin Wright (2018). agridat: Agricultural Datasets. R package version 1.16.

L. A. Hothorn, 2005. Evaluation of Bt-Maize Field Trials by a Proof of Safety. http://www.seedtest.org/upload/cms/user/presentation7Hothorn.pdf

2/23

Case study 1 Pest resistance maize Part 1/2

Pests, like thrips and spiders, damage maize crops. Note: Maize = Corn
One strategy to protect crops against pests is to cultivate genetically modified (GM) maize that expresses a toxic protein.

Thrips	Spiders
16.6	0.8
16.4	0.8
11.0	0.6
16.8	0.4
10.6	0.6
18.4	0.8
14.2	0.0
10.2	0.6

The species abundance on 8 Bt GM maize is shown.
Is the strategy working?

Kevin Wright (2018). agridat: Agricultural Datasets. R package version 1.16.

L. A. Hothorn, 2005. Evaluation of Bt-Maize Field Trials by a Proof of Safety. http://www.seedtest.org/upload/cms/user/presentation7Hothorn.pdf

2/23

Case study 1 Pest resistance maize Part 1/2

Pests, like thrips and spiders, damage maize crops. Note: Maize = Corn
One strategy to protect crops against pests is to cultivate genetically modified (GM) maize that expresses a toxic protein.

Thrips	Spiders
16.6	0.8
16.4	0.8
11.0	0.6
16.8	0.4
10.6	0.6
18.4	0.8
14.2	0.0
10.2	0.6

The species abundance on 8 Bt GM maize is shown.
Is the strategy working?
Well it didn't completely eliminate pests but did it lower the abundance?

Kevin Wright (2018). agridat: Agricultural Datasets. R package version 1.16.

L. A. Hothorn, 2005. Evaluation of Bt-Maize Field Trials by a Proof of Safety. http://www.seedtest.org/upload/cms/user/presentation7Hothorn.pdf

2/23

Case study 1 Pest resistance maize Part 1/2

Pests, like thrips and spiders, damage maize crops. Note: Maize = Corn
One strategy to protect crops against pests is to cultivate genetically modified (GM) maize that expresses a toxic protein.

Thrips	Spiders
16.6	0.8
16.4	0.8
11.0	0.6
16.8	0.4
10.6	0.6
18.4	0.8
14.2	0.0
10.2	0.6

The species abundance on 8 Bt GM maize is shown.
Is the strategy working?
Well it didn't completely eliminate pests but did it lower the abundance?
We can't tell without knowing what the typical abundance is.

Kevin Wright (2018). agridat: Agricultural Datasets. R package version 1.16.

L. A. Hothorn, 2005. Evaluation of Bt-Maize Field Trials by a Proof of Safety. http://www.seedtest.org/upload/cms/user/presentation7Hothorn.pdf

2/23

Case study 1 Pest resistance maize Part 1/2

Pests, like thrips and spiders, damage maize crops. Note: Maize = Corn
One strategy to protect crops against pests is to cultivate genetically modified (GM) maize that expresses a toxic protein.

Thrips	Spiders
16.6	0.8
16.4	0.8
11.0	0.6
16.8	0.4
10.6	0.6
18.4	0.8
14.2	0.0
10.2	0.6

The species abundance on 8 Bt GM maize is shown.
Is the strategy working?
Well it didn't completely eliminate pests but did it lower the abundance?
We can't tell without knowing what the typical abundance is.

At the heart of quantitative reasoning is a single question: Compared to what?

Kevin Wright (2018). agridat: Agricultural Datasets. R package version 1.16.

L. A. Hothorn, 2005. Evaluation of Bt-Maize Field Trials by a Proof of Safety. http://www.seedtest.org/upload/cms/user/presentation7Hothorn.pdf

2/23

Case study 1 Pest resistance maize Part 2/2The actual experiment compared Bt variety to the isogenic control variety.
3/23

Case study 1 Pest resistance maize Part 2/2

The actual experiment compared Bt variety to the isogenic control variety.
How would you compare graphically?

data(gathmann.bt, package = "agridat")
df1 <- gathmann.bt %>% 
  pivot_longer(-gen, values_to = "abundance",
               names_to = "species") %>% 
  mutate(species = case_when(species=="thysan" ~ "Thrip",
                             TRUE ~ "Spider"))
skimr::skim(df1)

## ── Data Summary ────────────────────────
##                            Values
## Name                       df1   
## Number of rows             32    
## Number of columns          3     
## _______________________          
## Column type frequency:           
##   character                1     
##   factor                   1     
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 species               0             1     5     6     0        2          0
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts     
## 1 gen                   0             1 FALSE          2 Bt: 16, ISO: 16
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 abundance             0             1  6.01  6.36     0   0.6   2.5  10.7  18.4 ▇▂▃▁▂

g1 <- ggplot(df1, aes(gen, abundance, color = species)) +
  geom_point(size = 3) + 
  facet_wrap(~species, scales = "free") + 
  scale_color_discrete_qualitative() + 
  guides(color = FALSE) + 
  labs(x = "", y = "Abundance", tag = "(A)") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
g2 <- ggplot(df1, aes(gen, abundance, color = species)) +
  geom_point(size = 3) + 
  scale_color_discrete_qualitative() + 
  guides(color = FALSE) + 
  labs(x = "", y = "Abundance", tag = "(B)", color = "Species") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
g3 <- ggplot(df1, aes(gen, abundance, color = species)) +
  geom_point(size = 3) + 
  facet_wrap(~species) + 
  scale_color_discrete_qualitative() + 
  labs(x = "", y = "Abundance", tag = "(C)", color = "Species") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
g4 <- ggplot(df1, aes(species, abundance, color = gen)) +
  geom_point(size = 3) + 
  facet_wrap(~gen, scales = "free") + 
  scale_color_discrete_qualitative(palette = "Harmonic") + 
  guides(color = FALSE) + 
  labs(x = "", y = "Abundance", tag = "(D)") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
g5 <- ggplot(df1, aes(species, abundance, color = gen)) +
  geom_point(size = 3) + 
  scale_color_discrete_qualitative(palette = "Harmonic") + 
  guides(color = FALSE) + 
  labs(x = "", y = "Abundance", tag = "(E)") + 
  theme(axis.text.x = element_text(angle = 90))
g6 <- ggplot(df1, aes(species, abundance, color = gen)) +
  geom_point(size = 3) + 
  facet_wrap(~gen) + 
  scale_color_discrete_qualitative(palette = "Harmonic") + 
  labs(x = "", y = "Abundance", tag = "(F)", color = "Genotype")  + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
(g1 + g2 + g3) / (g4 + g5 + g6)

3/23

Comparing like-with-like Part 1

Comparison should be fair - any differences should be due to the factor you wish to investigate.

4/23

Comparing like-with-like Part 1

Comparison should be fair - any differences should be due to the factor you wish to investigate.

Comparable populations and measurements

Abundance is measured for two species: spiders and thrips.
The abundance metric differ between species.
Would you compare the abundance of spiders on a Bt variety to the abundance of thrips on a isogenic variety?

Image source: https://www.infonet-biovision.org/PlantHealth/Pests/Thrips and https://cropwatch.unl.edu/2016/managing-spider-mites-corn-and-soybean

4/23

Case study 2 Maize kernels Part 1/2

Plant pathologist
Associate plant pathologist
Professor of agronomy
Assistant professor of agronomy
Professor of philosophy
Biologist
Biologist (also author)
Assistant in biology
Data entry clerk (a.k.a. "Computer")
Farmer
Professor of plant physiology
Instructor in plant physiology
Assistant in plant physiology
Assistant in plant physiology
Professor of biology

4 maize ears selected.
15 observers asked to classify kernels to (i) starchy yellow, (ii) starchy white, (iii) sweet yellow or (iv) sweet white.
Ear 11 was slightly abnormal due to a fungus attack giving some pinkish tinge to some kernels.
Is Ear 11 different?

Raymond Pearl, 1911. The Personal Equation In Breeding Experiments Involving Certain Characters of Maize, Biol. Bull., 21, 339-366.

5/23

Case study 2 Maize kernels Part 2/2

Plant pathologist
Associate plant pathologist
Professor of agronomy
Assistant professor of agronomy
Professor of philosophy
Biologist
Biologist (also author)
Assistant in biology
Data entry clerk (a.k.a. "Computer")
Farmer
Professor of plant physiology
Instructor in plant physiology
Assistant in plant physiology
Assistant in plant physiology
Professor of biology

All observer are counting the kernels of the same ears, however there are variations across observers.
Notice Observer 1 classifies more kernels as yellow for Ear 11.

6/23

Comparing like-with-like Part 2

Comparable conditions

The variability due to other sources need to be accounted, removed or "averaged" out for a fair comparison.

7/23

Comparing like-with-like Part 3

Comparable variables and sources

Data collected by different sources may have different rules. E.g. in Australia, "a COVID-19 death is defined for surveillance purposes as a death in a probable or confirmed COVID-19 case, unless there is a clear alternative cause of death that cannot be related to COVID19 (e.g. trauma)"^[1]
Do other countries use the same definition?
The COVID-19 death data often have delays in reporting and data would be updated or corrected later.

^[1] https://www.health.gov.au/resources/total-covid-19-cases-and-deaths-by-states-and-territories

8/23

Case study 3 Multi-environment soybean trial

58 soy varieties are grown in four locations in Queensland in 1970 then 1971.
Soy breeders are interested in finding the "best" variety.
So which variety is the best for yield?

data(australia.soybean, package = "agridat")
skimr::skim(australia.soybean)

## ── Data Summary ────────────────────────
##                            Values           
## Name                       australia.soybean
## Number of rows             464              
## Number of columns          10               
## _______________________                     
## Column type frequency:                      
##   factor                   3                
##   numeric                  7                
## ________________________                    
## Group variables            None             
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                            
## 1 env                   0             1 FALSE          8 B70: 58, B71: 58, L70: 58, L71: 58    
## 2 loc                   0             1 FALSE          4 Bro: 116, Law: 116, Nam: 116, Red: 116
## 3 gen                   0             1 FALSE         58 G01: 8, G02: 8, G03: 8, G04: 8        
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate     mean    sd       p0      p25      p50     p75    p100 hist 
## 1 year                  0             1 1970.    0.501 1970     1970     1970.    1971    1971    ▇▁▁▁▇
## 2 yield                 0             1    2.05  0.752    0.282    1.52     2.07     2.56    4.38 ▂▆▇▃▁
## 3 height                0             1    0.883 0.272    0.25     0.708    0.888    1.04    1.73 ▂▆▇▂▁
## 4 lodging               0             1    2.31  0.976    1        1.5      2.25     3       4.75 ▇▅▅▂▁
## 5 size                  0             1   11.1   4.45     4        7.84     9.5     14.0    23.6  ▅▇▂▃▁
## 6 protein               0             1   40.3   2.93    33.2     38.1     40.2     42.2    48.5  ▁▇▇▅▁
## 7 oil                   0             1   19.9   2.67    13.0     18.0     19.8     22.1    26.8  ▁▆▇▆▂

australia.soybean %>% 
  mutate(gen = reorder(gen, yield)) %>% 
  ggplot(aes(gen, yield, color = loc, shape = as.factor(year))) +
  geom_point(size = 3) + 
  labs(x = "Genotype", y = "Yield (tons/hectare)", shape = "Year", color = "Location") +
  scale_color_discrete_qualitative() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

9/23

Types of comparison

Is the interest to find the best variety for a location?
In that case, should the comparison be within location?
Or is the interest to find the overall best variety at any location?
Comparisons may be specific or general.
A different type of comparison may require a different calculation or graphic for investigation.

📊
R

ggplot(australia.soybean, aes(env, yield, group = gen)) +
  geom_point(size = 6, color = "gray") + 
  geom_line(size = 1.3, color = "gray") +
  geom_point(data = filter(australia.soybean, gen %in% c("G49", "G48", "G50")), aes(color = gen), size = 6) + 
  geom_line(data = filter(australia.soybean, gen %in% c("G49", "G48", "G50")), aes(color = gen), size = 1.3) +
  scale_color_discrete_qualitative() + 
  labs(x = "Environment", y = "Yield",
       color = "Genotype")

10/23

Case study 4 Weight of calves with different diets Part 1/3

67 calves born in 1975 across 11 herds are fed of one of three diets with low, medium or high energy with their initial and final weights recorded.
Different graphics and metrics will help to make comparison easier and fair.

data(urquhart.feedlot, package = "agridat")
df4 <- urquhart.feedlot %>% 
  pivot_longer(c(weight1, weight2), names_to = "when", values_to = "weight") %>% 
  mutate(when = factor(as.character(when), 
                       labels = c("initial", "final"),
                       levels = c("weight1", "weight2")),
         diet = factor(diet, levels = c("High", "Medium", "Low")))
skimr::skim(df4)

## ── Data Summary ────────────────────────
##                            Values
## Name                       df4   
## Number of rows             134   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   2     
##   numeric                  3     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts               
## 1 diet                  0             1 FALSE          3 Low: 64, Med: 46, Hig: 24
## 2 when                  0             1 FALSE          2 ini: 67, fin: 67         
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50    p75  p100 hist 
## 1 animal                0             1  34    19.4     1  17.2   34    50.8    67 ▇▇▇▇▇
## 2 herd                  0             1  25.6  10.7     3  19     31    34      36 ▂▁▂▂▇
## 3 weight                0             1 863.  182.    530 705    842. 1027    1248 ▃▇▃▆▃

df4 %>% 
  ggplot(aes(diet, weight, color = diet)) + 
  geom_point(size = 3) + facet_grid(when ~ herd, scale="free_y") + 
  scale_color_discrete_qualitative() + 
  labs(x = "Diet", y = "Weight", title = "Weight by herd, timing and diet") + 
  guides(color = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

11/23

Case study 4 Weight of calves with different diets Part 2/3

Weight data are paired so comparison of initial and final weights should be matched with the animal.

📊
R

ggplot(df4, aes(when, weight, color = diet, group = animal)) + 
  geom_point(size = 3) + facet_wrap(~herd, nrow = 2) + 
  geom_line() + 
  labs(x = "", y = "Weight", color = "Diet")

12/23

Case study 4 Weight of calves with different diets Part 3/3

People are generally better at comparing lengths on a common scale instead of angles (see Cleveland & McGill, 1985)
We could compare the difference in inital and final weight.
Weight gain doesn't take into account the initial weight though.
We could consider computing the relative weight gain with respect to its initial weight.

📊
R

g1 <- urquhart.feedlot %>% 
  mutate(diet = factor(diet, 
                       levels = c("High", "Medium", "Low"))) %>% 
  ggplot(aes(diet, weight2 - weight1, color = diet)) + 
  geom_boxplot() + 
  labs(x = "", y = "Weight gain", color = "Diet") + 
  guides(color = FALSE)
g2 <- urquhart.feedlot %>% 
  mutate(diet = factor(diet, 
                       levels = c("High", "Medium", "Low"))) %>% 
  ggplot(aes(diet, (weight2 - weight1)/weight1, color = diet)) + 
  geom_boxplot() + 
  labs(x = "", y = "Relative weight\ngain", color = "Diet") + 
  guides(color = FALSE)
g1 + g2

Cleveland, William S., and Robert Mc Gill. n.d. “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.”

13/23

Case study 5 Swiss bank notes

Comparisons are not always based on point estimates.
Below is the comparison of distribution for the diagonal length of genuine and forged Swiss bank notes.

data(bank, package = "gclus")
df5 <- bank %>% 
  mutate(status = ifelse(Status==0, "genuine", "forgery")) 
skimr::skim(bank)

## ── Data Summary ────────────────────────
##                            Values
## Name                       bank  
## Number of rows             200   
## Number of columns          7     
## _______________________          
## Column type frequency:           
##   numeric                  7     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   mean    sd    p0   p25   p50   p75  p100 hist 
## 1 Status                0             1   0.5  0.501   0     0     0.5   1     1   ▇▁▁▁▇
## 2 Length                0             1 215.   0.377 214.  215.  215.  215.  216.  ▁▇▇▂▁
## 3 Left                  0             1 130.   0.361 129   130.  130.  130.  131   ▁▅▇▇▁
## 4 Right                 0             1 130.   0.404 129   130.  130   130.  131.  ▃▆▇▅▁
## 5 Bottom                0             1   9.42 1.44    7.2   8.2   9.1  10.6  12.7 ▇▆▅▅▂
## 6 Top                   0             1  10.7  0.803   7.7  10.1  10.6  11.2  12.3 ▁▂▆▇▃
## 7 Diagonal              0             1 140.   1.15  138.  140.  140.  142.  142.  ▂▇▅▅▇

g1 <- ggplot(df5, aes(Diagonal, fill = status)) + 
        geom_histogram(binwidth = 0.2, color = "white") +
        facet_grid(status ~ . ) + 
  labs(x = "Diagonal length (mm)", 
       y = "Count") + 
  guides(fill = FALSE) + 
  scale_fill_manual(values = c("#C7A76C", "#7DB0DD"))
g1

14/23

Comparing graphically Part 1

From (A) we can see that the diagonal length distribution is quite different between forged and genuine notes.
Comparing (B) and (C) is however difficult due to different aspect ratio and graphs are not in common scale nor alignment.

15/23

Case study 6 Barley Yield Part 1/2

data("barley", package = "lattice")
skimr::skim(barley)

## ── Data Summary ────────────────────────
##                            Values
## Name                       barley
## Number of rows             120   
## Number of columns          4     
## _______________________          
## Column type frequency:           
##   factor                   3     
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                        
## 1 variety               0             1 FALSE         10 Sva: 12, No.: 12, Man: 12, No.: 12
## 2 year                  0             1 FALSE          2 193: 60, 193: 60                  
## 3 site                  0             1 FALSE          6 Gra: 20, Dul: 20, Uni: 20, Mor: 20
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 yield                 0             1  34.4  10.3  14.4  26.9  32.9  41.4  65.8 ▃▇▅▂▁

ggplot(barley, aes(yield, variety, shape = year)) + 
  geom_point(size = 3) + 
  facet_wrap(~site) + 
  theme(plot.title.position = "plot",
        plot.title = element_text(face = "bold")) + 
  labs(x = "Yield", shape = "Year", y = "Variety")

10 barley varieties were tested at 6 locations in 1931 and in 1932
Do you notice anything about the yield with respect to the years?

Immer, R. F., H. K. Hayes, and LeRoy Powers. (1934). Statistical Determination of Barley Varietal Adaptation. Journal of the American Society of Agronomy 26 403–419

16/23

Case study 6 Barley Yield Part 1/2

data("barley", package = "lattice")
skimr::skim(barley)

## ── Data Summary ────────────────────────
##                            Values
## Name                       barley
## Number of rows             120   
## Number of columns          4     
## _______________________          
## Column type frequency:           
##   factor                   3     
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                        
## 1 variety               0             1 FALSE         10 Sva: 12, No.: 12, Man: 12, No.: 12
## 2 year                  0             1 FALSE          2 193: 60, 193: 60                  
## 3 site                  0             1 FALSE          6 Gra: 20, Dul: 20, Uni: 20, Mor: 20
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 yield                 0             1  34.4  10.3  14.4  26.9  32.9  41.4  65.8 ▃▇▅▂▁

ggplot(barley, aes(yield, variety, color = year)) + 
  geom_point(size = 3) + 
  facet_wrap(~site) + 
  theme(plot.title.position = "plot",
        plot.title = element_text(face = "bold")) + 
  labs(x = "Yield", y = "Variety", color = "Year") + 
  scale_color_discrete_qualitative()

10 barley varieties were tested at 6 locations in 1931 and in 1932
Do you notice anything about the yield with respect to the years?

How about now?

Immer, R. F., H. K. Hayes, and LeRoy Powers. (1934). Statistical Determination of Barley Varietal Adaptation. Journal of the American Society of Agronomy 26 403–419

16/23

Case study 6 Barley Yield Part 1/2

data("barley", package = "lattice")
skimr::skim(barley)

## ── Data Summary ────────────────────────
##                            Values
## Name                       barley
## Number of rows             120   
## Number of columns          4     
## _______________________          
## Column type frequency:           
##   factor                   3     
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                        
## 1 variety               0             1 FALSE         10 Sva: 12, No.: 12, Man: 12, No.: 12
## 2 year                  0             1 FALSE          2 193: 60, 193: 60                  
## 3 site                  0             1 FALSE          6 Gra: 20, Dul: 20, Uni: 20, Mor: 20
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 yield                 0             1  34.4  10.3  14.4  26.9  32.9  41.4  65.8 ▃▇▅▂▁

ggplot(barley, aes(yield, variety, color = year)) + 
  geom_point(size = 3, alpha = 0.4) + 
  geom_point(data = subset(barley, (site=="University Farm" & variety == "No. 475") | (site=="Grand Rapids" & variety == "Velvet")), size = 3) +
  facet_wrap(~site) + 
  theme(plot.title.position = "plot",
        plot.title = element_text(face = "bold")) + 
  labs(x = "Yield", y = "Variety", color = "Year") + 
  scale_color_discrete_qualitative()

10 barley varieties were tested at 6 locations in 1931 and in 1932
Do you notice anything about the yield with respect to the years?

How about now?

Immer, R. F., H. K. Hayes, and LeRoy Powers. (1934). Statistical Determination of Barley Varietal Adaptation. Journal of the American Society of Agronomy 26 403–419

16/23

Case study 6 Barley Yield Part 2/2

Cleveland (1993) speculated that the year labels may have been reversed for some data.
Wright (2013) investigated this by examining extended data from 1927 to 1936, in addition to weather covariates, and found that the observations are not particularly unusual.

Cleveland, W. S. (1993) Visualising Data, Summit, NJ: Hobart Press.
Wright, Kevin (2013). Revisiting Immer's Barley Data. The American Statistician. 67 (3) 129–133.

17/23

Case study 7 Olive oils

data(olives, package = "classifly")
df2 <- olives %>% 
  mutate(Region = factor(Region, labels = c("South", "Sardinia", "North")))
skimr::skim(df2)

## ── Data Summary ────────────────────────
##                            Values
## Name                       df2   
## Number of rows             572   
## Number of columns          10    
## _______________________          
## Column type frequency:           
##   factor                   2     
##   numeric                  8     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                         
## 1 Region                0             1 FALSE          3 Sou: 323, Nor: 151, Sar: 98        
## 2 Area                  0             1 FALSE          9 Sou: 206, Inl: 65, Cal: 56, Umb: 51
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   mean    sd    p0    p25   p50    p75  p100 hist 
## 1 palmitic              0             1 1232.  169.    610 1095   1201  1360    1753 ▁▂▇▆▁
## 2 palmitoleic           0             1  126.   52.5    15   87.8  110   169.    280 ▂▇▅▃▁
## 3 stearic               0             1  229.   36.7   152  205    223   249     375 ▂▇▃▁▁
## 4 oleic                 0             1 7312.  406.   6300 7000   7302. 7680    8410 ▁▇▇▇▁
## 5 linoleic              0             1  981.  243.    448  771.  1030  1181.   1470 ▃▅▃▇▃
## 6 linolenic             0             1   31.9  13.0     0   26     33    40.2    74 ▂▅▇▂▁
## 7 arachidic             0             1   58.1  22.0     0   50     61    70     105 ▂▁▇▇▂
## 8 eicosenoic            0             1   16.3  14.1     1    2     17    28      58 ▇▃▅▂▁

g1 <- 
  df2 %>% 
  mutate(Area = fct_reorder(Area, palmitic)) %>% 
  ggplot(aes(Area, palmitic, color = Region)) + 
  geom_boxplot() + 
  scale_color_discrete_qualitative() + 
  guides(color = FALSE, x = guide_axis(n.dodge = 2))
g2 <- ggplot(df2, aes(Region, palmitic, color = Region)) + 
  geom_boxplot() + 
  scale_color_discrete_qualitative() + 
  guides(color = FALSE)
g3 <- ggplot(df2, aes(palmitic, color = Region)) + 
  geom_density() +
  scale_color_discrete_qualitative() + 
  guides(color = FALSE)
g4 <- ggplot(df2, aes(palmitic, color = Region)) + 
  stat_ecdf() +
  scale_color_discrete_qualitative()
g1 / (g2 | (g3 / g4)) + plot_layout(guides = "collect", byrow = FALSE)

18/23

The olive oil data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils.
There are 9 collection areas, 4 from southern Italy (North and South Apulia, Calabria, Sicily), two from Sardinia (Inland and Coastal) and 3 from northern Italy (Umbria, East and West Liguria).

Comparing graphically Part 2

📊
R

ggplot(olives, aes(palmitoleic, palmitic, color = Area)) + 
  geom_point() + 
  scale_color_discrete_qualitative()

Color is a great way to differentiate categories but if there are too many categories then it becomes hard to compare.
In this scatter plot, there are too many overlapping points so splitting the data to multiple windows via facetting may make it easier to compare.

19/23

Comparing graphically Part 3

📊
R

ggplot(olives, aes(palmitoleic, palmitic, color = Area)) + 
  geom_point() + 
  facet_wrap(~Area) + 
  scale_color_discrete_qualitative() + 
  guides(color = FALSE)

20/23

Comparing graphically Part 3

📊
R

ggplot(olives, aes(palmitoleic, palmitic, color = Area)) + 
  geom_point() + 
  facet_wrap(~Area) + 
  scale_color_discrete_qualitative() + 
  guides(color = FALSE)

📊
R

20/23

Case study 8 England and East Indies trade data

data(EastIndiesTrade, package = "GDAdata")
skimr::skim(EastIndiesTrade)

## ── Data Summary ────────────────────────
##                            Values         
## Name                       EastIndiesTrade
## Number of rows             81             
## Number of columns          3              
## _______________________                   
## Column type frequency:                    
##   numeric                  3              
## ________________________                  
## Group variables            None           
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 Year                  0             1 1740   23.5  1700  1720  1740  1760  1780 ▇▇▇▇▇
## 2 Exports               0             1  518. 421.    100   145   370   840  1395 ▇▂▃▂▂
## 3 Imports               0             1 1005. 320.    460   835   975  1000  1550 ▃▃▇▁▅

g1 <- ggplot(EastIndiesTrade, aes(Year, Exports)) +
    annotate("rect", xmin = 1701, xmax = 1714, 
           ymin = -Inf, ymax = Inf, 
           fill = "red", alpha = 0.3) +
  annotate("rect", xmin = 1756, xmax = 1763, 
           ymin = -Inf, ymax = Inf, 
           fill = "red", alpha = 0.3) +
  annotate("rect", xmin = 1775, xmax = 1780, 
           ymin = -Inf, ymax = Inf, 
           fill = "red", alpha = 0.3) +
  geom_line(color = "#339933", size = 2) + 
  geom_line(aes(Year, Imports), color = "red", size = 2) +
  geom_ribbon(aes(ymin = Exports, ymax = Imports), fill = "gray") + 
  labs(y = "<span style='color:#339933'>Export</span>/<span style='color:red'>Import</span>", tag = "(A)") + 
  theme(axis.title.y = ggtext::element_markdown())
g2 <- ggplot(EastIndiesTrade, aes(Year, Exports - Imports)) +
  annotate("rect", xmin = 1701, xmax = 1714, 
           ymin = -Inf, ymax = Inf, 
           fill = "red", alpha = 0.3) +
  annotate("rect", xmin = 1756, xmax = 1763, 
           ymin = -Inf, ymax = Inf, 
           fill = "red", alpha = 0.3) +
  annotate("rect", xmin = 1775, xmax = 1780, 
           ymin = -Inf, ymax = Inf, 
           fill = "red", alpha = 0.3) +
  geom_line(size = 2) +
  labs(tag = "(B)")
g3 <- ggplot(EastIndiesTrade, aes(Year, (Exports - Imports)/(Exports + Imports) * 2)) + 
    annotate("rect", xmin = 1701, xmax = 1714, 
           ymin = -Inf, ymax = Inf, 
           fill = "red", alpha = 0.3) +
  annotate("rect", xmin = 1756, xmax = 1763, 
           ymin = -Inf, ymax = Inf, 
           fill = "red", alpha = 0.3) +
  annotate("rect", xmin = 1775, xmax = 1780, 
           ymin = -Inf, ymax = Inf, 
           fill = "red", alpha = 0.3) +
  geom_line(color = "#001a66", size = 2) + 
  labs(y = "Relative difference", tag = "(C)")
g1 / g2 / g3

(A) shows the export from England to the East Indies and the import to England from the East Indies in millions of pounds.
Import and export figures are easier to compare by plotting the difference like in (B).
Relative difference may be more of an interest - (C) plots the relative difference with respect to the average of export and import values.
The red area correspond to War of the Spanish Succession (1701-14), Seven Years' War (1756-63) and the American Revolutionary War (1775-83).

21/23

Case study 9 Melbourne's daily maximum temperature

df9 <- read_csv(here::here("data", "melb_temp.csv")) %>% 
  janitor::clean_names() %>% 
  rename(temp = maximum_temperature_degree_c) %>% 
  filter(!is.na(temp)) %>% 
  dplyr::select(year, month, day, temp)
skimr::skim(df9)

## ── Data Summary ────────────────────────
##                            Values
## Name                       df9   
## Number of rows             18310 
## Number of columns          4     
## _______________________          
## Column type frequency:           
##   character                2     
##   numeric                  2     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 month                 0             1     2     2     0       12          0
## 2 day                   0             1     2     2     0       31          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   mean    sd     p0    p25    p50    p75   p100 hist 
## 1 year                  0             1 1995.  14.5  1970   1983   1995   2008   2020   ▇▇▇▇▇
## 2 temp                  0             1   19.9  6.48    5.7   14.8   18.6   23.6   46.8 ▃▇▃▁▁

ggplot(df9, aes(month, temp)) + 
  geom_boxplot() + 
  labs(x = "Month", y = "Maximum temperature (°C)")

Melbourne's daily maximum temperature from 1970 to 2020.
How are the temperature different across months?
What about the temperature within a month?
You'll explore this data in week 7 tutorial!

Thanks to Di for the data!

22/23

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Lecturer: Emi Tanaka

ETC5521.Clayton-x@monash.edu

Week 6 - Session 1

23/23

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help