Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide
These slides are viewed best by Chrome and occasionally need to be refreshed if elements did not load properly. See here for PDF .


Press the right arrow to progress to the next slide!

1/47


Advent of "Grammar"


Bridging Statistics and Data Science for the Design of Experiments

Emi Tanaka

Department of Econometrics and Business Statistics

emi.tanaka@monash.edu @statsgen

27th Nov 2020 @ Statistical Society of Australia


1/47
NOTICE
  • This talk is most relevant for people who code
  • Examples are all in R but philosophy holds for other programming languages

1 Grammar of Graphics

2 Grammar of Data Manipulation

3 Grammar of Experimental Design
2/47

1

Grammar of Graphics

ggplot2 πŸ“¦

Wickham (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York

Wilkinson (2005) The Grammar of Graphics. Statistics and Computing. Springer, 2nd edition.

3/47

A typical Australian academic is expected to do 40% teaching, 40% research and 20% administration

and like 200% external service work, outreach, writing grants, reviewing things, sitting on panels, learning how to do online teaching,...

4/47

What are the differences between these plots?

5/47

What are the differences between these plots?

How do you construct these plots?
5/47

Using Base R

df
## duty perc
## 1 Teaching 40
## 2 Research 40
## 3 Admin 20
barplot(as.matrix(df$perc),
legend = df$duty)

pie(df$perc, labels = df$duty)

R Core Team (2020) R: A Language and Environment for Statistical Computing https://www.R-project.org/

6/47

Using Base R

df
## duty perc
## 1 Teaching 40
## 2 Research 40
## 3 Admin 20
barplot(as.matrix(df$perc),
legend = df$duty)

pie(df$perc, labels = df$duty)

R Core Team (2020) R: A Language and Environment for Statistical Computing https://www.R-project.org/

  • Single purpose functions to generate "named plots"
  • Input varies, here it is vector or matrix
6/47

Using ggplot2

df
## duty perc
## 1 Teaching 40
## 2 Research 40
## 3 Admin 20
ggplot(df, aes(x = "", # dummy
y = perc,
fill = duty)) +
geom_col()

Wilkinson (2005) The Grammar of Graphics. Statistics and Computing. Springer, 2nd edition.

Wickham (2008) Practical Tools for Exploring Data and Models. PhD Thesis Chapter 3: A layered grammar of graphics.

Wickham (2010) A Layered Grammar of Graphics, Journal of Computational and Graphical Statistics, 19:1, 3-28

7/47

Using ggplot2

df
## duty perc
## 1 Teaching 40
## 2 Research 40
## 3 Admin 20
ggplot(df, aes(x = "", # dummy
y = perc,
fill = duty)) +
geom_col()

ggplot(df, aes(x = "", # dummy
y = perc,
fill = duty)) +
geom_col() +
coord_polar(theta = "y")

Wilkinson (2005) The Grammar of Graphics. Statistics and Computing. Springer, 2nd edition.

Wickham (2008) Practical Tools for Exploring Data and Models. PhD Thesis Chapter 3: A layered grammar of graphics.

Wickham (2010) A Layered Grammar of Graphics, Journal of Computational and Graphical Statistics, 19:1, 3-28

7/47

Using ggplot2

df
## duty perc
## 1 Teaching 40
## 2 Research 40
## 3 Admin 20
ggplot(df, aes(x = "", # dummy
y = perc,
fill = duty)) +
geom_col()

ggplot(df, aes(x = "", # dummy
y = perc,
fill = duty)) +
geom_col() +
coord_polar(theta = "y")

Wilkinson (2005) The Grammar of Graphics. Statistics and Computing. Springer, 2nd edition.

Wickham (2008) Practical Tools for Exploring Data and Models. PhD Thesis Chapter 3: A layered grammar of graphics.

Wickham (2010) A Layered Grammar of Graphics, Journal of Computational and Graphical Statistics, 19:1, 3-28

  • ggplot2 implements the grammar of graphics
  • the difference between a stacked barplot and a pie chart is that the coordinate system have been transformed from Cartesian coordinate to polar coordinate
7/47

Data Plot

  • ggplot uses tidy data as input so plot construction is enforced by consistent thinking in relation to tidy data
    Tidy data principles
    1. Each variable must have its own column.
    2. Each observation must have its own row.
    3. Each value must have its own cell.
  • Variables are mapped to a plot aesthetic
  • Plots are constructed from its components expressed by the grammar of graphics

Wickham (2014) Tidy Data. Journal of Statistical Software, Articles 59 (10): 1–23.

8/47

Do you like potatoes?

9/47

Do you like potatoes?


Let's have a look at a potato field experiment to understand the infection rate when it's treated with sulfur at different season and amount?

9/47

How do we get these graphs?

data(cochran.crd,
package = "agridat")
cochran.crd
## inf trt row col
## 1 9 F3 4 1
## 2 12 O 4 2
## 3 18 S6 4 3
## 4 10 F12 4 4
## 5 24 S6 4 5
## 6 17 S12 4 6
## 7 30 S3 4 7
## 8 16 F6 4 8
## 9 10 O 3 1
## 10 7 S3 3 2
## 11 4 F12 3 3
## 12 10 F6 3 4
## 13 21 S3 3 5
## 14 24 O 3 6
## 15 29 O 3 7
## 16 12 S6 3 8
## 17 9 F3 2 1
## 18 7 S12 2 2
## 19 18 F6 2 3
## 20 30 O 2 4
## 21 18 F6 2 5
## 22 16 S12 2 6
## 23 16 F3 2 7
## 24 4 F12 2 8
## 25 9 S3 1 1
## 26 18 O 1 2
## 27 17 S12 1 3
## 28 19 S6 1 4
## 29 32 O 1 5
## 30 5 F12 1 6
## 31 26 O 1 7
## 32 4 F3 1 8

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17.

W.G. Cochran and G. Cox, 1957. Experimental Designs, 2nd ed. John Wiley, New York.

10/47

How do we get these graphs?

data(cochran.crd,
package = "agridat")
cochran.crd
## inf trt row col
## 1 9 F3 4 1
## 2 12 O 4 2
## 3 18 S6 4 3
## 4 10 F12 4 4
## 5 24 S6 4 5
## 6 17 S12 4 6
## 7 30 S3 4 7
## 8 16 F6 4 8
## 9 10 O 3 1
## 10 7 S3 3 2
## 11 4 F12 3 3
## 12 10 F6 3 4
## 13 21 S3 3 5
## 14 24 O 3 6
## 15 29 O 3 7
## 16 12 S6 3 8
## 17 9 F3 2 1
## 18 7 S12 2 2
## 19 18 F6 2 3
## 20 30 O 2 4
## 21 18 F6 2 5
## 22 16 S12 2 6
## 23 16 F3 2 7
## 24 4 F12 2 8
## 25 9 S3 1 1
## 26 18 O 1 2
## 27 17 S12 1 3
## 28 19 S6 1 4
## 29 32 O 1 5
## 30 5 F12 1 6
## 31 26 O 1 7
## 32 4 F3 1 8

ggplot(cochran.crd,
aes(x = col,
y = row,
fill = inf)) +
geom_tile(color = "black", size = 1.2) +
scale_fill_gradient(low = "gold",
high = "firebrick4") +
theme(text = element_text(size = 20))
ggplot(cochran.crd,
aes(x = col,
y = row,
fill = trt)) +
geom_tile(color = "black", size = 1.2) +
scale_fill_viridis_d() +
theme(text = element_text(size = 20))

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17.

W.G. Cochran and G. Cox, 1957. Experimental Designs, 2nd ed. John Wiley, New York.

10/47

Lessons from the grammar of graphics
in ggplot2


  • Modifiable: ggplot object can be modified
11/47

Lessons from the grammar of graphics
in ggplot2


  • Modifiable: ggplot object can be modified
  • Generalisable: Commands may feel verbose (compared to single-graphics functions) but you're less limited in the plots you can construct without an exhaustive list of functions (compared to the number of plots you can create)
11/47

Lessons from the grammar of graphics
in ggplot2


  • Modifiable: ggplot object can be modified
  • Generalisable: Commands may feel verbose (compared to single-graphics functions) but you're less limited in the plots you can construct without an exhaustive list of functions (compared to the number of plots you can create)
  • Extensible: the system can be extended to make specialised plots or add more features if the same "grammar" is adopted
11/47

2

Grammar of
Data Manipulation

dplyr πŸ“¦

Wickham, François, Henry & Müller (2020) dplyr: A Grammar of Data Manipulation. R-package version 1.0.0.

12/47

What does the code below do?

# part 1
tapply(cochran.crd$inf, cochran.crd$trt, mean)
# F12 F3 F6 O S12 S3 S6
# 5.750 9.500 15.500 22.625 14.250 16.750 18.250
df <- cochran.crd
# part 2
df$season <- ifelse(grepl("F", df$trt),
"Fall",
"Spring")
df$season[df$trt=="O"] <- "N/A"
# part 3
df$sulfur <- sapply(df$trt,
function(x) switch(as.character(x),
F3 = 300, F6 = 600, F12 = 1200,
S3 = 300, S6 = 600, S12 = 1200,
O = 0)) # pounds/acre
# part 4, move trt after col
df[, c("inf", "row", "col", "trt",
"season", "sulfur")]
# inf row col trt season sulfur
# 1 9 4 1 F3 Fall 300
# 2 12 4 2 O N/A 0
# 3 18 4 3 S6 Spring 600
# 4 10 4 4 F12 Fall 1200
# 5 24 4 5 S6 Spring 600
# 6 17 4 6 S12 Spring 1200
# 7 30 4 7 S3 Spring 300
# 8 16 4 8 F6 Fall 600
# 9 10 3 1 O N/A 0
# 10 7 3 2 S3 Spring 300
# 11 4 3 3 F12 Fall 1200
# 12 10 3 4 F6 Fall 600
# 13 21 3 5 S3 Spring 300
# 14 24 3 6 O N/A 0
# 15 29 3 7 O N/A 0
# 16 12 3 8 S6 Spring 600
# 17 9 2 1 F3 Fall 300
# 18 7 2 2 S12 Spring 1200
# 19 18 2 3 F6 Fall 600
# 20 30 2 4 O N/A 0
# 21 18 2 5 F6 Fall 600
# 22 16 2 6 S12 Spring 1200
# 23 16 2 7 F3 Fall 300
# 24 4 2 8 F12 Fall 1200
# 25 9 1 1 S3 Spring 300
# 26 18 1 2 O N/A 0
# 27 17 1 3 S12 Spring 1200
# 28 19 1 4 S6 Spring 600
# 29 32 1 5 O N/A 0
# 30 5 1 6 F12 Fall 1200
# 31 26 1 7 O N/A 0
# 32 4 1 8 F3 Fall 300
01:00
13/47

Which is easier to read for you? Part 1

# part 1
tapply(cochran.crd$inf, cochran.crd$trt, mean)
# F12 F3 F6 O S12 S3 S6
# 5.750 9.500 15.500 22.625 14.250 16.750 18.250
# part 1
cochran.crd %>%
group_by(trt) %>%
summarise(inf_mean = mean(inf))
# # A tibble: 7 x 2
# trt inf_mean
# <fct> <dbl>
# 1 F12 5.75
# 2 F3 9.5
# 3 F6 15.5
# 4 O 22.6
# 5 S12 14.2
# 6 S3 16.8
# 7 S6 18.2
14/47

Which is easier to read for you? Part 1

# part 1
tapply(cochran.crd$inf, cochran.crd$trt, mean)
# F12 F3 F6 O S12 S3 S6
# 5.750 9.500 15.500 22.625 14.250 16.750 18.250
  • Can you guess what tapply does if you don't know about it?
# part 1
cochran.crd %>%
group_by(trt) %>%
summarise(inf_mean = mean(inf))
# # A tibble: 7 x 2
# trt inf_mean
# <fct> <dbl>
# 1 F12 5.75
# 2 F3 9.5
# 3 F6 15.5
# 4 O 22.6
# 5 S12 14.2
# 6 S3 16.8
# 7 S6 18.2
14/47

Which is easier to read for you? Part 1

# part 1
tapply(cochran.crd$inf, cochran.crd$trt, mean)
# F12 F3 F6 O S12 S3 S6
# 5.750 9.500 15.500 22.625 14.250 16.750 18.250
  • Can you guess what tapply does if you don't know about it?
  • Functions are less accessible to broader audiences if function names and argument names are designed so:
    • specialist knowledge is required,
    • approach is ad-hoc, and/or
    • unintuitively designed for either input and/or output
# part 1
cochran.crd %>%
group_by(trt) %>%
summarise(inf_mean = mean(inf))
# # A tibble: 7 x 2
# trt inf_mean
# <fct> <dbl>
# 1 F12 5.75
# 2 F3 9.5
# 3 F6 15.5
# 4 O 22.6
# 5 S12 14.2
# 6 S3 16.8
# 7 S6 18.2
14/47

Which is easier to read for you? Part 2

df <- cochran.crd
# part 2
df$season <- ifelse(grepl("F", df$trt),
"Fall",
"Spring")
df$season[df$trt=="O"] <- "N/A"
# part 3
df$sulfur <- sapply(df$trt,
function(x) switch(as.character(x),
F3 = 300, F6 = 600, F12 = 1200,
S3 = 300, S6 = 600, S12 = 1200,
O = 0)) # pounds/acre
# part 4, move trt after col
df[, c("inf", "row", "col", "trt",
"season", "sulfur")]
# inf row col trt season sulfur
# 1 9 4 1 F3 Fall 300
# 2 12 4 2 O N/A 0
# 3 18 4 3 S6 Spring 600
# 4 10 4 4 F12 Fall 1200
# 5 24 4 5 S6 Spring 600
# 6 17 4 6 S12 Spring 1200
# 7 30 4 7 S3 Spring 300
# 8 16 4 8 F6 Fall 600
# 9 10 3 1 O N/A 0
# 10 7 3 2 S3 Spring 300
# 11 4 3 3 F12 Fall 1200
# 12 10 3 4 F6 Fall 600
# 13 21 3 5 S3 Spring 300
# 14 24 3 6 O N/A 0
# 15 29 3 7 O N/A 0
# 16 12 3 8 S6 Spring 600
# 17 9 2 1 F3 Fall 300
# 18 7 2 2 S12 Spring 1200
# 19 18 2 3 F6 Fall 600
# 20 30 2 4 O N/A 0
# 21 18 2 5 F6 Fall 600
# 22 16 2 6 S12 Spring 1200
# 23 16 2 7 F3 Fall 300
# 24 4 2 8 F12 Fall 1200
# 25 9 1 1 S3 Spring 300
# 26 18 1 2 O N/A 0
# 27 17 1 3 S12 Spring 1200
# 28 19 1 4 S6 Spring 600
# 29 32 1 5 O N/A 0
# 30 5 1 6 F12 Fall 1200
# 31 26 1 7 O N/A 0
# 32 4 1 8 F3 Fall 300
cochran.crd %>%
# part 2 & 3
mutate(season = case_when(
str_detect(trt, "F") ~ "Fall",
str_detect(trt, "S") ~ "Spring",
TRUE ~ "N/A"),
sulfur = case_when(
str_detect(trt, "12") ~ 1200,
str_detect(trt, "6") ~ 600,
str_detect(trt, "3") ~ 300,
TRUE ~ 0)) %>%
# part 4
relocate(trt, .after = col) %>%
as_tibble() # optional
# # A tibble: 32 x 6
# inf row col trt season sulfur
# <int> <int> <int> <fct> <chr> <dbl>
# 1 9 4 1 F3 Fall 300
# 2 12 4 2 O N/A 0
# 3 18 4 3 S6 Spring 600
# 4 10 4 4 F12 Fall 1200
# 5 24 4 5 S6 Spring 600
# 6 17 4 6 S12 Spring 1200
# 7 30 4 7 S3 Spring 300
# 8 16 4 8 F6 Fall 600
# 9 10 3 1 O N/A 0
# 10 7 3 2 S3 Spring 300
# # … with 22 more rows
15/47

Which is easier to read for you? Part 3

🎯 find the mean of inf by the combination of variables season and sulfur
How would you modify the code below?

tapply(df$inf, df$trt, mean)
## F12 F3 F6 O S12 S3 S6
## 5.750 9.500 15.500 22.625 14.250 16.750 18.250
df %>%
group_by(trt) %>%
summarise(inf_mean = mean(inf))
## # A tibble: 7 x 2
## trt inf_mean
## <fct> <dbl>
## 1 F12 5.75
## 2 F3 9.5
## 3 F6 15.5
## 4 O 22.6
## 5 S12 14.2
## 6 S3 16.8
## 7 S6 18.2
00:30
16/47

Which is easier to read for you? Part 3

🎯 find the mean of inf by the combination of variables season and sulfur
How would you modify the code below?

tapply(df$inf, list(df$season, df$sulfur), mean)
## 0 300 600 1200
## Fall NA 9.50 15.50 5.75
## N/A 22.625 NA NA NA
## Spring NA 16.75 18.25 14.25

Notice that the above output is matrix.

df %>%
group_by(season, sulfur) %>%
summarise(inf_mean = mean(inf))
## # A tibble: 7 x 3
## # Groups: season [3]
## season sulfur inf_mean
## <chr> <dbl> <dbl>
## 1 Fall 300 9.5
## 2 Fall 600 15.5
## 3 Fall 1200 5.75
## 4 N/A 0 22.6
## 5 Spring 300 16.8
## 6 Spring 600 18.2
## 7 Spring 1200 14.2
16/47

Which is easier to read for you? Part 3

🎯 find the mean of inf by the combination of variables season and sulfur
How would you modify the code below?

tapply(df$inf, list(df$season, df$sulfur), mean)
## 0 300 600 1200
## Fall NA 9.50 15.50 5.75
## N/A 22.625 NA NA NA
## Spring NA 16.75 18.25 14.25

Notice that the above output is matrix.

This works also:

tapply(df$inf, paste0(df$season, df$sulfur), mean)
## Fall1200 Fall300 Fall600 N/A0 Spring1200
## 5.750 9.500 15.500 22.625 14.250
## Spring300 Spring600
## 16.750 18.250
df %>%
group_by(season, sulfur) %>%
summarise(inf_mean = mean(inf))
## # A tibble: 7 x 3
## # Groups: season [3]
## season sulfur inf_mean
## <chr> <dbl> <dbl>
## 1 Fall 300 9.5
## 2 Fall 600 15.5
## 3 Fall 1200 5.75
## 4 N/A 0 22.6
## 5 Spring 300 16.8
## 6 Spring 600 18.2
## 7 Spring 1200 14.2
16/47

Reading code like English grammar

df <- cochran.crd %>%
mutate(season = case_when(
str_detect(trt, "F") ~ "Fall",
str_detect(trt, "S") ~ "Spring",
TRUE ~ "N/A"),
sulfur = case_when(
str_detect(trt, "12") ~ 1200,
str_detect(trt, "6") ~ 600,
str_detect(trt, "3") ~ 300,
TRUE ~ 0)) %>%
relocate(trt, .after = col) %>%
as_tibble() # optional
df %>%
group_by(season, sulfur) %>%
summarise(inf_mean = mean(inf))
  • Read %>% as "and then"
  • The code may read more familiar to an English user without much specialist knowledge
  • Of course this is biased in favour of people who know English
  • Bonus: no intermediate naming to think or worry about
17/47

Lessons from grammar of data manipulation
in dplyr


  • Leveraging existing knowledge of an everyday user or an "average" statistician for the choice of function names and argument names can make it more accessible
18/47

Lessons from grammar of data manipulation
in dplyr


  • Leveraging existing knowledge of an everyday user or an "average" statistician for the choice of function names and argument names can make it more accessible
  • Making it more accessible, by making code more expressive or otherwise, can promote statistical literacy
18/47

Lessons from grammar of data manipulation
in dplyr


  • Leveraging existing knowledge of an everyday user or an "average" statistician for the choice of function names and argument names can make it more accessible
  • Making it more accessible, by making code more expressive or otherwise, can promote statistical literacy
  • Consistency in input and output makes it easier for user expectation and helps to offload mental load
18/47

Can all this work be considered as statistics?

19/47

Can all this work be considered as statistics?

Or data science?


19/47

Can all this work be considered as statistics?

Or data science?


Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data. Wikipedia, accessed on 27th Nov 2020

19/47

Can all this work be considered as statistics?

Or data science?


Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data. Wikipedia, accessed on 27th Nov 2020


Isn't this what statistics is about?

19/47

There are two types of people...


20/47

There are two types of people...



But actually, it's hugely unbalanced

21/47

Statistics by nature is inter-disciplinary and about extracting knowledge and insight from data

at least from my point of view

22/47

Statistics by nature is inter-disciplinary and about extracting knowledge and insight from data

at least from my point of view

but communication largely became exclusive to selective set of statistically well-trained people governed by strict adherence to defintions or protocols in lieu of practicality

which is not to say it is bad to do this

22/47

Far more people need to use statistics than just statistically well-trained people


Wickham (2015) Teaching Safe-Stats, Not Statistical Abstinence. The American Statistician, Online Discussion

23/47

Far more people need to use statistics than just statistically well-trained people


Statistics community need to go beyond talking among like-minded peers

Wickham (2015) Teaching Safe-Stats, Not Statistical Abstinence. The American Statistician, Online Discussion

23/47

3

Grammar of
Experimental Design

edibble πŸ“¦

(Work-In-Progress)

24/47

Typical course in experimental design
(at least at University of Sydney in 2017-2019)

Teach:

  • Completely Randomised Design
  • Randomised Complete Block Design
  • Latin Square Design
  • Balanced Incomplete Block Design
  • Factorial Design
  • 2k Factorial Design (I removed this from 2018, I won't talk about this today)
  • Split-plot Design (I added this from 2018 among other concepts)
25/47

Completely Randomised Design (CRD)


  • t treatments randomised to n units


observation=mean+treatment+error

(with constraints and distributional assumptions)


26/47

Randomised Complete Block Design (RCBD)


  • b blocks of size t
  • t treatments randomised to t units within each block

observation=mean+treatment+block+error

27/47

Latin Square Design (LSD)


  • two orthogonal blocks of size t
  • t treatments randomised to units such that every treatment appears exactly once in each block

observation=mean+treatment+row+column+error

28/47

Balanced Incomplete Block Design (BIBD)


  • b blocks of size k<t
  • t treatments randomised to units within each block such that every pair of treatment appears the same number of times across blocks

observation=mean+block+treatment+error

29/47

Factorial Design


  • ab=t treatments randomised to n units
  • treatment is every combination of two factors A and B

observation=mean+A+B+A:B+error

30/47

Split-plot Design


  • n1 whole plots consisting of b sub plots
  • in total there are n sub plots
  • treatment factor A is randomised to whole plots
  • treatment factor B is randomised to sub plots within each whole plot

observation=mean+A+WP+B+A:B+error

31/47

CRAN Task View of Design of Experiments

contains

πŸ“¦ 229 R-packages

as of 2020-07-14

(Please note that there may be some webscrapping error)

32/47

Top 10 downloaded R-packages in 2018

agricolae is the most downloaded

(data from cranlogs from 2018-01-01 to 2018-12-31)

33/47

Top 10 downloaded R-packages in 2019

Okay fine, results were different in 2019

(data from cranlogs from 2019-01-01 to 2019-12-31)

Please note agricolae imports AlgDesign

34/47

agricolae::design.crd

Completely randomised design for t=3 treatments with 2 replicates each

trt <- c("A", "B", "C")
agricolae::design.crd(trt = trt, r = 2) %>% glimpse()
## List of 2
## $ parameters:List of 7
## ..$ design: chr "crd"
## ..$ trt : chr [1:3] "A" "B" "C"
## ..$ r : num [1:3] 2 2 2
## ..$ serie : num 2
## ..$ seed : int 396269021
## ..$ kinds : chr "Super-Duper"
## ..$ : logi TRUE
## $ book :'data.frame': 6 obs. of 3 variables:
## ..$ plots: num [1:6] 101 102 103 104 105 106
## ..$ r : int [1:6] 1 1 2 1 2 2
## ..$ trt : chr [1:6] "C" "A" "C" "B" ...
35/47

agricolae::design.rcbd

Randomised complete block design for t=3 treatments with 2 blocks

trt <- c("A", "B", "C")
agricolae::design.rcbd(trt = trt, r = 2) %>% glimpse()
## List of 3
## $ parameters:List of 7
## ..$ design: chr "rcbd"
## ..$ trt : chr [1:3] "A" "B" "C"
## ..$ r : num 2
## ..$ serie : num 2
## ..$ seed : int 973575053
## ..$ kinds : chr "Super-Duper"
## ..$ : logi TRUE
## $ sketch : chr [1:2, 1:3] "C" "A" "B" "C" ...
## $ book :'data.frame': 6 obs. of 3 variables:
## ..$ plots: num [1:6] 101 102 103 201 202 203
## ..$ block: Factor w/ 2 levels "1","2": 1 1 1 2 2 2
## ..$ trt : Factor w/ 3 levels "A","B","C": 3 2 1 1 3 2

36/47

agricolae::design.lsd()

Latin square design for t=3 treatments

trt <- c("A", "B", "C")
agricolae::design.lsd(trt = trt) %>% glimpse()
## List of 3
## $ parameters:List of 7
## ..$ design: chr "lsd"
## ..$ trt : chr [1:3] "A" "B" "C"
## ..$ r : int 3
## ..$ serie : num 2
## ..$ seed : int -1984440067
## ..$ kinds : chr "Super-Duper"
## ..$ : logi TRUE
## $ sketch : chr [1:3, 1:3] "B" "C" "A" "A" ...
## $ book :'data.frame': 9 obs. of 4 variables:
## ..$ plots: num [1:9] 101 102 103 201 202 203 301 302 303
## ..$ row : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3
## ..$ col : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3
## ..$ trt : Factor w/ 3 levels "A","B","C": 2 1 3 3 2 1 1 3 2

37/47

agricolae::design.bib()

Balanced incomplete block design for t=3 treatments with block size of 2

trt <- c("A", "B", "C")
agricolae::design.bib(trt = trt, k = 2) %>% glimpse()
## [1] "No improvement over initial random design."
##
## Parameters BIB
## ==============
## Lambda : 1
## treatmeans : 3
## Block size : 2
## Blocks : 3
## Replication: 2
##
## Efficiency factor 0.75
##
## <<< Book >>>
## List of 4
## $ parameters:List of 6
## ..$ design: chr "bib"
## ..$ trt : chr [1:3] "A" "B" "C"
## ..$ k : num 2
## ..$ serie : num 2
## ..$ seed : int -509134623
## ..$ kinds : chr "Super-Duper"
## $ statistics:'data.frame': 1 obs. of 6 variables:
## ..$ lambda : num 1
## ..$ treatmeans: int 3
## ..$ blockSize : num 2
## ..$ blocks : int 3
## ..$ r : num 2
## ..$ Efficiency: num 0.75
## $ sketch : chr [1:3, 1:2] "C" "B" "B" "A" ...
## $ book :'data.frame': 6 obs. of 3 variables:
## ..$ plots: num [1:6] 101 102 201 202 301 302
## ..$ block: Factor w/ 3 levels "1","2","3": 1 1 2 2 3 3
## ..$ trt : Factor w/ 3 levels "A","B","C": 3 1 2 1 2 3

38/47

agricolae::design.ab()

Factorial design for t=3Γ—2 treatments with 2 replication for each treatment

agricolae::design.ab(trt = c(3, 2), r = 2, design = "crd") %>% glimpse()
## List of 2
## $ parameters:List of 8
## ..$ design : chr "factorial"
## ..$ trt : chr [1:6] "1 1" "1 2" "2 1" "2 2" ...
## ..$ r : num [1:6] 2 2 2 2 2 2
## ..$ serie : num 2
## ..$ seed : int 801301585
## ..$ kinds : chr "Super-Duper"
## ..$ : logi TRUE
## ..$ applied: chr "crd"
## $ book :'data.frame': 12 obs. of 4 variables:
## ..$ plots: num [1:12] 101 102 103 104 105 106 107 108 109 110 ...
## ..$ r : int [1:12] 1 1 1 2 1 1 1 2 2 2 ...
## ..$ A : chr [1:12] "2" "3" "2" "2" ...
## ..$ B : chr [1:12] "2" "2" "1" "2" ...

Note not A/B testing!



39/47

agricolae::design.split()

Split-plot design for t=2Γ—4 treatments with 2 replication for each treatment

trt1 <- c("I", "R"); trt2 <- LETTERS[1:4]
agricolae::design.split(trt1 = trt1, trt2 = trt2, r = 2, design = "crd") %>%
glimpse()
## List of 2
## $ parameters:List of 8
## ..$ design : chr "split"
## ..$ : logi TRUE
## ..$ trt1 : chr [1:2] "I" "R"
## ..$ applied: chr "crd"
## ..$ r : num [1:2] 2 2
## ..$ serie : num 2
## ..$ seed : int -765020087
## ..$ kinds : chr "Super-Duper"
## $ book :'data.frame': 16 obs. of 5 variables:
## ..$ plots : num [1:16] 101 101 101 101 102 102 102 102 103 103 ...
## ..$ splots: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
## ..$ r : int [1:16] 1 1 1 1 1 1 1 1 2 2 ...
## ..$ trt1 : chr [1:16] "I" "I" "I" "I" ...
## ..$ trt2 : chr [1:16] "D" "A" "C" "B" ...
40/47

Grammar of experimental design

41/47

Grammar of experimental design

  • Remember Base R plots? base plot
  • πŸ€” "named experimental design" functions (agricolae::design.crd, etc.) are like "named statistical graphic" functions (pie, barplot)
41/47

Grammar of experimental design

  • Remember Base R plots? base plot
  • πŸ€” "named experimental design" functions (agricolae::design.crd, etc.) are like "named statistical graphic" functions (pie, barplot)
  • πŸ’‘ construction of experimental design needs to be made more accessible, modifiable, extensible and generalisable
41/47

Grammar of experimental design

  • Remember Base R plots? base plot
  • πŸ€” "named experimental design" functions (agricolae::design.crd, etc.) are like "named statistical graphic" functions (pie, barplot)
  • πŸ’‘ construction of experimental design needs to be made more accessible, modifiable, extensible and generalisable
  • This is the concept for the development of the edibble R-package
    https://github.com/emitanaka/edibble
    ( WIP, sorry code not ready yet, please enjoy prototype and live demo instead)
41/47

Grammar of experimental design

  • Remember Base R plots? base plot
  • πŸ€” "named experimental design" functions (agricolae::design.crd, etc.) are like "named statistical graphic" functions (pie, barplot)
  • πŸ’‘ construction of experimental design needs to be made more accessible, modifiable, extensible and generalisable
  • This is the concept for the development of the edibble R-package
    https://github.com/emitanaka/edibble
    ( WIP, sorry code not ready yet, please enjoy prototype and live demo instead)
  • tibble R-package is a modern reimagining of the data.frame
  • edibble (WIP) creates experimental design tibbles

MΓΌller & Wickham (2020) tibble: Simple Data Frames. R package version 3.0.3

41/47

Prototype for grammar of experimental design

  • Consider a field experiment with 32 plots
library(edibble)
initiate_design(seed = 2020) %>%
set_units(plot = 32)
## An edibble design
## └─plot (32 levels)
42/47

Prototype Grammar of Experimental Design

  • Consider a field experiment with 32 plots
  • There are 7 different treatments
library(edibble)
initiate_design(seed = 2020) %>%
set_units(plot = 32) %>%
set_trts(trt = c("F12", "F3", "F6", "O",
"S12", "S3", "S6"))
## An edibble design
## β”œβ”€plot (32 levels)
## └─trt (7 levels)
42/47

Prototype Grammar of Experimental Design 1

  • Consider a field experiment with 32 plots
  • There are 7 different treatments
  • Apply treatments to plots
library(edibble)
initiate_design(seed = 2020) %>%
set_units(plot = 32) %>%
set_trts(trt = c("F12", "F3", "F6", "O",
"S12", "S3", "S6")) %>%
apply_trts(~ plot)
## An edibble design
## β”œβ”€plot (32 levels)
## └─trt (7 levels)
42/47

Prototype Grammar of Experimental Design 1

  • Consider a field experiment with 32 plots
  • There are 7 different treatments
  • Apply treatments to plots
  • Randomise treatments to plots
library(edibble)
initiate_design(seed = 2020) %>%
set_units(plot = 32) %>%
set_trts(trt = c("F12", "F3", "F6", "O",
"S12", "S3", "S6")) %>%
apply_trts(~ plot) %>%
randomise_trts()
## An edibble design
## β”œβ”€plot (32 levels)
## └─trt (7 levels)
42/47

Prototype Grammar of Experimental Design 1

  • Consider a field experiment with 32 plots
  • There are 7 different treatments
  • Apply treatments to plots
  • Randomise treatments to plots
  • The result is a completely randomised design
library(edibble)
initiate_design(seed = 2020) %>%
set_units(plot = 32) %>%
set_trts(trt = c("F12", "F3", "F6", "O",
"S12", "S3", "S6")) %>%
apply_trts(~ plot) %>%
randomise_trts() %>%
serve_table()
## # An edibble: 32 x 2
## plot trt
## <fct> <fct>
## 1 plot1 F12
## 2 plot2 S3
## 3 plot3 O
## 4 plot4 S6
## 5 plot5 F3
## 6 plot6 S12
## 7 plot7 F6
## 8 plot8 S6
## 9 plot9 F3
## 10 plot10 S3
## # … with 22 more rows
42/47

Prototype Grammar of Experimental Design 2

  • Consider a field experiment with 2 blocks each with 16 plots
  • There are 7 different treatments
  • Apply treatments to plots
  • Randomise treatments to plots within blocks
  • Resulting design is a randomised block design


Can you see how it differs from the previous design?

library(edibble)
initiate_design(seed = 2020) %>%
set_units(block = c("B1", "B2"),
plot = nested_in(block, 16)) %>%
set_trts(trt = c("F12", "F3", "F6", "O",
"S12", "S3", "S6")) %>%
apply_trts(~ plot) %>%
randomise_trts() %>%
serve_table()
## # An edibble: 32 x 3
## block plot trt
## <fct> <fct> <fct>
## 1 B1 plot1 F12
## 2 B1 plot2 S3
## 3 B1 plot3 S12
## 4 B1 plot4 S6
## 5 B1 plot5 F3
## 6 B1 plot6 S3
## 7 B1 plot7 F6
## 8 B1 plot8 S6
## 9 B1 plot9 F3
## 10 B1 plot10 S6
## # … with 22 more rows
43/47

LIVE DEMO

44/47

Limitations

  • Generalising experimental designs is extremely hard
  • A lot of domains have their own specialised design
45/47

Limitations

  • Generalising experimental designs is extremely hard
  • A lot of domains have their own specialised design
  • Some specialist designs that have certain properties cannot be easily replicated by edibble
  • An ad-hoc approach may still be required to create design with specific properties
45/47

Limitations

  • Generalising experimental designs is extremely hard
  • A lot of domains have their own specialised design
  • Some specialist designs that have certain properties cannot be easily replicated by edibble
  • An ad-hoc approach may still be required to create design with specific properties but the hope is that this can be achieved with replacement in one of the steps (e.g. replacing randomise_trts)
45/47

Limitations

  • Generalising experimental designs is extremely hard
  • A lot of domains have their own specialised design
  • Some specialist designs that have certain properties cannot be easily replicated by edibble
  • An ad-hoc approach may still be required to create design with specific properties but the hope is that this can be achieved with replacement in one of the steps (e.g. replacing randomise_trts)
  • So why would anyone buy into this framework?
45/47

Limitations

  • Generalising experimental designs is extremely hard
  • A lot of domains have their own specialised design
  • Some specialist designs that have certain properties cannot be easily replicated by edibble
  • An ad-hoc approach may still be required to create design with specific properties but the hope is that this can be achieved with replacement in one of the steps (e.g. replacing randomise_trts)
  • So why would anyone buy into this framework?
  • Hopefully, developers and users can see the benefit in being more expressive in code; and also a pay-off like easy visualisation of designs becomes worthwhile
45/47

Timeline

  • By end of 2020, I intend to have edibble be able to construct textbook (mostly orthogonal) designs.
  • From 2021, I plan to hold (free) workshops in experimental design with edibble to generate certain designs.
  • The initial development for edibble is focussed on designs with "fixed" structures.
  • For survey and sampling designs you should check out DeclareDesign.



Feedback is welcomed!


             emitanaka/edibble/issues
             emi.tanaka@monash.edu

Slides can be found at emitanaka.org/slides/SSA2020/
46/47

Acknowledgements

This slide was made using xaringan R-package and the following systems.

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.1 (2020-06-06)
## os macOS Catalina 10.15.7
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_AU.UTF-8
## ctype en_AU.UTF-8
## tz Australia/Melbourne
## date 2020-11-27
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## agricolae 1.3-3 2020-06-07 [1] CRAN (R 4.0.1)
## AlgDesign 1.2.0 2019-11-29 [1] CRAN (R 4.0.1)
## anicon 0.1.0 2020-06-21 [1] Github (emitanaka/anicon@0b756df)
## assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.0.0)
## backports 1.2.0 2020-11-02 [1] CRAN (R 4.0.2)
## broom 0.7.2 2020-10-20 [1] CRAN (R 4.0.2)
## callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.2)
## cellranger 1.1.0 2016-07-27 [2] CRAN (R 4.0.0)
## cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.1)
## cluster 2.1.0 2019-06-19 [2] CRAN (R 4.0.1)
## colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.2)
## combinat 0.0-8 2012-10-29 [1] CRAN (R 4.0.1)
## countdown 0.3.5 2020-07-20 [1] Github (gadenbuie/countdown@a544fa4)
## crayon 1.3.4 2017-09-16 [2] CRAN (R 4.0.0)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.2)
## dbplyr 2.0.0 2020-11-03 [1] CRAN (R 4.0.2)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
## dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
## edibble * 0.0.0.9000 2020-11-15 [1] local
## ellipsis 0.3.1 2020-05-15 [2] CRAN (R 4.0.0)
## emo 0.0.0.9000 2020-06-26 [1] Github (hadley/emo@3f03b11)
## evaluate 0.14 2019-05-28 [2] CRAN (R 4.0.0)
## fansi 0.4.1 2020-01-08 [2] CRAN (R 4.0.0)
## farver 2.0.3.9000 2020-07-24 [1] Github (thomasp85/farver@f1bcb56)
## fastmap 1.0.1 2019-10-08 [2] CRAN (R 4.0.0)
## forcats * 0.5.0 2020-03-01 [2] CRAN (R 4.0.0)
## fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
## generics 0.1.0 2020-10-31 [2] CRAN (R 4.0.2)
## ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.2)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
## gtable 0.3.0 2019-03-25 [2] CRAN (R 4.0.0)
## haven 2.3.1 2020-06-01 [2] CRAN (R 4.0.0)
## highr 0.8 2019-03-20 [2] CRAN (R 4.0.0)
## hms 0.5.3 2020-01-08 [2] CRAN (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
## httpuv 1.5.4 2020-06-06 [2] CRAN (R 4.0.1)
## httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
## icon 0.1.0 2020-06-21 [1] Github (emitanaka/icon@8458546)
## igraph 1.2.6 2020-10-06 [1] CRAN (R 4.0.2)
## jsonlite 1.7.1 2020-09-07 [1] CRAN (R 4.0.2)
## klaR 0.6-15 2020-02-19 [1] CRAN (R 4.0.1)
## knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.2)
## labelled 2.7.0 2020-09-21 [1] CRAN (R 4.0.2)
## later 1.1.0.1 2020-06-05 [2] CRAN (R 4.0.1)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.1)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
## lubridate 1.7.9 2020-06-08 [2] CRAN (R 4.0.1)
## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
## MASS 7.3-53 2020-09-09 [1] CRAN (R 4.0.2)
## mime 0.9 2020-02-04 [2] CRAN (R 4.0.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.0.0)
## modelr 0.1.8 2020-05-19 [2] CRAN (R 4.0.0)
## munsell 0.5.0 2018-06-12 [2] CRAN (R 4.0.0)
## nlme 3.1-150 2020-10-24 [2] CRAN (R 4.0.2)
## pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.1)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.0.0)
## processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
## promises 1.1.1 2020-06-09 [1] CRAN (R 4.0.0)
## ps 1.4.0 2020-10-07 [1] CRAN (R 4.0.2)
## purrr * 0.3.4 2020-04-17 [2] CRAN (R 4.0.0)
## questionr 0.7.3 2020-10-05 [1] CRAN (R 4.0.2)
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
## Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.0)
## readr * 1.4.0 2020-10-05 [2] CRAN (R 4.0.2)
## readxl 1.3.1 2019-03-13 [2] CRAN (R 4.0.0)
## reprex 0.3.0.9001 2020-08-08 [1] Github (tidyverse/reprex@9594ee9)
## rlang 0.4.8 2020-10-08 [1] CRAN (R 4.0.2)
## rmarkdown 2.5 2020-10-21 [1] CRAN (R 4.0.1)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.1)
## rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
## scales 1.1.1 2020-05-11 [2] CRAN (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 4.0.0)
## shiny 1.5.0 2020-06-23 [1] CRAN (R 4.0.2)
## stringi 1.5.3 2020-09-09 [2] CRAN (R 4.0.2)
## stringr * 1.4.0 2019-02-10 [2] CRAN (R 4.0.0)
## tibble * 3.0.4.9000 2020-11-26 [1] Github (tidyverse/tibble@9eeef4d)
## tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [2] CRAN (R 4.0.0)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.2)
## utf8 1.1.4 2018-05-24 [2] CRAN (R 4.0.0)
## vctrs 0.3.5.9000 2020-11-26 [1] Github (r-lib/vctrs@957baf7)
## viridisLite 0.3.0 2018-02-01 [2] CRAN (R 4.0.0)
## whisker 0.4 2019-08-28 [2] CRAN (R 4.0.0)
## withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
## xaringan 0.18 2020-10-21 [1] CRAN (R 4.0.2)
## xfun 0.19 2020-10-30 [1] CRAN (R 4.0.2)
## xml2 1.3.2 2020-04-23 [2] CRAN (R 4.0.0)
## xtable 1.8-4 2019-04-21 [2] CRAN (R 4.0.0)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
##
## [1] /Users/etan0038/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

(Scroll on html slide to see all)

47/47


Advent of "Grammar"


Bridging Statistics and Data Science for the Design of Experiments

Emi Tanaka

Department of Econometrics and Business Statistics

emi.tanaka@monash.edu @statsgen

27th Nov 2020 @ Statistical Society of Australia


1/47
Paused

Help

Keyboard shortcuts

↑, ←, Pg Up, k Go to previous slide
↓, β†’, Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow