A tool to easily simulate valid experimental data

Emi Tanaka

Biological Data Science Institute

30th November 2023

Motivation

Designing an experiment (Statistician’s version)



Today is about the lone statistician ensuring their proposed experimental design is sensible.

Backcasting

Define a desirable future and then works backwards.

Use simulated data to check your experimental design

  • Consider backcasting as another tool for diagnosing experimental design.
  • But simulating experimental data isn’t routine practice! Why?
  • Perhaps it requires too much effort in practice for regular adoption.

Experimental Design

An example

Glasshouse tomato production*

Aim: Study production of glasshouse tomatoes under different air and soil temperatures.

  • There are 8 glasshouse compartments.
  • Each compartment contained two large troughs where tomatoes are grown.
  • For each trough, half of the soil would be heated or unheated.
  • From each half-trough (which we call plot), 3 tomatoes will be sampled for observations.
  • The compartment was kept at minimum of either 13°C or 16°C.
  • Heated soil was either kept at 18°C or 21°C.

Encoding the experimental design

edibble R package will be used to encode the experimental design (then later simulate the experimental data).

library(edibble)




Unit structure

tomato_units <- set_units(compartment = 8,
                               trough = nested_in(compartment, 2),
                                 plot = nested_in(trough, 2),
                               tomato = nested_in(plot, 3))
  • 8 compartments
  • each with 2 troughs
  • each trough is separated to 2 plots
  • from each plot, 3 tomatoes will be sampled

Treatment structure

tomato_trts <- set_trts(air_temp = c("13°C", "16°C"),
                          heated = c("Yes", "No"),
                       soil_temp = conditioned_on(heated,
                                          "Yes" ~ c("18°C", "21°C"),
                                           "No" ~ "Ambient"))
  • air temperature of 13°C or 16°C
  • half of the soil is heated or unheated
  • if heated, soil temperature is kept at 18°C or 21°C
  • if unheated, kept ambient.

Mapping units to treatments

tomato_mapping <- allot_trts(air_temp ~ compartment,
                               heated ~ plot,
                            soil_temp ~ plot)
  • different air temperature allocated to compartment
  • half of soil (plot ) in each trough is heated
  • different soil temperature on heated plot

tomato_assign1 <- tomato_mapping + 
  assign_trts(order = c("random", "random", "systematic-slowest"), seed = 1) +
  serve_table(label_nested = c(plot, trough))

tomato_assign2 <- tomato_mapping + 
  assign_trts(order = c("random", "random", "blocksdesign"), seed = 1) +
  serve_table(label_nested = c(plot, trough))
  • assignment algorithms used for each allotment (not the main point of this talk)

Experimental design for the tomato experiment

tomato_design1 <- tomato_units + tomato_trts + tomato_assign1
tomato_design2 <- tomato_units + tomato_trts + tomato_assign2

Measurements recorded

tomato_expt <- tomato_design2 %>% 
  set_rcrds_of(tomato = c("shape", "weight"),
               plot = c("ntomatoes", "yield")) %>% 
  expect_rcrds(factor(shape, levels = c("flat", "round", "other")),
               yield > 0, yield < 10, weight > 0, weight < 1000,
               ntomatoes >= 0L)
  • The shape and weight of each sampled tomato will be recorded.
  • The number of tomatoes and yield from each plot will be recorded.

⚖️ Domain experts (or the literature) generally have some idea about sensible values or distribution of the experimental data.

  • The shape will be recorded as flat, round, or other.
  • The yield is expected to be between 0 and 10kg and weight is between 0 and 1000g.
  • The number of tomatoes is expected to be a positive integer.

Simulation Tool

Simulating experimental data

  • 👩‍💻 Simulating experimental data (if done at all) are often done in a bespoke manner.
compartment trough plot tomato air_temp heated soil_temp
compartment1 trough1 plot1 tomato01 13°C Yes 21°C
compartment1 trough1 plot1 tomato02 13°C Yes 21°C
compartment1 trough1 plot1 tomato03 13°C Yes 21°C
compartment1 trough1 plot2 tomato04 13°C No Ambient
compartment1 trough1 plot2 tomato05 13°C No Ambient
compartment1 trough1 plot2 tomato06 13°C No Ambient
compartment1 trough2 plot1 tomato07 13°C No Ambient
compartment1 trough2 plot1 tomato08 13°C No Ambient
compartment1 trough2 plot1 tomato09 13°C No Ambient
compartment1 trough2 plot2 tomato10 13°C Yes 18°C
compartment1 trough2 plot2 tomato11 13°C Yes 18°C
compartment1 trough2 plot2 tomato12 13°C Yes 18°C
compartment2 trough1 plot1 tomato13 13°C Yes 18°C
compartment2 trough1 plot1 tomato14 13°C Yes 18°C
compartment2 trough1 plot1 tomato15 13°C Yes 18°C
compartment2 trough1 plot2 tomato16 13°C No Ambient
compartment2 trough1 plot2 tomato17 13°C No Ambient
compartment2 trough1 plot2 tomato18 13°C No Ambient
compartment2 trough2 plot1 tomato19 13°C Yes 21°C
compartment2 trough2 plot1 tomato20 13°C Yes 21°C
compartment2 trough2 plot1 tomato21 13°C Yes 21°C
compartment2 trough2 plot2 tomato22 13°C No Ambient
compartment2 trough2 plot2 tomato23 13°C No Ambient
compartment2 trough2 plot2 tomato24 13°C No Ambient
compartment3 trough1 plot1 tomato25 16°C No Ambient
compartment3 trough1 plot1 tomato26 16°C No Ambient
compartment3 trough1 plot1 tomato27 16°C No Ambient
compartment3 trough1 plot2 tomato28 16°C Yes 18°C
compartment3 trough1 plot2 tomato29 16°C Yes 18°C
compartment3 trough1 plot2 tomato30 16°C Yes 18°C
compartment3 trough2 plot1 tomato31 16°C No Ambient
compartment3 trough2 plot1 tomato32 16°C No Ambient
compartment3 trough2 plot1 tomato33 16°C No Ambient
compartment3 trough2 plot2 tomato34 16°C Yes 21°C
compartment3 trough2 plot2 tomato35 16°C Yes 21°C
compartment3 trough2 plot2 tomato36 16°C Yes 21°C
compartment4 trough1 plot1 tomato37 13°C No Ambient
compartment4 trough1 plot1 tomato38 13°C No Ambient
compartment4 trough1 plot1 tomato39 13°C No Ambient
compartment4 trough1 plot2 tomato40 13°C Yes 21°C
compartment4 trough1 plot2 tomato41 13°C Yes 21°C
compartment4 trough1 plot2 tomato42 13°C Yes 21°C
compartment4 trough2 plot1 tomato43 13°C Yes 18°C
compartment4 trough2 plot1 tomato44 13°C Yes 18°C
compartment4 trough2 plot1 tomato45 13°C Yes 18°C
compartment4 trough2 plot2 tomato46 13°C No Ambient
compartment4 trough2 plot2 tomato47 13°C No Ambient
compartment4 trough2 plot2 tomato48 13°C No Ambient
compartment5 trough1 plot1 tomato49 13°C Yes 21°C
compartment5 trough1 plot1 tomato50 13°C Yes 21°C
compartment5 trough1 plot1 tomato51 13°C Yes 21°C
compartment5 trough1 plot2 tomato52 13°C No Ambient
compartment5 trough1 plot2 tomato53 13°C No Ambient
compartment5 trough1 plot2 tomato54 13°C No Ambient
compartment5 trough2 plot1 tomato55 13°C Yes 18°C
compartment5 trough2 plot1 tomato56 13°C Yes 18°C
compartment5 trough2 plot1 tomato57 13°C Yes 18°C
compartment5 trough2 plot2 tomato58 13°C No Ambient
compartment5 trough2 plot2 tomato59 13°C No Ambient
compartment5 trough2 plot2 tomato60 13°C No Ambient
compartment6 trough1 plot1 tomato61 16°C Yes 18°C
compartment6 trough1 plot1 tomato62 16°C Yes 18°C
compartment6 trough1 plot1 tomato63 16°C Yes 18°C
compartment6 trough1 plot2 tomato64 16°C No Ambient
compartment6 trough1 plot2 tomato65 16°C No Ambient
compartment6 trough1 plot2 tomato66 16°C No Ambient
compartment6 trough2 plot1 tomato67 16°C Yes 21°C
compartment6 trough2 plot1 tomato68 16°C Yes 21°C
compartment6 trough2 plot1 tomato69 16°C Yes 21°C
compartment6 trough2 plot2 tomato70 16°C No Ambient
compartment6 trough2 plot2 tomato71 16°C No Ambient
compartment6 trough2 plot2 tomato72 16°C No Ambient
compartment7 trough1 plot1 tomato73 16°C No Ambient
compartment7 trough1 plot1 tomato74 16°C No Ambient
compartment7 trough1 plot1 tomato75 16°C No Ambient
compartment7 trough1 plot2 tomato76 16°C Yes 21°C
compartment7 trough1 plot2 tomato77 16°C Yes 21°C
compartment7 trough1 plot2 tomato78 16°C Yes 21°C
compartment7 trough2 plot1 tomato79 16°C No Ambient
compartment7 trough2 plot1 tomato80 16°C No Ambient
compartment7 trough2 plot1 tomato81 16°C No Ambient
compartment7 trough2 plot2 tomato82 16°C Yes 18°C
compartment7 trough2 plot2 tomato83 16°C Yes 18°C
compartment7 trough2 plot2 tomato84 16°C Yes 18°C
compartment8 trough1 plot1 tomato85 16°C No Ambient
compartment8 trough1 plot1 tomato86 16°C No Ambient
compartment8 trough1 plot1 tomato87 16°C No Ambient
compartment8 trough1 plot2 tomato88 16°C Yes 21°C
compartment8 trough1 plot2 tomato89 16°C Yes 21°C
compartment8 trough1 plot2 tomato90 16°C Yes 21°C
compartment8 trough2 plot1 tomato91 16°C No Ambient
compartment8 trough2 plot1 tomato92 16°C No Ambient
compartment8 trough2 plot1 tomato93 16°C No Ambient
compartment8 trough2 plot2 tomato94 16°C Yes 18°C
compartment8 trough2 plot2 tomato95 16°C Yes 18°C
compartment8 trough2 plot2 tomato96 16°C Yes 18°C

Some bespoke approaches to simulating data:

library(dplyr)
sim_bespoke <- tomato_expt %>% 
  mutate(yield_same = 1) %>% 
  mutate(yield_null = rnorm(n(), 5, 1)) %>% 
  mutate(yield_trts = 5 + c("13°C" = -1.1, "16°C" = 0)[air_temp] + 
           c("18°C" = 1.3, "21°C" = 2.1, "Ambient" = 0)[soil_temp] + 
           rnorm(n(), 0, 1)) %>% 
  mutate(yield_sensible = 5 + yield_trts + 
           rnorm(nlevels(compartment), 0, 1)[index_levels(compartment)] +
           rnorm(nlevels(trough), 0, 1)[index_levels(trough)] + 
           rnorm(nlevels(plot), 0, 1)[index_levels(plot)])

yield_same: A quick dummy response data. yield_null: Ignore influence of treatment and unit structures. yield_trts: Considers the treatment effects (more realistic). yield_sensible: Adds possible unit variations on top of the treatment effects (more sensible).

Analysis of simulated data

summary(aov(yield_same ~ air_temp * soil_temp + 
          Error(compartment/trough/plot), sim_bespoke))

Error: compartment
          Df    Sum Sq   Mean Sq F value Pr(>F)
air_temp   1 3.782e-30 3.782e-30       1  0.356
Residuals  6 2.269e-29 3.782e-30               

Error: compartment:trough
                   Df    Sum Sq   Mean Sq F value Pr(>F)
soil_temp           1 3.782e-30 3.782e-30       1  0.356
air_temp:soil_temp  1 3.782e-30 3.782e-30       1  0.356
Residuals           6 2.269e-29 3.782e-30               

Error: compartment:trough:plot
                   Df    Sum Sq   Mean Sq F value Pr(>F)
soil_temp           2 7.560e-30 3.782e-30       1  0.397
air_temp:soil_temp  2 7.560e-30 3.782e-30       1  0.397
Residuals          12 4.538e-29 3.782e-30               

Error: Within
          Df   Sum Sq   Mean Sq F value Pr(>F)
Residuals 64 2.42e-28 3.782e-30               
summary(aov(yield_null ~ air_temp * soil_temp + 
          Error(compartment/trough/plot), sim_bespoke))

Error: compartment
          Df Sum Sq Mean Sq F value Pr(>F)
air_temp   1  0.169  0.1691   0.093  0.771
Residuals  6 10.957  1.8261               

Error: compartment:trough
                   Df Sum Sq Mean Sq F value Pr(>F)
soil_temp           1  0.432  0.4325   0.260  0.628
air_temp:soil_temp  1  0.080  0.0803   0.048  0.833
Residuals           6  9.989  1.6649               

Error: compartment:trough:plot
                   Df Sum Sq Mean Sq F value Pr(>F)
soil_temp           2  0.618  0.3090   0.232  0.797
air_temp:soil_temp  2  1.391  0.6953   0.521  0.607
Residuals          12 16.004  1.3337               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 64   60.4  0.9437               
summary(aov(yield_trts ~ air_temp * soil_temp + 
          Error(compartment/trough/plot), sim_bespoke))

Error: compartment
          Df Sum Sq Mean Sq F value  Pr(>F)   
air_temp   1 15.873  15.873   20.07 0.00419 **
Residuals  6  4.745   0.791                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Error: compartment:trough
                   Df Sum Sq Mean Sq F value   Pr(>F)    
soil_temp           1  2.899   2.899   63.74 0.000206 ***
air_temp:soil_temp  1  3.815   3.815   83.87 9.54e-05 ***
Residuals           6  0.273   0.045                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Error: compartment:trough:plot
                   Df Sum Sq Mean Sq F value   Pr(>F)    
soil_temp           2  48.71  24.357  63.872 4.01e-07 ***
air_temp:soil_temp  2   1.08   0.542   1.422    0.279    
Residuals          12   4.58   0.381                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 64  101.5   1.586               
summary(aov(yield_sensible ~ air_temp * soil_temp + 
          Error(compartment/trough/plot), sim_bespoke))

Error: compartment
          Df Sum Sq Mean Sq F value Pr(>F)
air_temp   1   0.67   0.674    0.03  0.869
Residuals  6 136.41  22.735               

Error: compartment:trough
                   Df Sum Sq Mean Sq F value Pr(>F)  
soil_temp           1  29.33  29.331   7.447 0.0342 *
air_temp:soil_temp  1  29.63  29.630   7.523 0.0336 *
Residuals           6  23.63   3.938                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Error: compartment:trough:plot
                   Df Sum Sq Mean Sq F value  Pr(>F)   
soil_temp           2  82.23   41.11  11.275 0.00176 **
air_temp:soil_temp  2   3.74    1.87   0.514 0.61096   
Residuals          12  43.76    3.65                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 64  101.5   1.586               

Some simulated experimental data are not valid values

  • Notice that yield is measured on the plot level, but variation exists within plot.

Writing the simulation process with edibble

tomato_sim <- tomato_expt %>% 
  simulate_process(
    yield = function(air = -1.1, soil = c(1.3, 2.1)) {
      air_effects <- c("13°C" = air, "16°C" = 0)
      soil_effects <- c("18°C" = soil[1], "21°C" = soil[2], "Ambient" = 0)
      compartment_noise <- rnorm(nlevels(compartment), 0, 1)
      trough_noise <- rnorm(nlevels(trough), 0, 1)
      plot_noise <- rnorm(nlevels(plot), 0, 1)
      tomato_noise <- rnorm(nlevels(tomato), 0, 1)
      
      5 + air_effects[air_temp] + soil_effects[soil_temp] + 
        compartment_noise[index_levels(compartment)] +
        trough_noise[index_levels(trough)] +
        plot_noise[index_levels(plot)] +
        tomato_noise[index_levels(tomato)]
  })

Simulating data with edibble

tomato_sim1 <- tomato_sim %>% 
  simulate_rcrds(yield = with_params(air = -3), .nsim = 2, .seed = 2023)

tomato_sim2 <- tomato_sim %>% 
  simulate_rcrds(yield = with_params(.aggregate = median, # default = mean
                                     .censor = c(0, 10)), # default = NA
                 .nsim = 2, .seed = 2023)




A tool to easily simulate valid experimental data




A tool to easily simulate valid experimental data

Filling forms

Autofill saves time!

Why can’t we have an automated good-enough simulation?

Autofill records with edibble

(simauto <- autofill_rcrds(tomato_expt, .nsim = 1, .seed = 1))
# An edibble: 96 x 12
    compartment  trough    plot   tomato air_temp heated soil_temp shape weight ntomatoes yield  .sim
         <U(8)> <U(16)> <U(32)>  <U(96)>   <T(2)> <T(2)>    <T(3)>                                   
          <chr>   <chr>   <chr>    <chr>    <chr>  <chr>     <chr> <fct>  <dbl>     <int> <dbl> <int>
 1 compartment1 trough1   plot1 tomato01     13°C    Yes   21°C    flat    304.        38 5.40      1
 2 compartment1 trough1   plot1 tomato02     13°C    Yes   21°C    other   466.        38 5.40      1
 3 compartment1 trough1   plot1 tomato03     13°C    Yes   21°C    flat    405.        38 5.40      1
 4 compartment1 trough1   plot2 tomato04     13°C    No    Ambient flat    574.        18 3.96      1
 5 compartment1 trough1   plot2 tomato05     13°C    No    Ambient flat    482.        18 3.96      1
 6 compartment1 trough1   plot2 tomato06     13°C    No    Ambient other   299.        18 3.96      1
 7 compartment1 trough2   plot1 tomato07     13°C    No    Ambient round   377.        51 2.82      1
 8 compartment1 trough2   plot1 tomato08     13°C    No    Ambient round   488.        51 2.82      1
 9 compartment1 trough2   plot1 tomato09     13°C    No    Ambient flat    613.        51 2.82      1
10 compartment1 trough2   plot2 tomato10     13°C    Yes   18°C    round   522.        22 0.795     1
# ℹ 86 more rows

summary(aov(yield ~ air_temp * soil_temp + Error(compartment/trough/plot), data = simauto))

Error: compartment
          Df Sum Sq Mean Sq F value Pr(>F)
air_temp   1   7.37   7.372   0.853  0.391
Residuals  6  51.87   8.646               

Error: compartment:trough
                   Df Sum Sq Mean Sq F value Pr(>F)
soil_temp           1  16.00   16.00   0.811  0.403
air_temp:soil_temp  1  59.52   59.52   3.017  0.133
Residuals           6 118.39   19.73               

Error: compartment:trough:plot
                   Df Sum Sq Mean Sq F value Pr(>F)  
soil_temp           2  84.94   42.47   3.956 0.0479 *
air_temp:soil_temp  2  80.90   40.45   3.768 0.0537 .
Residuals          12 128.83   10.74                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Error: Within
          Df    Sum Sq   Mean Sq F value Pr(>F)
Residuals 64 1.355e-27 2.117e-29               

:::

Examining the process and underlying values

examine_process(simauto, "yield")
simulate_process(
   yield = function () 
   {
       plot_effects <- rnorm(32, 0, 7.5)
       air_temp_effects <- rnorm(2, 0, 4.7)
       y <- plot_effects[index_levels(plot)] + air_temp_effects[index_levels(air_temp)]
       rescale_values(y, lower = 0, upper = 10)
   } )
examine_process_values(simauto, "yield", sim = 1)
$air_temp_effects
[1] 3.487784 0.693595

$plot_effects
 [1]   2.63182298  -1.30910197  -4.43571353 -10.00520446  -8.22973876  15.27077707  -2.44867195   5.80503909   5.88754801   5.72434560   2.21106570  -9.39266943  -7.57127815   5.63543396  -9.81265134   3.95655073  -4.00154680  -2.98782011  -5.92177088  -1.72605852   6.57888631   3.40299883  -1.74348111   6.52504144  12.42002801  -0.04776697   3.52867090   2.08663987  -7.33427206  -6.94939607  14.39827847   6.60958341

$y
 [1]  6.1196073  6.1196073  6.1196073  2.1786824  2.1786824  2.1786824 -0.9479292 -0.9479292 -0.9479292 -6.5174201 -6.5174201 -6.5174201 -4.7419544 -4.7419544 -4.7419544 18.7585614 18.7585614 18.7585614  1.0391124  1.0391124  1.0391124  9.2928234  9.2928234  9.2928234  6.5811430  6.5811430  6.5811430  6.4179406  6.4179406  6.4179406  2.9046607  2.9046607  2.9046607 -8.6990744 -8.6990744 -8.6990744 -4.0834938 -4.0834938 -4.0834938  9.1232183  9.1232183  9.1232183 -6.3248670 -6.3248670 -6.3248670  7.4443351  7.4443351  7.4443351 -0.5137625 -0.5137625 -0.5137625  0.4999642  0.4999642  0.4999642 -2.4339866 -2.4339866 -2.4339866  1.7617258  1.7617258  1.7617258  7.2724813  7.2724813  7.2724813  4.0965938  4.0965938  4.0965938 -1.0498861 -1.0498861 -1.0498861  7.2186364  7.2186364  7.2186364 13.1136230 13.1136230 13.1136230  0.6458280  0.6458280  0.6458280  4.2222659  4.2222659  4.2222659  2.7802349  2.7802349  2.7802349 -6.6406771 -6.6406771 -6.6406771 -6.2558011 -6.2558011 -6.2558011
[91] 15.0918735 15.0918735 15.0918735  7.3031784  7.3031784  7.3031784

Key Takeaways

Summary

  • Much like in backcasting, simulating data is useful for experimental planning.
  • edibble R package encapsulates elements of experimental designs in a series of composable functions.
  • edibble::autofill_rcrds() provides a quick, lazy simulation for checking or revising the statistical analysis plan for free, if your experiment is designed using the edibble R package.
  • (Obviously don’t use it for any serious simulation studies).
  • For a more customisable option, you can use simulate_process() + simulate_rcrds() that include censoring and aggregation options that respects the experimental structure and expectations.

Thanks for listening!

These slides are available at emitanaka.org/slides/biometrics2023.

More details also in arXiv papers:

edibble R package is available on CRAN with the latest developments on GitHub at github.com/emitanaka/edibble

install.packages("edibble")
remotes::install_github("emitanaka/edibble") # latest development


emi.tanaka@anu.edu.au