We saw before that we can fit a model that borrows strength across sites for a more accurate prediction of genotype by site effects:
fit2 <- asreml(Yield ~ Site, random=~ us(Site):id(Geno), data=data1)
Assume that there are m genotypes across t sites (not all genotypes appear in each site). We can write the model more generally as
where we assume a separable variance structure for genotype-by-environment effects:
For some order k, you can replace the unstructured covariance with factor analytic form:
asreml
constrains such that the upper triangle of the loading matrix are zeroes. Trial, site and environment are used synonymously.
We expect that t > k.
Hui et al. (2018) Order Selection and Sparsity in Latent Variable Models via the Ordered Factor LASSO. Biometrics
So what to do?
Steps:
Good question!
The two are the same under certain case in fact (can you tell when?).
If trials are completely disconnected then between environment genetic covariance cannot be reliably computed.
The genetic regression residuals represent non-repeatable variety effects for the given the model and set of environments.
caige <- read.csv("tutorials/CAIGE2017.csv") %>% # change to your relative path mutate(Row=factor(Row), Column=factor(Column), Year=factor(Year)) skimr::skim(caige)
Skim summary statistics n obs: 2484 n variables: 6 ── Variable type:factor ───────────────────────────────────────────────────────────────────────────────────── variable missing complete n n_unique ordered Column 0 2484 2484 24 FALSE Geno 0 2484 2484 235 FALSE Row 0 2484 2484 32 FALSE Trial 0 2484 2484 7 FALSE Year 0 2484 2484 1 FALSE ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────── variable missing complete n mean sd p0 p25 p50 p75 p100 Yield 3 2481 2484 3.77 1.58 0.51 2.41 3.73 4.82 8.68 hist ▁▇▆▇▆▂▂▁
Source: CIMMYT Australia ICARDA Germplasm Evaluation (CAIGE) Project
Shiny App to explore the CAIGE data
Yield is measured in tonnes per hectre.
gg <- desplot(Yield ~ Column + Row | Trial, data=caige, main="", gg=T) + theme_bw()plotly::ggplotly(gg)
Get to know your data well before modelling.
Check for outliers and follow up with the data manager.
For this, we will assume the data is well behaved and there are no outliers to keep it simple, but of course, in practice that should not be the case.
tt <- table(caige$Geno, caige$Trial) head(tt, 2)
Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay G1 2 2 2 2 2 2 2 G10 2 2 2 2 3 2 2
prep <- function(x) 100 * round(prop.table(table(factor(x, levels=1:3))), 3)apply(tt, 2, prep)
Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay 1 44.1 48.4 32.8 21.4 38.9 35.9 44.8 2 54.0 51.6 66.8 77.8 59.3 64.1 55.2 3 1.9 0.0 0.4 0.9 1.8 0.0 0.0
The trials employ a partially replicated design.
Most genotype is replicated 1-2 times at each trial with small portion replicated 3 times.
tt <- tt > 0conc <- t(tt) %*% ttconc
Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay Balaklava 213 190 213 213 213 213 201 Horsham 190 190 190 190 190 190 190 Junee 213 190 229 229 221 229 201 Narrabri 213 190 229 234 221 233 201 Northstar 213 190 221 221 221 221 201 Roseworthy 213 190 229 233 221 234 201 Toodyay 201 190 201 201 201 201 201
Sometimes simply referred to as "connectivity".
fit_diag <- asreml(Yield ~ Trial, data=caige, trace=F, random=~at(Trial):Row + at(Trial):Column + diag(Trial):Geno, rcov=~at(Trial):ar1(Column):ar1(Row))fit_diag <- update(fit_diag)vc_diag <- lucid::vc(fit_diag) %>% rename(Gamma=effect)
Row
and random Column
effects along with separable autoregressive process of order one are sufficient.Balaklava
, Roseworthy
and Toodyay
exhibit concerns with low heritability. There could be more done (e.g. discussing with trial managers and experts) to address this point before proceeding with the FA model. sv_fa1 <- asreml(Yield ~ Trial, data=caige, random=~at(Trial):Row + at(Trial):Column + fa(Trial, 1):Geno, start.values=T, rcov=~at(Trial):ar1(Column):ar1(Row))# replacing some initial values from the DIAG modelsv_fa1$gammas.table <- sv_fa1$gammas.table %>% left_join(vc_diag, by="Gamma") %>% mutate(Value=ifelse(!is.na(component), component, Value)) %>% select(Gamma, Value, Constraint)
fit_fa1 <- asreml(Yield ~ Trial, data=caige, trace=F, random=~at(Trial):Row + at(Trial):Column + fa(Trial,1 ):Geno, rcov=~at(Trial):ar1(Column):ar1(Row), G.param=sv_fa1$gammas.table, R.param=sv_fa1$gammas.table)vc_fa1 <- lucid::vc(fit_fa1) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))
sum_fa1 <- myf::summary.fa(fit_fa1)sum_fa1$gammas[[1]]$`site %vaf`
fac_1 Balaklava 51.67980 Horsham 74.87380 Junee 70.12005 Narrabri 58.74877 Northstar 11.26292 Roseworthy 17.74837 Toodyay 55.19469
sum_fa1$gammas[[1]]$`total %vaf`
[1] 52.26278
sv_fa2 <- asreml(Yield ~ Trial, data=caige, random=~at(Trial):Row + at(Trial):Column + fa(Trial, 2):Geno, start.values=T, rcov=~at(Trial):ar1(Column):ar1(Row))# need to change fa(Trial, 1) to fa(Trial, 2)# so it will match upvc_fa1 <- vc_fa1 %>% mutate( Gamma=ifelse( grepl("fa(Trial, 1)", Gamma, fixed=T), gsub("fa(Trial, 1)", "fa(Trial, 2)", Gamma, fixed=T), Gamma))# replacing some initial values from the FA1 modelsv_fa2$gammas.table <- sv_fa2$gammas.table %>% left_join(vc_fa1, by="Gamma") %>% mutate(Value=ifelse(!is.na(component), component, Value)) %>% select(Gamma, Value, Constraint)
fit_fa2 <- asreml(Yield ~ Trial, data=caige, trace=F, random=~at(Trial):Row + at(Trial):Column + fa(Trial, 2):Geno, rcov=~at(Trial):ar1(Column):ar1(Row), G.param=sv_fa2$gammas.table, R.param=sv_fa2$gammas.table)fit_fa2 <- update(fit_fa2)vc_fa2 <- lucid::vc(fit_fa2) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))
sum_fa2 <- myf::summary.fa(fit_fa2)sum_fa2$gammas[[1]]$`site %vaf`
sum_fa2$gammas[[1]]$`total %vaf`
[1] 64.3899
sv_fa3 <- asreml(Yield ~ Trial, data=caige, random=~at(Trial):Row + at(Trial):Column + fa(Trial, 3):Geno, start.values=T, rcov=~at(Trial):ar1(Column):ar1(Row))# need to change fa(Trial, 2) to fa(Trial, 3)# so it will match upvc_fa2 <- vc_fa2 %>% mutate( Gamma=ifelse( grepl("fa(Trial, 2)", Gamma, fixed=T), gsub("fa(Trial, 2)", "fa(Trial, 3)", Gamma, fixed=T), Gamma))# replacing some initial values from the FA1 modelsv_fa3$gammas.table <- sv_fa3$gammas.table %>% left_join(vc_fa2, by="Gamma") %>% mutate(Value=ifelse(!is.na(component), component, Value)) %>% select(Gamma, Value, Constraint)
fit_fa3 <- asreml(Yield ~ Trial, data=caige, trace=F, random=~at(Trial):Row + at(Trial):Column + fa(Trial, 3):Geno, rcov=~at(Trial):ar1(Column):ar1(Row), G.param=sv_fa3$gammas.table, R.param=sv_fa3$gammas.table)vc_fa3 <- lucid::vc(fit_fa3) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))
sum_fa3 <- myf::summary.fa(fit_fa3)sum_fa3$gammas[[1]]$`site %vaf`
sum_fa3$gammas[[1]]$`total %vaf`
[1] 71.5777
sum_fa3 <- myf::summary.fa(fit_fa3, heatmap.ord="cluster", g.list=c("G124", "G148", "G8", "G152", "G158", "G183"))sum_fa3$heatmaps$`fa(Trial, 3):Geno`
agnes
clustering algorithm based on the estimated between environment genetic correlation matrix. sum_fa3$gammas[[1]]$Cmat
sum_fa3$regplots$`fa(Trial, 3):Geno`
ggplot
style)sum_fa3 <- myf::summary.fa(fit_fa3
To get the (regression) BLUP:
sum_fa3$blups[[1]]$blups.inmet
To get the (rotated) scores:
sum_fa3$blups[[1]]$scores.inmet
To get the rotated loadings:
sum_fa3$gammas[[1]]$`rotated loads`
"All models are wrong but some are useful." -George Box
These slides were made using the R package xaringan
with the ninja-themes
and is available at bit.ly/UT-WS-FAmodel
.
Download
For workshop participants, contact Emi for the tutorials.
day3-session02-FAmodel-tutorial.Rmd
here, open in RStudio, push the button "Run Document" on the top tab and work through the exercises.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |