class: split-60 title-slide2 with-border white
background-image: url("images/bg3.jpg")
background-size: cover

.column.shade_main[.content[
<br><br>

# Statistical Methods for Omics Assisted Breeding

## <b style='color:#FFEB3B'>Factor Analytic Model</b>

### <br>Emi Tanaka <br>emi.tanaka@sydney.edu.au <br>School of Mathematics and Statisitcs

### 2018/11/14 <br>
### <span style='font-size:14pt; color:pink!important'>These slides may take a while to render properly. You can find the pdf <a href="https://www.dropbox.com/s/j7tktazymnu9n67/day3-session02-FAmodel.pdf?dl=1">here</a>.</span> <br><br>
### <a rel='license' href='http://creativecommons.org/licenses/by-sa/4.0/'><img alt='Creative Commons License' style='border-width:0; width:30pt' src='images/cc.svg' /><img alt='Creative Commons License' style='border-width:0; width:20pt' src='images/by.svg' /><img alt='Creative Commons License' style='border-width:0; width:20pt' src='images/sa.svg' /></a><span style='font-size:10pt'>&nbsp;&nbsp;&nbsp;&nbsp;This work by <span xmlns:cc='http://creativecommons.org/ns#' property='cc:attributionName'>Emi Tanaka</span> is licensed under a <a rel='license' href='http://creativecommons.org/licenses/by-sa/4.0/'>Creative Commons Attribution-ShareAlike 4.0 International License</a>.</span>

]]
.column[.content[

]]
<img src="images/USydLogo-white.svg" style="position:absolute; top:80%; left:40%;width:200px">

---

.column.bg-main3[
]
.column.slide-in-right[
.sliderbox.bg-main2.vmiddle[
# .font_large[Theory]
]]

---

.column[.content[
# MET Model
We saw before that we can fit a model that borrows strength across sites for a more *accurate prediction* of genotype by site effects:

```r
fit2 <- asreml(Yield ~ Site, 
               random=~ us(Site):id(Geno),
               data=data1)
```
<center>
<img src="images/assum2v2.png" width="50%">
</center>
]]
.column.bg-blue[.content[

]]

---

.column[.content[
# The general MET Model
Assume that there are .indigo[*m* genotypes] across .indigo[*t* sites] (not all genotypes appear in each site). We can write the model more generally as
<img src="images/model2.png" width="100%"/>
where we assume a separable variance structure for genotype-by-environment effects:
<center>
<img src="images/ugxe.png" width="50%"/>
</center>
]]
.column.bg-blue[.content[
* **G**<sub>e</sub> may be assumed a general matrix such as unstructured matrix that is usually estimated from the data.
* **G**<sub>g</sub> is a known genotype relationship matrix. 
]]
---

.column[.content[
# Unstructured Covariance Model
<br>
<iframe src="html_extras/table5.html" width="100%" height="500!important" frameBorder="0"> </iframe>

]]
.column.bg-blue[.content[

* The number of parameters to be estimated grows .yellow[quadratically] with the number of trials so it quickly becomes too many parameters to estimate. 
* Recall covariances are symmetric so there is no need to estimate the parameters in the upper (or lower) triangle of covariance matrices.

<center>
<img src="images/usmodel.png" width="60%"/>
</center>
]]
---

.column[.content[
# Factor Analytic Model
For some .indigo[order *k*], you can replace the unstructured covariance with factor analytic form:
<br>
<center>
<img src="images/famodel.png" width="70%"/>
</center>
]]
.column.bg-blue[.content[
* Due to identifiability, some contraints are applied to the loading matrix. 
* `asreml` constrains such that the upper triangle of the loading matrix are zeroes. 
<center>
<img src="images/faconstraint.png" width="80%"/>
</center>
* So in effect there are <br> 
*(k + 1) t - k (k - 1) / 2* <br> 
parameters to estimate.
]]

---

.column[.content[
# The number of variance parameters to estimate 
<iframe src="html_extras/table7.html" width="100%" height="500!important" frameBorder="0"> </iframe>

]]
.column.bg-blue[.content[
* The number of variance parameters to estimate for FA model grows .yellow[linearly] with the number of trials.
* FA model can be considered a .yellow[lower order approximation] to the US model.
* As FA model is to offer a simpler model then it does not make sense to have more parameters to estimate in FA model than the US model.
]]

---

.column[.content[
# Condition to use FA model over US model
<center>
<img src="images/numpar.png" width="70%"/>
</center>
]]

.column.bg-blue[.content[
* *Trial*, *site* and *environment* are used synonymously.

* We expect that *t > k*. 
]]

---

.column[.content[
# Latent Variable Model
<br>
<center>
<img src="images/famodelassum.png" width="80%"/>
</center>
]]
.column.bg-blue[.content[
* FA Model is a special case of latent variable model when the responses are conditionally normally distributed.
* Note: our FA model is different to the standard FA model due to the separable structure of **G**<sub>ge</sub>.
]]

---

.column[.content[
# Latent Variable Model
<br>
<center>
<img src="images/famodelassum2.png" width="80%"/>
</center>
]]
.column.bg-blue[.content[
* Notice that this is like a linear regression model except the covariates are estimated from the data. 
* The loadings represent some latent environmental covariate. 
* The common factor represent how the genotype responds to that covariate. 
* The specific factor represent an effect specific to that environment.
]]

---

1. Pragmatically, you can use some threshold for overall percentage of between genetic variances explained by the *k* factors:
<center>
<img src="images/percfa.png" width="45%"/>
</center>
2. You can use a hypothesis testing approach or use of information criterion.
3. You can use OFAL penalty proposed in Hui et al. (2018).

.bottom_abs.width100.font_small[
Hui et al. (2018) Order Selection and Sparsity in Latent Variable Models via the Ordered Factor LASSO. *Biometrics*
]
]]
.column.bg-blue[.content[

* The first approach is akin to using coefficient of determination *R<sup>2</sup>* in linear regression.

]]

---

* Suppose that we have an orthogonal matrix **Q** of size *t*. 
* If we do not impose some constraint as per slide 5, then there are many possible solutions for factor loadings. 
* In fact, the rotated loadings <img src="images/QLam.png" height="24pt"/> and factor <img src="images/Qf.png" height="36pt"/> are also solutions.

So what to do?

* Many solutions. The approach we use in the practical session, we use an approach similar to principal components, in that:
    - the first rotated factor accounts for the maximum amount of estimated genetic covariance, 
    - the second accounts for the next largest amount of estimated genetic covariance and so on.

]]
.column.bg-blue[.content[
* A square matrix **Q** of size *t* is orthogonal if 
<center>
<img src="images/orthogonal.png" width="50%"/>
</center>
* If *k=1*, then there is no constraint applied and no rotation is necessary.

Steps: 
1. Constrain upper triangle of loading matrix to zero.
2. Rotate the estimated loading matrix. 
]]

---

Good question!

The two are the same under certain case in fact (can you tell when?).
]]
.row[.content[
.split-two[
.column[.content[
* Principal component analysis (PCA) is a transformation of the data.
* PCA transforms the variables to principal components. 
]]
.column[.content[
* Factor analytic model assumes that the data comes from a well-defined model where the underlying factors satisfy assumptions mentioned before. 
* The emphasis in factor analysis is that the factors map to the variables but specific factors are explicitly assumed to be "noise". 
]]
]
]]
]

]]
.column.bg-blue[.content[

]]

---

.column[.content[
# Between environment genetic correlation matrix
<center>
<img src="images/cov2cor.png" width="70%"/>
</center>

]]
.column.bg-blue[.content[
* Negative genetic correlation estimate indicate cross-over interaction.
* Positive genetic correlation estimate indicate noncross-over interaction.
* Estimating the between environment genetic covariance is dependent on the number of varieties in common between trials (which we refer to as *connectivity*). 
]]

---

* If trials are completely disconnected then between environment genetic covariance cannot be reliably computed.

* The genetic regression residuals represent non-repeatable variety effects for the given the model and set of environments.

### What set of trials for MET analysis?
* You would want a representative sample of environments, both in a geographic and seasonal sense, a relevant set of varieties and reasonable connectivity between pairs of trials.

]]
.column.bg-blue[.content[

]]

---

.column.bg-main3[
]
.column.slide-in-right[
.sliderbox.bg-main2.vmiddle[
# .font_large[Practice]
]]

---

```r
caige <- read.csv("tutorials/CAIGE2017.csv") %>% # change to your relative path
  mutate(Row=factor(Row), Column=factor(Column), Year=factor(Year)) 
skimr::skim(caige)
```

```
  Skim summary statistics
   n obs: 2484 
   n variables: 6 
  
  ── Variable type:factor ─────────────────────────────────────────────────────────────────────────────────────
   variable missing complete    n n_unique ordered
     Column       0     2484 2484       24   FALSE
       Geno       0     2484 2484      235   FALSE
        Row       0     2484 2484       32   FALSE
      Trial       0     2484 2484        7   FALSE
       Year       0     2484 2484        1   FALSE
  
  ── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────
   variable missing complete    n mean   sd   p0  p25  p50  p75 p100
      Yield       3     2481 2484 3.77 1.58 0.51 2.41 3.73 4.82 8.68
       hist
   ▁▇▆▇▆▂▂▁
```

]]
.column.bg-blue[.content[
.width100.font_small[
Source: [CIMMYT Australia ICARDA Germplasm Evaluation (CAIGE) Project](http://www.caigeproject.org.au/)<br><br><br><br><br><br>
[Shiny App to explore the CAIGE data](http://shiny.maths.usyd.edu.au/caige_explorer_2011_2017/)
<br><br><br><br><br><br>
Yield is measured in tonnes per hectre. 
]
]]

---

```r
gg <- desplot(Yield ~ Column + Row | Trial, data=caige,
              main="", gg=T) + theme_bw()
plotly::ggplotly(gg)
```
]]
.row[
.split-two[
.column[
<iframe src="plotly1.html" width="100%" height="500" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe>
]
.column.bg-white[
![](day3-session02-FAmodel_files/figure-html/unnamed-chunk-5-1.png)
]
]]]

]]
.column.bg-blue[.content.font_small[
Get to know your data well before modelling. 
<br><br><br><br>
Check for outliers and follow up with the data manager. 
<br><br><br><br>
For this, we will assume the data is well behaved and there are no outliers to keep it simple, but of course, in practice that should not be the case. 
]]

---

```r
tt <- table(caige$Geno, caige$Trial) 
head(tt, 2)
```

```
       
        Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay
    G1          2       2     2        2         2          2       2
    G10         2       2     2        2         3          2       2
```

```r
prep <- function(x) 100 * round(prop.table(table(factor(x, levels=1:3))), 3)
apply(tt, 2, prep) 
```

```
     
      Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay
    1      44.1    48.4  32.8     21.4      38.9       35.9    44.8
    2      54.0    51.6  66.8     77.8      59.3       64.1    55.2
    3       1.9     0.0   0.4      0.9       1.8        0.0     0.0
```

]]
.column.bg-blue[.content.font_small[
The trials employ a partially replicated design.
<br><br><br><br>
Most genotype is replicated 1-2 times at each trial with small portion replicated 3 times. 
]]

---

```r
tt <- tt > 0
conc <- t(tt) %*% tt
conc
```

```
            
             Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay
  Balaklava        213     190   213      213       213        213     201
  Horsham          190     190   190      190       190        190     190
  Junee            213     190   229      229       221        229     201
  Narrabri         213     190   229      234       221        233     201
  Northstar        213     190   221      221       221        221     201
  Roseworthy       213     190   229      233       221        234     201
  Toodyay          201     190   201      201       201        201     201
```
Sometimes simply referred to as "connectivity". 
]]
.column.bg-blue[.content.font_small[

]]

---

![](day3-session02-FAmodel_files/figure-html/unnamed-chunk-8-1.png)

]]
.column.bg-blue[.content.font_sm80[

* Diagonal entries show the number of genotypes at the corresponding trial. 
* Off diagonal entries show the number of genotypes common between the trials shown in the row and column labels. 
* The connectivity between all pairs of trial is good so we should not have much problem estimating the genetic covariance between trials.

]]

---

```r
fit_diag <- asreml(Yield ~ Trial, data=caige, trace=F,
 random=~at(Trial):Row + at(Trial):Column +
   diag(Trial):Geno,
 rcov=~at(Trial):ar1(Column):ar1(Row))
fit_diag <- update(fit_diag)
vc_diag <- lucid::vc(fit_diag) %>% rename(Gamma=effect)
```
<iframe src="html_extras/table8.html" width="100%" height="300!important" frameBorder="0"> </iframe>

]]
.column.bg-blue.font_sm80[.content[

* In the first instance, we will normally do spatial modelling however we have not here and assume that the addition of random `Row` and random `Column` effects along with separable autoregressive process of order one are sufficient.
* `Balaklava`, `Roseworthy` and `Toodyay` exhibit concerns with low heritability. There could be more done (e.g. discussing with trial managers and experts) to address this point before proceeding with the FA model. 
* For simplicity, we shall assume there is no concern and proceed on.
]]

---

.column[.content[
# Modelling: Initialising the FA1 model
* Recall bad starting values in Newton-Raphson can make convergence go stray.
* When fitting complex model, we will build it up from simpler ones using the fit of the simpler models as initial values for complex ones.

```r
sv_fa1 <- asreml(Yield ~ Trial, data=caige,
 random=~at(Trial):Row + at(Trial):Column +
*  fa(Trial, 1):Geno, start.values=T,
 rcov=~at(Trial):ar1(Column):ar1(Row))

# replacing some initial values from the DIAG model
sv_fa1$gammas.table <- sv_fa1$gammas.table %>% 
  left_join(vc_diag, by="Gamma") %>%
  mutate(Value=ifelse(!is.na(component), component, Value)) %>% 
  select(Gamma, Value, Constraint)
```

]]
.column.bg-blue.font_sm80[.content[

* This step is not absolutely critical but can be helpful in particular with larger data, as starting from a value closer to the solution, there is likely less iteration needed. 
]]

---

```r
fit_fa1 <- asreml(Yield ~ Trial, data=caige, trace=F,
 random=~at(Trial):Row + at(Trial):Column +
   fa(Trial,1 ):Geno,
 rcov=~at(Trial):ar1(Column):ar1(Row),
 G.param=sv_fa1$gammas.table, R.param=sv_fa1$gammas.table)
vc_fa1 <- lucid::vc(fit_fa1) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))
```
<iframe src="html_extras/table9.html" width="100%" height="300!important" frameBorder="0"> </iframe>
]]
.column.bg-blue[.content[
### Percentage between genetic variance explained

```r
sum_fa1 <- myf::summary.fa(fit_fa1)
sum_fa1$gammas[[1]]$`site %vaf`
```

```
                fac_1
  Balaklava  51.67980
  Horsham    74.87380
  Junee      70.12005
  Narrabri   58.74877
  Northstar  11.26292
  Roseworthy 17.74837
  Toodyay    55.19469
```

```r
sum_fa1$gammas[[1]]$`total %vaf`
```

```
  [1] 52.26278
```

]]

---

```r
sv_fa2 <- asreml(Yield ~ Trial, data=caige,
 random=~at(Trial):Row + at(Trial):Column +
*  fa(Trial, 2):Geno, start.values=T,
 rcov=~at(Trial):ar1(Column):ar1(Row))

# need to change fa(Trial, 1) to fa(Trial, 2)
# so it will match up
vc_fa1 <- vc_fa1 %>% mutate(
  Gamma=ifelse(
    grepl("fa(Trial, 1)", Gamma, fixed=T), 
    gsub("fa(Trial, 1)", "fa(Trial, 2)", Gamma, fixed=T),
    Gamma))
# replacing some initial values from the FA1 model
sv_fa2$gammas.table <- sv_fa2$gammas.table %>% 
  left_join(vc_fa1, by="Gamma") %>%
  mutate(Value=ifelse(!is.na(component), component, Value)) %>% 
  select(Gamma, Value, Constraint)
```
]]
.column.bg-blue[.content.font_sm80[
* As the first factor only explained 52% of the overall between trial genetic variance, we proceed to increase the order of the FA model. 
* The threshold is arbitrary but you may use .yellow[80%].
* Recall for 7 trials, the .yellow[maximum order of the FA model] we allow for .yellow[is 3]. 
]]

---

```r
fit_fa2 <- asreml(Yield ~ Trial, data=caige, trace=F,
 random=~at(Trial):Row + at(Trial):Column +
   fa(Trial, 2):Geno,
 rcov=~at(Trial):ar1(Column):ar1(Row),
 G.param=sv_fa2$gammas.table, R.param=sv_fa2$gammas.table)
fit_fa2 <- update(fit_fa2)
vc_fa2 <- lucid::vc(fit_fa2) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))
```
<iframe src="html_extras/table10.html" width="100%" height="300!important" frameBorder="0"> </iframe>
]]
.column.bg-blue[.content[
### Percentage between genetic variance explained

```r
sum_fa2 <- myf::summary.fa(fit_fa2)
sum_fa2$gammas[[1]]$`site %vaf`
```
<iframe src="html_extras/table11.html" width="100%" height="200!important" frameBorder="0"> </iframe>

```r
sum_fa2$gammas[[1]]$`total %vaf`
```

```
  [1] 64.3899
```

]]

---

```r
sv_fa3 <- asreml(Yield ~ Trial, data=caige,
 random=~at(Trial):Row + at(Trial):Column +
*  fa(Trial, 3):Geno, start.values=T,
 rcov=~at(Trial):ar1(Column):ar1(Row))

# need to change fa(Trial, 2) to fa(Trial, 3)
# so it will match up
vc_fa2 <- vc_fa2 %>% mutate(
  Gamma=ifelse(
    grepl("fa(Trial, 2)", Gamma, fixed=T), 
    gsub("fa(Trial, 2)", "fa(Trial, 3)", Gamma, fixed=T),
    Gamma))
# replacing some initial values from the FA1 model
sv_fa3$gammas.table <- sv_fa3$gammas.table %>% 
  left_join(vc_fa2, by="Gamma") %>%
  mutate(Value=ifelse(!is.na(component), component, Value)) %>% 
  select(Gamma, Value, Constraint)
```
]]
.column.bg-blue[.content.font_sm80[
* The two factors explain 64% of the overall between trial genetic variance.
* We increase the order of the FA model to the final possible one. 
* My coding is quite clumsy for such a repetitive task. I recommend you make into a function for many repetitive task such as this. 
]]

---

```r
fit_fa3 <- asreml(Yield ~ Trial, data=caige, trace=F,
 random=~at(Trial):Row + at(Trial):Column +
   fa(Trial, 3):Geno,
 rcov=~at(Trial):ar1(Column):ar1(Row),
 G.param=sv_fa3$gammas.table, R.param=sv_fa3$gammas.table)
vc_fa3 <- lucid::vc(fit_fa3) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))
```
<iframe src="html_extras/table12.html" width="100%" height="300!important" frameBorder="0"> </iframe>
]]
.column.bg-blue[.content[
### Percentage between genetic variance explained

```r
sum_fa3 <- myf::summary.fa(fit_fa3)
sum_fa3$gammas[[1]]$`site %vaf`
```
<iframe src="html_extras/table13.html" width="100%" height="200!important" frameBorder="0"> </iframe>

```r
sum_fa3$gammas[[1]]$`total %vaf`
```

```
  [1] 71.5777
```

]]

---

```r
sum_fa3 <- myf::summary.fa(fit_fa3, heatmap.ord="cluster", g.list=c("G124", "G148", "G8", "G152", "G158", "G183"))
sum_fa3$heatmaps$`fa(Trial, 3):Geno`
```

![](day3-session02-FAmodel_files/figure-html/unnamed-chunk-24-1.png)
]]
.column.bg-blue[.content.font_sm80[

* The trials are ordred by `agnes` clustering algorithm based on the estimated between environment genetic correlation matrix.

```r
sum_fa3$gammas[[1]]$Cmat 
```
<iframe src="html_extras/table14.html" width="100%" height="350!important" frameBorder="0"> </iframe>
]]

---

```r
sum_fa3$regplots$`fa(Trial, 3):Geno`
```

![](day3-session02-FAmodel_files/figure-html/unnamed-chunk-26-1.png)

]]
.column.bg-blue[.content[

]]

---

.column[.content[
# Latent Regression Plots (`ggplot` style)
![](day3-session02-FAmodel_files/figure-html/unnamed-chunk-27-1.png)

]]
.column.bg-blue[.content[

```r
sum_fa3 <- myf::summary.fa(fit_fa3
```
To get the (regression) BLUP:

```r
sum_fa3$blups[[1]]$blups.inmet
```
To get the (rotated) scores:

```r
sum_fa3$blups[[1]]$scores.inmet
```
To get the rotated loadings:

```r
sum_fa3$gammas[[1]]$`rotated loads`
```
]]

---

]]
.column.bg-blue[.content[

]]

---

![](day3-session02-FAmodel_files/figure-html/unnamed-chunk-32-1.png)

]]
.column.bg-blue[.content.font_sm80[
* Your selection decision would be very different if chosen by a simple average. 
* In theory, modelling approaches should allow to delineate other sources of variation and provide more accurate and meaningful result for the aim of selection. 
]]

---

.column.bg-yellow[.content.font_sm80[
# Key References

* Smith et al. (2015) Factor analytic mixed models for the provision of grower information from national crop variety testing programs. *Theoretical and Applied Genetics* **128**(1) 55-72
* Cullis et al. (2014) Factor analytic and reduced animal models for the investigation of additive genotype-by-environment interaction in outcrossing plant species with application to a *Pinus radiata* breeding programme. *Theoretical and Applied Genetics* **127**(10) 2193-2210
* Mardia et al. (1995) Multivariate Analysis. Chapter 9.

]]
.column.bg-main1.white[.content[
# Notes

* Remember:
.center[
"All models are wrong but some are useful."
<span style="float:right">-George Box</span>
]
* Statistics can never compensate for poor experimental design or bad data. 
* Consulting with key experts (e.g. agronomists, trial managers, etc) if the model output makes sense.

]]

---

class: split-40 title-slide2 with-border white
background-image: url("images/bg3.jpg")
background-size: cover

]]
.column.shade_main[.content[
<br><br>

# <u>Slides</u>

These slides were made using the R package [`xaringan`](https://github.com/yihui/xaringan) with the [`ninja-themes`](https://github.com/emitanaka/ninja-theme) and is available at [`bit.ly/UT-WS-FAmodel`](http://bit.ly/UT-WS-FAmodel).

# <u>Your Turn</u>

<s>Download `day3-session02-FAmodel-tutorial.Rmd` here, open in RStudio, push the button "Run Document" on the top tab and work through the exercises.</s> 
For workshop participants, contact Emi for the tutorials.
<br><br>

### <a rel='license' href='http://creativecommons.org/licenses/by-sa/4.0/'><img alt='Creative Commons License' style='border-width:0; width:30pt' src='images/cc.svg' /><img alt='Creative Commons License' style='border-width:0; width:20pt' src='images/by.svg' /><img alt='Creative Commons License' style='border-width:0; width:20pt' src='images/sa.svg' /></a><span style='font-size:10pt'>&nbsp;&nbsp;&nbsp;&nbsp;This work by <span xmlns:cc='http://creativecommons.org/ns#' property='cc:attributionName'>Emi Tanaka</span> is licensed under a <a rel='license' href='http://creativecommons.org/licenses/by-sa/4.0/'>Creative Commons Attribution-ShareAlike 4.0 International License</a>.</span>

]]

Notes for current slide

Notes for next slide

Statistical Methods for Omics Assisted Breeding

Factor Analytic Model

Emi Tanaka
emi.tanaka@sydney.edu.au
School of Mathematics and Statisitcs

2018/11/14

These slides may take a while to render properly. You can find the pdf here.

This work by Emi Tanaka is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

1 / 35

Theory
2 / 35

MET Model

We saw before that we can fit a model that borrows strength across sites for a more accurate prediction of genotype by site effects:

fit2 <- asreml(Yield ~ Site, 
               random=~ us(Site):id(Geno),
               data=data1)

3 / 35

The general MET Model

Assume that there are m genotypes across t sites (not all genotypes appear in each site). We can write the model more generally as where we assume a separable variance structure for genotype-by-environment effects:

G_e may be assumed a general matrix such as unstructured matrix that is usually estimated from the data.
G_g is a known genotype relationship matrix.

4 / 35

Unstructured Covariance Model

The number of parameters to be estimated grows quadratically with the number of trials so it quickly becomes too many parameters to estimate.
Recall covariances are symmetric so there is no need to estimate the parameters in the upper (or lower) triangle of covariance matrices.

5 / 35

Factor Analytic Model

For some order k, you can replace the unstructured covariance with factor analytic form:

Due to identifiability, some contraints are applied to the loading matrix.
asreml constrains such that the upper triangle of the loading matrix are zeroes.
So in effect there are
(k + 1) t - k (k - 1) / 2
parameters to estimate.

6 / 35

The number of variance parameters to estimate
 

The number of variance parameters to estimate for FA model grows linearly with the number of trials.
FA model can be considered a lower order approximation to the US model.
As FA model is to offer a simpler model then it does not make sense to have more parameters to estimate in FA model than the US model.

7 / 35

Condition to use FA model over US model

Trial, site and environment are used synonymously.
We expect that t > k.

8 / 35

Latent Variable Model

FA Model is a special case of latent variable model when the responses are conditionally normally distributed.
Note: our FA model is different to the standard FA model due to the separable structure of G_ge.

9 / 35

Latent Variable Model

Notice that this is like a linear regression model except the covariates are estimated from the data.
The loadings represent some latent environmental covariate.
The common factor represent how the genotype responds to that covariate.
The specific factor represent an effect specific to that environment.

10 / 35

How to choose the order, k, of FA model?

Pragmatically, you can use some threshold for overall percentage of between genetic variances explained by the k factors:
You can use a hypothesis testing approach or use of information criterion.
You can use OFAL penalty proposed in Hui et al. (2018).

Hui et al. (2018) Order Selection and Sparsity in Latent Variable Models via the Ordered Factor LASSO. Biometrics

The first approach is akin to using coefficient of determination R² in linear regression.

11 / 35

Non-uniqueness of factor loadings

Suppose that we have an orthogonal matrix Q of size t.
If we do not impose some constraint as per slide 5, then there are many possible solutions for factor loadings.
In fact, the rotated loadings and factor are also solutions.

So what to do?

Many solutions. The approach we use in the practical session, we use an approach similar to principal components, in that:
- the first rotated factor accounts for the maximum amount of estimated genetic covariance,
- the second accounts for the next largest amount of estimated genetic covariance and so on.

A square matrix Q of size t is orthogonal if
If k=1, then there is no constraint applied and no rotation is necessary.

Steps:

Constrain upper triangle of loading matrix to zero.
Rotate the estimated loading matrix.

12 / 35

How is this different principal component analysis?

Good question!

The two are the same under certain case in fact (can you tell when?).

Principal component analysis (PCA) is a transformation of the data.
PCA transforms the variables to principal components.

Factor analytic model assumes that the data comes from a well-defined model where the underlying factors satisfy assumptions mentioned before.
The emphasis in factor analysis is that the factors map to the variables but specific factors are explicitly assumed to be "noise".

13 / 35

Between environment genetic correlation matrix

Negative genetic correlation estimate indicate cross-over interaction.
Positive genetic correlation estimate indicate noncross-over interaction.
Estimating the between environment genetic covariance is dependent on the number of varieties in common between trials (which we refer to as connectivity).

14 / 35

Notes on FA model

If trials are completely disconnected then between environment genetic covariance cannot be reliably computed.
The genetic regression residuals represent non-repeatable variety effects for the given the model and set of environments.

What set of trials for MET analysis?

You would want a representative sample of environments, both in a geographic and seasonal sense, a relevant set of varieties and reasonable connectivity between pairs of trials.

15 / 35

Practice
16 / 35

Example: CAIGE 2017 MET Bread Wheat Yield Trials

caige <- read.csv("tutorials/CAIGE2017.csv") %>% # change to your relative path
  mutate(Row=factor(Row), Column=factor(Column), Year=factor(Year)) 
skimr::skim(caige)

  Skim summary statistics
   n obs: 2484 
   n variables: 6 
  ── Variable type:factor ─────────────────────────────────────────────────────────────────────────────────────
   variable missing complete    n n_unique ordered
     Column       0     2484 2484       24   FALSE
       Geno       0     2484 2484      235   FALSE
        Row       0     2484 2484       32   FALSE
      Trial       0     2484 2484        7   FALSE
       Year       0     2484 2484        1   FALSE
  ── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────
   variable missing complete    n mean   sd   p0  p25  p50  p75 p100
      Yield       3     2481 2484 3.77 1.58 0.51 2.41 3.73 4.82 8.68
       hist
   ▁▇▆▇▆▂▂▁

Source: CIMMYT Australia ICARDA Germplasm Evaluation (CAIGE) Project

Shiny App to explore the CAIGE data

Yield is measured in tonnes per hectre.

17 / 35

Exploratory Data Analysis: Yield

gg <- desplot(Yield ~ Column + Row | Trial, data=caige,
              main="", gg=T) + theme_bw()
plotly::ggplotly(gg)

Get to know your data well before modelling.

Check for outliers and follow up with the data manager.

For this, we will assume the data is well behaved and there are no outliers to keep it simple, but of course, in practice that should not be the case.

18 / 35

EDA: Genotype Replication per Trial

tt <- table(caige$Geno, caige$Trial) 
head(tt, 2)


        Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay
    G1          2       2     2        2         2          2       2
    G10         2       2     2        2         3          2       2

prep <- function(x) 100 * round(prop.table(table(factor(x, levels=1:3))), 3)
apply(tt, 2, prep)


      Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay
    1      44.1    48.4  32.8     21.4      38.9       35.9    44.8
    2      54.0    51.6  66.8     77.8      59.3       64.1    55.2
    3       1.9     0.0   0.4      0.9       1.8        0.0     0.0

The trials employ a partially replicated design.

Most genotype is replicated 1-2 times at each trial with small portion replicated 3 times.

19 / 35

EDA: Genotype Concurrence Matrix

tt <- tt > 0
conc <- t(tt) %*% tt
conc


             Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay
  Balaklava        213     190   213      213       213        213     201
  Horsham          190     190   190      190       190        190     190
  Junee            213     190   229      229       221        229     201
  Narrabri         213     190   229      234       221        233     201
  Northstar        213     190   221      221       221        221     201
  Roseworthy       213     190   229      233       221        234     201
  Toodyay          201     190   201      201       201        201     201

Sometimes simply referred to as "connectivity".

20 / 35

EDA: Genotype Concurrence Matrix (Prettier)

Diagonal entries show the number of genotypes at the corresponding trial.
Off diagonal entries show the number of genotypes common between the trials shown in the row and column labels.
The connectivity between all pairs of trial is good so we should not have much problem estimating the genetic covariance between trials.

21 / 35

Modelling: DIAG model
fit_diag <- asreml(Yield ~ Trial, data=caige, trace=F,
 random=~at(Trial):Row + at(Trial):Column +
   diag(Trial):Geno,
 rcov=~at(Trial):ar1(Column):ar1(Row))
fit_diag <- update(fit_diag)
vc_diag <- lucid::vc(fit_diag) %>% rename(Gamma=effect)

 

In the first instance, we will normally do spatial modelling however we have not here and assume that the addition of random Row and random Column effects along with separable autoregressive process of order one are sufficient.
Balaklava, Roseworthy and Toodyay exhibit concerns with low heritability. There could be more done (e.g. discussing with trial managers and experts) to address this point before proceeding with the FA model. 
For simplicity, we shall assume there is no concern and proceed on.

22 / 35

Modelling: Initialising the FA1 model
Recall bad starting values in Newton-Raphson can make convergence go stray.
When fitting complex model, we will build it up from simpler ones using the fit of the simpler models as initial values for complex ones. 

sv_fa1 <- asreml(Yield ~ Trial, data=caige,
 random=~at(Trial):Row + at(Trial):Column +
   fa(Trial, 1):Geno, start.values=T,
 rcov=~at(Trial):ar1(Column):ar1(Row))
# replacing some initial values from the DIAG model
sv_fa1$gammas.table <- sv_fa1$gammas.table %>% 
  left_join(vc_diag, by="Gamma") %>%
  mutate(Value=ifelse(!is.na(component), component, Value)) %>% 
  select(Gamma, Value, Constraint)

This step is not absolutely critical but can be helpful in particular with larger data, as starting from a value closer to the solution, there is likely less iteration needed. 

23 / 35

Modelling: FA1 model
fit_fa1 <- asreml(Yield ~ Trial, data=caige, trace=F,
 random=~at(Trial):Row + at(Trial):Column +
   fa(Trial,1 ):Geno,
 rcov=~at(Trial):ar1(Column):ar1(Row),
 G.param=sv_fa1$gammas.table, R.param=sv_fa1$gammas.table)
vc_fa1 <- lucid::vc(fit_fa1) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))

 
Percentage between genetic variance explained
sum_fa1 <- myf::summary.fa(fit_fa1)
sum_fa1$gammas[[1]]$`site %vaf`

                fac_1
  Balaklava  51.67980
  Horsham    74.87380
  Junee      70.12005
  Narrabri   58.74877
  Northstar  11.26292
  Roseworthy 17.74837
  Toodyay    55.19469
sum_fa1$gammas[[1]]$`total %vaf`

  [1] 52.26278
24 / 35

Modelling: initialising FA2 model
sv_fa2 <- asreml(Yield ~ Trial, data=caige,
 random=~at(Trial):Row + at(Trial):Column +
   fa(Trial, 2):Geno, start.values=T,
 rcov=~at(Trial):ar1(Column):ar1(Row))
# need to change fa(Trial, 1) to fa(Trial, 2)
# so it will match up
vc_fa1 <- vc_fa1 %>% mutate(
  Gamma=ifelse(
    grepl("fa(Trial, 1)", Gamma, fixed=T), 
    gsub("fa(Trial, 1)", "fa(Trial, 2)", Gamma, fixed=T),
    Gamma))
# replacing some initial values from the FA1 model
sv_fa2$gammas.table <- sv_fa2$gammas.table %>% 
  left_join(vc_fa1, by="Gamma") %>%
  mutate(Value=ifelse(!is.na(component), component, Value)) %>% 
  select(Gamma, Value, Constraint)

As the first factor only explained 52% of the overall between trial genetic variance, we proceed to increase the order of the FA model. 
The threshold is arbitrary but you may use 80%.
Recall for 7 trials, the maximum order of the FA model we allow for is 3. 

25 / 35

Modelling: FA2 model
fit_fa2 <- asreml(Yield ~ Trial, data=caige, trace=F,
 random=~at(Trial):Row + at(Trial):Column +
   fa(Trial, 2):Geno,
 rcov=~at(Trial):ar1(Column):ar1(Row),
 G.param=sv_fa2$gammas.table, R.param=sv_fa2$gammas.table)
fit_fa2 <- update(fit_fa2)
vc_fa2 <- lucid::vc(fit_fa2) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))

 
Percentage between genetic variance explained
sum_fa2 <- myf::summary.fa(fit_fa2)
sum_fa2$gammas[[1]]$`site %vaf`

 


sum_fa2$gammas[[1]]$`total %vaf`

  [1] 64.3899
26 / 35

Modelling: initialising FA3 model
sv_fa3 <- asreml(Yield ~ Trial, data=caige,
 random=~at(Trial):Row + at(Trial):Column +
   fa(Trial, 3):Geno, start.values=T,
 rcov=~at(Trial):ar1(Column):ar1(Row))
# need to change fa(Trial, 2) to fa(Trial, 3)
# so it will match up
vc_fa2 <- vc_fa2 %>% mutate(
  Gamma=ifelse(
    grepl("fa(Trial, 2)", Gamma, fixed=T), 
    gsub("fa(Trial, 2)", "fa(Trial, 3)", Gamma, fixed=T),
    Gamma))
# replacing some initial values from the FA1 model
sv_fa3$gammas.table <- sv_fa3$gammas.table %>% 
  left_join(vc_fa2, by="Gamma") %>%
  mutate(Value=ifelse(!is.na(component), component, Value)) %>% 
  select(Gamma, Value, Constraint)

The two factors explain 64% of the overall between trial genetic variance.
We increase the order of the FA model to the final possible one. 
My coding is quite clumsy for such a repetitive task. I recommend you make into a function for many repetitive task such as this. 

27 / 35

Modelling: FA3 model
fit_fa3 <- asreml(Yield ~ Trial, data=caige, trace=F,
 random=~at(Trial):Row + at(Trial):Column +
   fa(Trial, 3):Geno,
 rcov=~at(Trial):ar1(Column):ar1(Row),
 G.param=sv_fa3$gammas.table, R.param=sv_fa3$gammas.table)
vc_fa3 <- lucid::vc(fit_fa3) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))

 
Percentage between genetic variance explained
sum_fa3 <- myf::summary.fa(fit_fa3)
sum_fa3$gammas[[1]]$`site %vaf`

 


sum_fa3$gammas[[1]]$`total %vaf`

  [1] 71.5777
28 / 35

Between Environment Genetic Correlation

sum_fa3 <- myf::summary.fa(fit_fa3, heatmap.ord="cluster", g.list=c("G124", "G148", "G8", "G152", "G158", "G183"))
sum_fa3$heatmaps$`fa(Trial, 3):Geno`

The trials are ordred by agnes clustering algorithm based on the estimated between environment genetic correlation matrix.

sum_fa3$gammas[[1]]$Cmat

29 / 35

Latent Regression Plots

sum_fa3$regplots$`fa(Trial, 3):Geno`

30 / 35

Latent Regression Plots (`ggplot` style)

sum_fa3 <- myf::summary.fa(fit_fa3

To get the (regression) BLUP:

sum_fa3$blups[[1]]$blups.inmet

To get the (rotated) scores:

sum_fa3$blups[[1]]$scores.inmet

To get the rotated loadings:

sum_fa3$gammas[[1]]$`rotated loads`

31 / 35

Top 3 and Bottom 3 Genotypes
 
32 / 35

Raw Yield Mean vs. Regression BLUP

Your selection decision would be very different if chosen by a simple average.
In theory, modelling approaches should allow to delineate other sources of variation and provide more accurate and meaningful result for the aim of selection.

33 / 35

Key References

Smith et al. (2015) Factor analytic mixed models for the provision of grower information from national crop variety testing programs. Theoretical and Applied Genetics 128(1) 55-72
Cullis et al. (2014) Factor analytic and reduced animal models for the investigation of additive genotype-by-environment interaction in outcrossing plant species with application to a Pinus radiata breeding programme. Theoretical and Applied Genetics 127(10) 2193-2210
Mardia et al. (1995) Multivariate Analysis. Chapter 9.

Notes

Remember:
"All models are wrong but some are useful." -George Box
Statistics can never compensate for poor experimental design or bad data.
Consulting with key experts (e.g. agronomists, trial managers, etc) if the model output makes sense.

34 / 35

Slides

These slides were made using the R package xaringan with the ninja-themes and is available at bit.ly/UT-WS-FAmodel.

Your Turn

~~Download day3-session02-FAmodel-tutorial.Rmd here, open in RStudio, push the button "Run Document" on the top tab and work through the exercises.~~ For workshop participants, contact Emi for the tutorials.

This work by Emi Tanaka is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

35 / 35

Theory
2 / 35

Paused

Help

Keyboard shortcuts

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Esc	Back to slideshow

Statistical Methods for Omics Assisted Breeding

Factor Analytic Model

Emi Tanaka emi.tanaka@sydney.edu.au School of Mathematics and Statisitcs

2018/11/14

These slides may take a while to render properly. You can find the pdf here.

This work by Emi Tanaka is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Theory

MET Model

The general MET Model

Unstructured Covariance Model

Factor Analytic Model

The number of variance parameters to estimate

Condition to use FA model over US model

Latent Variable Model

Latent Variable Model

How to choose the order, k, of FA model?

Non-uniqueness of factor loadings

How is this different principal component analysis?

Between environment genetic correlation matrix

Notes on FA model

What set of trials for MET analysis?

Practice

Example: CAIGE 2017 MET Bread Wheat Yield Trials

Exploratory Data Analysis: Yield

EDA: Genotype Replication per Trial

EDA: Genotype Concurrence Matrix

EDA: Genotype Concurrence Matrix (Prettier)

Modelling: DIAG model

Modelling: Initialising the FA1 model

Modelling: FA1 model

Percentage between genetic variance explained

Modelling: initialising FA2 model

Modelling: FA2 model

Percentage between genetic variance explained

Modelling: initialising FA3 model

Modelling: FA3 model

Percentage between genetic variance explained

Between Environment Genetic Correlation

Latent Regression Plots

Latent Regression Plots (ggplot style)

Top 3 and Bottom 3 Genotypes

Raw Yield Mean vs. Regression BLUP

Key References

Notes

Slides

Your Turn

This work by Emi Tanaka is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Theory

Help

Emi Tanaka
emi.tanaka@sydney.edu.au
School of Mathematics and Statisitcs

Latent Regression Plots (`ggplot` style)