+ - 0:00:00
Notes for current slide
Notes for next slide



Statistical Methods for Omics Assisted Breeding

Factor Analytic Model


Emi Tanaka
emi.tanaka@sydney.edu.au
School of Mathematics and Statisitcs

2018/11/14

These slides may take a while to render properly. You can find the pdf here.

Creative Commons LicenseCreative Commons LicenseCreative Commons License    This work by Emi Tanaka is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

1 / 35

Theory

2 / 35

MET Model

We saw before that we can fit a model that borrows strength across sites for a more accurate prediction of genotype by site effects:

fit2 <- asreml(Yield ~ Site,
random=~ us(Site):id(Geno),
data=data1)
3 / 35

The general MET Model

Assume that there are m genotypes across t sites (not all genotypes appear in each site). We can write the model more generally as where we assume a separable variance structure for genotype-by-environment effects:

  • Ge may be assumed a general matrix such as unstructured matrix that is usually estimated from the data.
  • Gg is a known genotype relationship matrix.
4 / 35

Unstructured Covariance Model


  • The number of parameters to be estimated grows quadratically with the number of trials so it quickly becomes too many parameters to estimate.
  • Recall covariances are symmetric so there is no need to estimate the parameters in the upper (or lower) triangle of covariance matrices.
5 / 35

Factor Analytic Model

For some order k, you can replace the unstructured covariance with factor analytic form:

  • Due to identifiability, some contraints are applied to the loading matrix.
  • asreml constrains such that the upper triangle of the loading matrix are zeroes.
  • So in effect there are
    (k + 1) t - k (k - 1) / 2
    parameters to estimate.
6 / 35

The number of variance parameters to estimate

  • The number of variance parameters to estimate for FA model grows linearly with the number of trials.
  • FA model can be considered a lower order approximation to the US model.
  • As FA model is to offer a simpler model then it does not make sense to have more parameters to estimate in FA model than the US model.
7 / 35

Condition to use FA model over US model

  • Trial, site and environment are used synonymously.

  • We expect that t > k.

8 / 35

Latent Variable Model


  • FA Model is a special case of latent variable model when the responses are conditionally normally distributed.
  • Note: our FA model is different to the standard FA model due to the separable structure of Gge.
9 / 35

Latent Variable Model


  • Notice that this is like a linear regression model except the covariates are estimated from the data.
  • The loadings represent some latent environmental covariate.
  • The common factor represent how the genotype responds to that covariate.
  • The specific factor represent an effect specific to that environment.
10 / 35

How to choose the order, k, of FA model?

  1. Pragmatically, you can use some threshold for overall percentage of between genetic variances explained by the k factors:
  2. You can use a hypothesis testing approach or use of information criterion.
  3. You can use OFAL penalty proposed in Hui et al. (2018).

Hui et al. (2018) Order Selection and Sparsity in Latent Variable Models via the Ordered Factor LASSO. Biometrics

  • The first approach is akin to using coefficient of determination R2 in linear regression.
11 / 35

Non-uniqueness of factor loadings

  • Suppose that we have an orthogonal matrix Q of size t.
  • If we do not impose some constraint as per slide 5, then there are many possible solutions for factor loadings.
  • In fact, the rotated loadings and factor are also solutions.

So what to do?

  • Many solutions. The approach we use in the practical session, we use an approach similar to principal components, in that:
    • the first rotated factor accounts for the maximum amount of estimated genetic covariance,
    • the second accounts for the next largest amount of estimated genetic covariance and so on.
  • A square matrix Q of size t is orthogonal if
  • If k=1, then there is no constraint applied and no rotation is necessary.

Steps:

  1. Constrain upper triangle of loading matrix to zero.
  2. Rotate the estimated loading matrix.
12 / 35

How is this different principal component analysis?

Good question!

The two are the same under certain case in fact (can you tell when?).

  • Principal component analysis (PCA) is a transformation of the data.
  • PCA transforms the variables to principal components.
  • Factor analytic model assumes that the data comes from a well-defined model where the underlying factors satisfy assumptions mentioned before.
  • The emphasis in factor analysis is that the factors map to the variables but specific factors are explicitly assumed to be "noise".
13 / 35

Between environment genetic correlation matrix

  • Negative genetic correlation estimate indicate cross-over interaction.
  • Positive genetic correlation estimate indicate noncross-over interaction.
  • Estimating the between environment genetic covariance is dependent on the number of varieties in common between trials (which we refer to as connectivity).
14 / 35

Notes on FA model

  • If trials are completely disconnected then between environment genetic covariance cannot be reliably computed.

  • The genetic regression residuals represent non-repeatable variety effects for the given the model and set of environments.

What set of trials for MET analysis?

  • You would want a representative sample of environments, both in a geographic and seasonal sense, a relevant set of varieties and reasonable connectivity between pairs of trials.
15 / 35

Practice

16 / 35

Example: CAIGE 2017 MET Bread Wheat Yield Trials

caige <- read.csv("tutorials/CAIGE2017.csv") %>% # change to your relative path
mutate(Row=factor(Row), Column=factor(Column), Year=factor(Year))
skimr::skim(caige)
Skim summary statistics
n obs: 2484
n variables: 6
── Variable type:factor ─────────────────────────────────────────────────────────────────────────────────────
variable missing complete n n_unique ordered
Column 0 2484 2484 24 FALSE
Geno 0 2484 2484 235 FALSE
Row 0 2484 2484 32 FALSE
Trial 0 2484 2484 7 FALSE
Year 0 2484 2484 1 FALSE
── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────
variable missing complete n mean sd p0 p25 p50 p75 p100
Yield 3 2481 2484 3.77 1.58 0.51 2.41 3.73 4.82 8.68
hist
▁▇▆▇▆▂▂▁
17 / 35

Exploratory Data Analysis: Yield

gg <- desplot(Yield ~ Column + Row | Trial, data=caige,
main="", gg=T) + theme_bw()
plotly::ggplotly(gg)

Get to know your data well before modelling.



Check for outliers and follow up with the data manager.



For this, we will assume the data is well behaved and there are no outliers to keep it simple, but of course, in practice that should not be the case.

18 / 35

EDA: Genotype Replication per Trial

tt <- table(caige$Geno, caige$Trial)
head(tt, 2)
Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay
G1 2 2 2 2 2 2 2
G10 2 2 2 2 3 2 2
prep <- function(x) 100 * round(prop.table(table(factor(x, levels=1:3))), 3)
apply(tt, 2, prep)
Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay
1 44.1 48.4 32.8 21.4 38.9 35.9 44.8
2 54.0 51.6 66.8 77.8 59.3 64.1 55.2
3 1.9 0.0 0.4 0.9 1.8 0.0 0.0

The trials employ a partially replicated design.



Most genotype is replicated 1-2 times at each trial with small portion replicated 3 times.

19 / 35

EDA: Genotype Concurrence Matrix

tt <- tt > 0
conc <- t(tt) %*% tt
conc
Balaklava Horsham Junee Narrabri Northstar Roseworthy Toodyay
Balaklava 213 190 213 213 213 213 201
Horsham 190 190 190 190 190 190 190
Junee 213 190 229 229 221 229 201
Narrabri 213 190 229 234 221 233 201
Northstar 213 190 221 221 221 221 201
Roseworthy 213 190 229 233 221 234 201
Toodyay 201 190 201 201 201 201 201

Sometimes simply referred to as "connectivity".

20 / 35

EDA: Genotype Concurrence Matrix (Prettier)

  • Diagonal entries show the number of genotypes at the corresponding trial.
  • Off diagonal entries show the number of genotypes common between the trials shown in the row and column labels.
  • The connectivity between all pairs of trial is good so we should not have much problem estimating the genetic covariance between trials.
21 / 35

Modelling: DIAG model

fit_diag <- asreml(Yield ~ Trial, data=caige, trace=F,
random=~at(Trial):Row + at(Trial):Column +
diag(Trial):Geno,
rcov=~at(Trial):ar1(Column):ar1(Row))
fit_diag <- update(fit_diag)
vc_diag <- lucid::vc(fit_diag) %>% rename(Gamma=effect)
  • In the first instance, we will normally do spatial modelling however we have not here and assume that the addition of random Row and random Column effects along with separable autoregressive process of order one are sufficient.
  • Balaklava, Roseworthy and Toodyay exhibit concerns with low heritability. There could be more done (e.g. discussing with trial managers and experts) to address this point before proceeding with the FA model.
  • For simplicity, we shall assume there is no concern and proceed on.
22 / 35

Modelling: Initialising the FA1 model

  • Recall bad starting values in Newton-Raphson can make convergence go stray.
  • When fitting complex model, we will build it up from simpler ones using the fit of the simpler models as initial values for complex ones.
sv_fa1 <- asreml(Yield ~ Trial, data=caige,
random=~at(Trial):Row + at(Trial):Column +
fa(Trial, 1):Geno, start.values=T,
rcov=~at(Trial):ar1(Column):ar1(Row))
# replacing some initial values from the DIAG model
sv_fa1$gammas.table <- sv_fa1$gammas.table %>%
left_join(vc_diag, by="Gamma") %>%
mutate(Value=ifelse(!is.na(component), component, Value)) %>%
select(Gamma, Value, Constraint)
  • This step is not absolutely critical but can be helpful in particular with larger data, as starting from a value closer to the solution, there is likely less iteration needed.
23 / 35

Modelling: FA1 model

fit_fa1 <- asreml(Yield ~ Trial, data=caige, trace=F,
random=~at(Trial):Row + at(Trial):Column +
fa(Trial,1 ):Geno,
rcov=~at(Trial):ar1(Column):ar1(Row),
G.param=sv_fa1$gammas.table, R.param=sv_fa1$gammas.table)
vc_fa1 <- lucid::vc(fit_fa1) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))

Percentage between genetic variance explained

sum_fa1 <- myf::summary.fa(fit_fa1)
sum_fa1$gammas[[1]]$`site %vaf`
fac_1
Balaklava 51.67980
Horsham 74.87380
Junee 70.12005
Narrabri 58.74877
Northstar 11.26292
Roseworthy 17.74837
Toodyay 55.19469
sum_fa1$gammas[[1]]$`total %vaf`
[1] 52.26278
24 / 35

Modelling: initialising FA2 model

sv_fa2 <- asreml(Yield ~ Trial, data=caige,
random=~at(Trial):Row + at(Trial):Column +
fa(Trial, 2):Geno, start.values=T,
rcov=~at(Trial):ar1(Column):ar1(Row))
# need to change fa(Trial, 1) to fa(Trial, 2)
# so it will match up
vc_fa1 <- vc_fa1 %>% mutate(
Gamma=ifelse(
grepl("fa(Trial, 1)", Gamma, fixed=T),
gsub("fa(Trial, 1)", "fa(Trial, 2)", Gamma, fixed=T),
Gamma))
# replacing some initial values from the FA1 model
sv_fa2$gammas.table <- sv_fa2$gammas.table %>%
left_join(vc_fa1, by="Gamma") %>%
mutate(Value=ifelse(!is.na(component), component, Value)) %>%
select(Gamma, Value, Constraint)
  • As the first factor only explained 52% of the overall between trial genetic variance, we proceed to increase the order of the FA model.
  • The threshold is arbitrary but you may use 80%.
  • Recall for 7 trials, the maximum order of the FA model we allow for is 3.
25 / 35

Modelling: FA2 model

fit_fa2 <- asreml(Yield ~ Trial, data=caige, trace=F,
random=~at(Trial):Row + at(Trial):Column +
fa(Trial, 2):Geno,
rcov=~at(Trial):ar1(Column):ar1(Row),
G.param=sv_fa2$gammas.table, R.param=sv_fa2$gammas.table)
fit_fa2 <- update(fit_fa2)
vc_fa2 <- lucid::vc(fit_fa2) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))

Percentage between genetic variance explained

sum_fa2 <- myf::summary.fa(fit_fa2)
sum_fa2$gammas[[1]]$`site %vaf`
sum_fa2$gammas[[1]]$`total %vaf`
[1] 64.3899
26 / 35

Modelling: initialising FA3 model

sv_fa3 <- asreml(Yield ~ Trial, data=caige,
random=~at(Trial):Row + at(Trial):Column +
fa(Trial, 3):Geno, start.values=T,
rcov=~at(Trial):ar1(Column):ar1(Row))
# need to change fa(Trial, 2) to fa(Trial, 3)
# so it will match up
vc_fa2 <- vc_fa2 %>% mutate(
Gamma=ifelse(
grepl("fa(Trial, 2)", Gamma, fixed=T),
gsub("fa(Trial, 2)", "fa(Trial, 3)", Gamma, fixed=T),
Gamma))
# replacing some initial values from the FA1 model
sv_fa3$gammas.table <- sv_fa3$gammas.table %>%
left_join(vc_fa2, by="Gamma") %>%
mutate(Value=ifelse(!is.na(component), component, Value)) %>%
select(Gamma, Value, Constraint)
  • The two factors explain 64% of the overall between trial genetic variance.
  • We increase the order of the FA model to the final possible one.
  • My coding is quite clumsy for such a repetitive task. I recommend you make into a function for many repetitive task such as this.
27 / 35

Modelling: FA3 model

fit_fa3 <- asreml(Yield ~ Trial, data=caige, trace=F,
random=~at(Trial):Row + at(Trial):Column +
fa(Trial, 3):Geno,
rcov=~at(Trial):ar1(Column):ar1(Row),
G.param=sv_fa3$gammas.table, R.param=sv_fa3$gammas.table)
vc_fa3 <- lucid::vc(fit_fa3) %>% rename(Gamma=effect) %>% mutate(Gamma=as.character(Gamma))

Percentage between genetic variance explained

sum_fa3 <- myf::summary.fa(fit_fa3)
sum_fa3$gammas[[1]]$`site %vaf`
sum_fa3$gammas[[1]]$`total %vaf`
[1] 71.5777
28 / 35

Between Environment Genetic Correlation

sum_fa3 <- myf::summary.fa(fit_fa3, heatmap.ord="cluster", g.list=c("G124", "G148", "G8", "G152", "G158", "G183"))
sum_fa3$heatmaps$`fa(Trial, 3):Geno`

  • The trials are ordred by agnes clustering algorithm based on the estimated between environment genetic correlation matrix.
sum_fa3$gammas[[1]]$Cmat
29 / 35

Latent Regression Plots

sum_fa3$regplots$`fa(Trial, 3):Geno`

30 / 35

Latent Regression Plots (ggplot style)

sum_fa3 <- myf::summary.fa(fit_fa3

To get the (regression) BLUP:

sum_fa3$blups[[1]]$blups.inmet

To get the (rotated) scores:

sum_fa3$blups[[1]]$scores.inmet

To get the rotated loadings:

sum_fa3$gammas[[1]]$`rotated loads`
31 / 35

Top 3 and Bottom 3 Genotypes

32 / 35

Raw Yield Mean vs. Regression BLUP

  • Your selection decision would be very different if chosen by a simple average.
  • In theory, modelling approaches should allow to delineate other sources of variation and provide more accurate and meaningful result for the aim of selection.
33 / 35

Key References

  • Smith et al. (2015) Factor analytic mixed models for the provision of grower information from national crop variety testing programs. Theoretical and Applied Genetics 128(1) 55-72
  • Cullis et al. (2014) Factor analytic and reduced animal models for the investigation of additive genotype-by-environment interaction in outcrossing plant species with application to a Pinus radiata breeding programme. Theoretical and Applied Genetics 127(10) 2193-2210
  • Mardia et al. (1995) Multivariate Analysis. Chapter 9.

Notes

  • Remember:

    "All models are wrong but some are useful." -George Box

  • Statistics can never compensate for poor experimental design or bad data.
  • Consulting with key experts (e.g. agronomists, trial managers, etc) if the model output makes sense.
34 / 35



Slides

These slides were made using the R package xaringan with the ninja-themes and is available at bit.ly/UT-WS-FAmodel.

Your Turn

Download day3-session02-FAmodel-tutorial.Rmd here, open in RStudio, push the button "Run Document" on the top tab and work through the exercises. For workshop participants, contact Emi for the tutorials.

Creative Commons LicenseCreative Commons LicenseCreative Commons License    This work by Emi Tanaka is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

35 / 35

Theory

2 / 35
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow