Symbolic formulae for linear mixed models

The R Consortium Project

Author

Affiliation

Emi Tanaka

 

Published

March 6, 2019

Citation

Tanaka, 2019

The project is in colloboration with Dr. Francis Hui (ANU) and Dr. Max Kuhn (RStudio).

The Problem

Symbolic model formulae define the structural component of a statistical model in an easier and often more accessible terms for practitioners. The earlier instance of symbolic model formulae for linear models was applied in Genstat with further generalisation by . describe the symbolic model formulae implementation for linear models in the S language which remains much the same in the R language .

Linear mixed models (LMMs) are widely used across many disciplines (e.g. ecology, psychology, agriculture, finance etc) due to its flexibility to model complex, correlated structures in the data. With over 10,000 citations and 5.3 million total downloads on CRAN, lme4 is arguably the most popular R-package to fit LMMs. Another popular R-package that fit LMMs with flexible covariance structures is asreml , which wraps the proprietary software ASreml into the R framework. asreml with its core algorithm collectively has over 4,000 citations and remains popular, in particular for the analysis of plant breeding trials, due to the ease of fitting flexible covariance structures despite the license cost for its continued use.

While the symbolic formula of linear models generally have a consistent representation and evaluation rule as implemented in stats::formula, this is not the case for LMMs (and mixed models more generally). The inconsistency of symbolic formulae arises mainly in the representation of random effects, with the additional need to specify the variance-covariance structure of the random effects as well as structure of the associated model matrix that governs how the random effects are mapped to (groups of) the observational units. For example, asreml::asreml have separate formulation of fixed and random effects while lme4::lmer have a single formula that includes a mixture of fixed and random effects. Further differences with motivating examples are shown under The Plan. The differences give rise to confusion of equivalent model specification in different R-packages.

The lack of consistency in symbolic formula and model representation across mixed model software motivates the need to formulate a unified symbolic model formulae for LMMs with: (1) extension of the evaluation rules described in ; and (2) ease of comprehension of the specified model for the user. This symbolic model formulae can be a basis for creating a common API to mixed models with wrappers to popular mixed model R-packages (with initial focus on lme4 and asreml), thereby achieving a similar feat to parsnip R-package which implements a tidy unified interface to many predictive modelling functions (e.g. random forest, logistic regression, survival models etc).

The Plan

Overview

We propose to implement a unified symbolic model formulae for LMMs with extended evaluation rules relevant to a plethora of different variance-covariance structures for the random effects, as based on continuous consultation and feedback from the community. Additionally, we propose to extend the parsnip R-package to implement a tidy unified interface to LMMs. The proposal will fill a growing demand in the current software market for a much needed (1) unified symbolic model formulae and its evaluation rules of LMMs with wrappers for popular R-packages, such as lme4 and asreml, that fit LMMs; and (2) a markup language output of the mathematical notation of the specified model.

Details

The translation of mixed model specification in different R-packages is not straightforward where the main difference in the specification of the random effects. In this section, we illustrate some key differences in the symbolic model formulae for lme4 and asreml by considering two motivating examples as below. Note that there are range of models that can only be fitted in asreml or take great user effort to fit in lme4. Further examples will be progressively added here when the author has more time.

Example 1: Plant Breeding

We consider a common experiment in plant breeding where m crop varieties are tested at t sites. We present the analysis of this multi-environmental trial as

yield ~ site + site:variety

where site:variety is modelled as a random effect . Random effects, unlike standard linear models, require additional specification of its variance-covariance structure. In the first instance, we assume that site-by-variety effect is Gaussian, independent and identically distributed (i.i.d.). This model is fitted as below in lme4 and asreml.

Model 1 i.i.d. var(site:variety)=σ2svImt
lme4 lmer(yield ~ site + (1|site:variety))
asreml asreml(yield ~ site, random=~idv(site):id(variety))

where σ2sv is the variance component that will be estimated from the data and Imt is an mt×mt identity matrix. Here for asreml, idv(site) specifies the structure σ2svIt, id(site) specifies the structure Im and idv(site):id(variety) equates to σ2svItIm=σ2svImt. For lme4, (1|site:variety) specifies that we fit a random intercept for every site-by-variety effect with a common variance.

As the same crop variety are grown across multiple sites, it maybe more realistic to consider that the performance of the same variety will be correlated across sites. This in esssence results in the borrowing of strength across sites to achieve improved prediction in site-by-variety effect. In this instance we fit the model with the variance-covariance structure of site-by-variety effect as ΣtIm, where Σt is a t×t unstructured matrix.

Model 2 Borrowing strength across sites var(site:variety)=ΣtIm
lme4 lmer(yield ~ site + (site - 1|variety))
asreml asreml(yield ~ site, random=~us(site):id(variety))

Here the symbolic formulation in asreml is straight forward, however, lme4 requires further explanation. For every level of the grouping factor specified after the | (variety here), site - 1 fits a random (slope) effect for each site removing the overall intercept effect. These random slope effect will be correlated within each level of variety.

Example 2: Repeated Measures

Consider an example where we have m subjects, each of which we measure some quantitative response over t regular time points. We fit the so-called random intercept and slope model yij=β0+β1xj+b0i+b1ixj+ϵij,

where yij is the response of subject i at the j-th time point; β0 is the overall intercept; β1 is the overall slope for time; b0i and b1i are the random intercept and the random time slope for subject i; and xj is the j-th time point (or a time-varying covariate e.g., age or weight); and ϵij is the random error. In this model we assume that [b0ib1iϵij]N([000],[σ20σ010σ01σ21000σ2]).
This is fitted as follows.

Model 3 Random intercept and slope model
lme4 lmer(response ~ 1 + (Time|Subject))
asreml asreml(response ~ 1, random=~str(Subject + Subject:Time, ~us(2):id(m))

Acknowledgments

This article is written using radix using RStudio IDE and statistical computing tool R.

Reuse

Text are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/emitanaka/r/_posts.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Footnotes

    References

    1. Symbolic Description of Factorial Models for Analysis of Variance
      Wilkinson, G.N. and Rogers, C.E., 1973. Journal of the Royal Statistical Society: Series C (Applied Statistics), Vol 22(3), pp. 392--399.
    2. Statistical Models in S
      Chambers, J.M. and Hastie, T.J., 1993. , pp. 227. . DOI: 10.2307/1269676
    3. An Introduction to R
      Venables, W.N., Smith, D.M. and Team, T.R.C., 2018.
    4. Fitting Linear Mixed-Effects Models using lme4
      Bates, D., Machler, M., Bolker, B. and Walker, S., 2015. Journal of Statistical Software, Vol 67(1). DOI: 10.18637/jss.v067.i01
    5. Mixed models for S language environments ASReml-R reference manual
      Butler, D.G., Cullis, B.R., Gilmour, A.R. and Gogel, B.J., 2009.
    6. {ASReml user guide release 3.0}
      Gilmour, A.R., Gogel, B.J., Cullis, B.R. and Thompson, R., 2009.
    7. parsnip: A Common API to Modeling and analysis Functions[link]
      Kuhn, M., 2018.
    8. That BLUP is a Good Thing : The Estimation of Random Effects
      Robinson, G.K., 1991. Statistical Science, Vol 6(1), pp. 15--32.
    9. radix: 'R Markdown' Format for Scientific and Technical Writing[link]
      Allaire, J., Iannone, R. and Xie, Y., 2018.
    10. R: A Language and Environment for Statistical Computing[link]
      R_Core_Team, ., 2018.

    Citation

    For attribution, please cite this work as

    Tanaka (2019, March 7). Savvy Statistics: Symbolic formulae for linear mixed models. Retrieved from https://emitanaka.github.io/r/posts/2019-03-07-symbolic-formulae-for-linear-mixed-models/

    BibTeX citation

    @misc{symboliclmm,
      author = {Tanaka, Emi},
      title = {Savvy Statistics: Symbolic formulae for linear mixed models},
      url = {https://emitanaka.github.io/r/posts/2019-03-07-symbolic-formulae-for-linear-mixed-models/},
      year = {2019}
    }