How should you fit mixed models when using ASReml for data analysis?

Dr. Arthur Gilmour

24 May 2022

ASReml caters for a wide range of linear mixed models. Some models are complex, involving hundreds of diverse variance components. Each model has its own characteristics, and some require special ‘tricks’ in their specification.

In addition, some datasets are enormous, with millions of fixed and random effects requiring the use of efficient computational tools, such as those exploiting sparsity. In the following sections we describe some of these challenges and provide some details to consider when choosing your model. 

Which model should you choose for data analysis with ASReml?

The model you choose for analysing data will vary depending on a number of factors, which include: 

  • Software processing time 
  • Software processing efficiency  
  • Incremental complexity 
  • Size of your data set

Read on to find out how you should choose your model in data analysis, and the potential risks associated with choosing the wrong mixed model. 

What should you consider when choosing a model for data analysis?

There are some widely used, well-founded mixed models, such as the animal model, the sire model, the maternal model and spatial analysis, as well as basic regression, factorial and (in)complete block analyses. These become the building blocks for multi-trait and multi-environment analyses.

Each data set has its own characteristics, and these should always be recognised. For example, litters in extensively grazed sheep are not biologically equivalent to litters in intensively housed pigs! Read on to find out which elements you should consider when choosing a mixed model for data analysis. 

Software processing time

Gilmour (2019, DOI: 10.1111/jbg.12398) describes the process of fitting linear mixed models as implemented in ASReml. That paper highlights the principle of minimising the amount of computation by the use of sparsity, an important requirement for fast processing.  

A second requirement is the efficiency of numerical calculations, and this has been significantly enhanced in ASReml 4.2, in particular by rearranging the computation. Consequently, this build of stand-alone ASReml is up to 100 times faster than ASReml 4.1 (released in 2019).

Software processing efficiency

In specifying a model in ASReml, the terms are organised in three groups. The first group is for small fixed factors, for which the order of fitting is important because significance testing uses Wald F statistics. The equations for these are likely to be relatively dense, especially after adjusting for the other terms in the model, but there are typically fewer than 1000 of them.

The second group is for random factors; a variance structure is declared for each.  

The third group is for large fixed effects (no variance structure). After forming the mixed model equations, ASReml reorders the equations in the last 2 groups to retain as much sparsity as possible.  

More sparsity means less computation (fewer elements of the sparse inverse to compute). Generally, this reordering has no impact on the answers (effects estimated) unless the fixed model is over-parameterised, in which case different equations may be declared singular. 
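To see why the order of the equations matters, the toy sketch below counts the "fill-in" (new non-zero elements) created when a sparse symmetric system is eliminated in two different orders. It is not ASReml's ordering algorithm; the function name `count_fill_in` and the star-shaped example (one equation linked to five others) are assumptions made purely for illustration.

```python
from itertools import combinations


def count_fill_in(adjacency, order):
    """Count new non-zeros created by eliminating equations in `order`.

    `adjacency` maps each equation to the set of equations it shares a
    non-zero off-diagonal element with (hypothetical toy representation).
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fill = 0
    for v in order:
        neighbours = graph.pop(v)
        for n in neighbours:
            graph[n].discard(v)
        # Eliminating v links every pair of its remaining neighbours.
        for a, b in combinations(sorted(neighbours), 2):
            if b not in graph[a]:
                graph[a].add(b)
                graph[b].add(a)
                fill += 1
    return fill


# Toy system: equation 0 is linked to equations 1..5, which are
# otherwise unconnected.
n_leaves = 5
star = {0: set(range(1, n_leaves + 1))}
for leaf in range(1, n_leaves + 1):
    star[leaf] = {0}

hub_first = count_fill_in(star, [0, 1, 2, 3, 4, 5])
hub_last = count_fill_in(star, [1, 2, 3, 4, 5, 0])
print(hub_first, hub_last)  # 10 0
```

Eliminating the densely connected equation first fills in every pair among the remaining five, whereas leaving it until last creates no new non-zeros. The same system therefore needs far less work under the second ordering.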

More recent developments in ASReml include storing and processing the sparse matrices more efficiently and changing the order of computation to reduce reading and writing to memory. It can also successfully access more RAM, allowing even larger models to be fitted. 

Incremental complexity

It is important to perform univariate/single-site analyses first, before considering multivariate/multi-environment analyses, since the latter will fail to estimate a covariance if the former fail to establish that there is variance.

In a recent case, a user wanted to fit a trivariate animal model with repeated records. The trivariate analysis failed. The model involved additive genetic effects (animal effects estimated with a numerator relationship matrix), permanent environment (PE) effects (uncorrelated animal effects), fixed contemporary group effects and correlated residuals. However, for the second trait there was no PE variance, and for the third trait there were no repeat observations, although the data was set up with the animal effect repeated.

Consequently, with no ‘residual’ (sampling) variance, univariate analysis of the third trait failed. After modifying the data file so that the third trait only appeared once for each animal, the trivariate analysis was performed with 3 variances and covariances at the genetic level, 2 variances and a covariance at the animal level (1 and 3) and 2 variances and a covariance at the sampling level (1 and 2).  
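A simple check on the data before fitting can flag this kind of problem. The sketch below is a hypothetical illustration, not part of ASReml: the record layout, the field names (`animal`, `trait1`, `trait3`) and the function `has_within_animal_variation` are all assumptions. It reports whether any animal has more than one distinct value for a trait, which is a precondition for separating animal-level variance from sampling variance.

```python
from collections import defaultdict


def has_within_animal_variation(records, trait):
    """Return True if at least one animal has two or more distinct values."""
    values = defaultdict(set)
    for row in records:
        value = row.get(trait)
        if value is not None:
            values[row["animal"]].add(value)
    return any(len(v) > 1 for v in values.values())


# Hypothetical data: trait1 is measured repeatedly, trait3 was measured
# once per animal but copied onto every record for that animal.
records = [
    {"animal": "A1", "trait1": 4.1, "trait3": 10.0},
    {"animal": "A1", "trait1": 4.3, "trait3": 10.0},
    {"animal": "A2", "trait1": 3.9, "trait3": 12.5},
    {"animal": "A2", "trait1": 4.0, "trait3": 12.5},
]

for trait in ("trait1", "trait3"):
    print(trait, has_within_animal_variation(records, trait))
```

A trait that returns False here has no genuine repeat observations, so a repeated-records model for that trait is unlikely to be estimable.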

Data set size

Many datasets are quite big and consequently models become quite large. The issue then becomes one of sparsity. With field trials, a spatial analysis based on an autoregressive correlation of plots within rows and within columns has become common because it accommodates most field correlation patterns well and is sparse in that only immediate neighbours are connected in the inverse residual variance matrix.  
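The sparsity of the autoregressive structure can be checked numerically in one dimension. The sketch below assumes a first-order autoregressive (AR1) correlation with a hypothetical parameter `rho` for a single row of `n` plots (`n` of at least 2). It builds the dense correlation matrix, builds the well-known tridiagonal form of its inverse, and confirms that their product is the identity. The helper names are illustrative and not ASReml functions.

```python
def ar1_correlation(n, rho):
    """Dense AR1 correlation matrix: element (i, j) is rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def ar1_inverse(n, rho):
    """Tridiagonal inverse of the AR1 correlation matrix (requires n >= 2)."""
    scale = 1.0 / (1.0 - rho * rho)
    inv = [[0.0] * n for _ in range(n)]
    for i in range(n):
        end_plot = i in (0, n - 1)
        inv[i][i] = scale * (1.0 if end_plot else 1.0 + rho * rho)
        if i + 1 < n:
            inv[i][i + 1] = inv[i + 1][i] = -rho * scale
    return inv


def matmul(a, b):
    """Plain-Python matrix product, adequate for a small check."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


n, rho = 5, 0.6
product = matmul(ar1_correlation(n, rho), ar1_inverse(n, rho))
max_error = max(
    abs(product[i][j] - (1.0 if i == j else 0.0))
    for i in range(n)
    for j in range(n)
)
nonzeros = sum(1 for row in ar1_inverse(n, rho) for x in row if x != 0.0)
print(max_error < 1e-9, nonzeros, n * n)
```

The correlation matrix itself is completely dense, but its inverse has non-zero elements only on the diagonal and between adjacent plots. For a two-dimensional field with separable row and column processes, the inverse is the Kronecker product of two such matrices and stays sparse.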

Similarly, very large data sets can be analysed using the numerator relationship matrix to define genetic links, because the inverse is sparse with links between parents and their offspring and between parents with common offspring. This sparsity is exploited successfully by ASReml. 
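The pattern of links in the inverse can be seen by building it directly from a pedigree. The sketch below uses Henderson's rules under the simplifying assumption of no inbreeding. The five-animal pedigree and the function name `sparse_a_inverse` are hypothetical, and this is an illustration rather than ASReml's implementation.

```python
from collections import defaultdict


def sparse_a_inverse(pedigree):
    """Inverse numerator relationship matrix as a dict of non-zero elements.

    `pedigree` is a list of (animal, sire, dam) tuples with None for an
    unknown parent. Assumes no inbreeding.
    """
    ainv = defaultdict(float)
    for animal, sire, dam in pedigree:
        parents = [p for p in (sire, dam) if p is not None]
        # Multiplier depends on the number of known parents.
        alpha = {0: 1.0, 1: 4.0 / 3.0, 2: 2.0}[len(parents)]
        ainv[(animal, animal)] += alpha
        for p in parents:
            ainv[(animal, p)] -= alpha / 2.0
            ainv[(p, animal)] -= alpha / 2.0
            for q in parents:
                ainv[(p, q)] += alpha / 4.0
    return dict(ainv)


# Hypothetical pedigree: A, B and C are founders; D and E are half-sibs
# sharing sire A.
pedigree = [
    ("A", None, None),
    ("B", None, None),
    ("C", None, None),
    ("D", "A", "B"),
    ("E", "A", "C"),
]

ainv = sparse_a_inverse(pedigree)
print(ainv[("D", "A")])          # parent-offspring link
print(ainv[("A", "B")])          # mates with a common offspring
print(ainv.get(("D", "E"), 0))   # half-sibs: no direct link in the inverse
```

Half-sibs D and E are related, so the relationship matrix itself has a non-zero element for them, yet the inverse has none. Only parent-offspring pairs and mates are linked, which is why the inverse stays sparse even for very large pedigrees.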

Similarly, if we want a covariance matrix across many traits/environments, it is likely to be over-parameterised. The Factor Analytic variance structure allows the matrix to be estimated with a reduced parameterisation (in the spirit of principal components) and has been formulated to fit more sparsely as loadings plus specific variances. This formulation also helps avoid covariance matrices that are not positive definite.
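As a rough illustration of the saving, the sketch below compares parameter counts for an unstructured covariance matrix and a factor analytic structure of the form "loadings times loadings transposed, plus diagonal specific variances". The function names and the numerical values are hypothetical. The count for k factors subtracts k(k-1)/2, the usual number of constraints needed to fix the rotation of the loadings.

```python
def unstructured_parameter_count(p):
    """Variances and covariances in a p x p unstructured matrix."""
    return p * (p + 1) // 2


def factor_analytic_parameter_count(p, k):
    """Loadings plus specific variances, less rotational constraints."""
    return p * k + p - k * (k - 1) // 2


def factor_analytic_covariance(loadings, specific):
    """Build Sigma = Lambda Lambda' + Psi from loadings and specific variances."""
    p = len(loadings)
    return [
        [
            sum(li * lj for li, lj in zip(loadings[i], loadings[j]))
            + (specific[i] if i == j else 0.0)
            for j in range(p)
        ]
        for i in range(p)
    ]


print(unstructured_parameter_count(10))        # 55
print(factor_analytic_parameter_count(10, 1))  # 20
print(factor_analytic_parameter_count(10, 2))  # 29

# Hypothetical one-factor structure for three environments.
loadings = [[1.0], [2.0], [0.5]]
specific = [0.5, 1.0, 0.25]
sigma = factor_analytic_covariance(loadings, specific)
print(sigma)
```

Because the first term is a matrix multiplied by its own transpose and the specific variances are kept positive, a matrix built this way is positive definite by construction.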

If you’re having a problem with missing data, read our blog post, where we discuss understanding the reasons for the missing data and how to apply appropriate methods to account for it. 

What issues will the wrong model cause in my data analysis?

Potential issues include the distribution of the data (binary, Normal, skewed), variance heterogeneity, outliers and missing values. In addition, there is the purpose of the analysis: to select individuals for breeding, to test a treatment hypothesis, to estimate heritability, or to understand genetic and phenotypic correlations. All of these affect the type and structure of the model fitted.

If you are interested in learning more about which model and statistical method are most appropriate for your data set, read our recent blog post.

Final thoughts on data analysis models

The linear mixed model, as implemented in ASReml, has allowed immense advances in global food productivity and health (plant, animal and human) via quantitative and marker-based genetics over the last 25 years. 

ASReml-R and ASReml-SA are designed to cater for a wide range of linear mixed models, being extensively used in plant and animal breeding, forestry and for analysis of human epidemiological data. That breadth of coverage is sometimes daunting to new users as most options are only relevant to particular situations.  

VSNi is committed to helping new and experienced users identify the most appropriate model for their data. Although VSNi does not provide an analysis service as such, the developers of ASReml have broad experience as biometricians/statisticians.

If you’re looking for user guides, video tutorials, reference manuals, or anything else on the ASReml software, view our knowledge base. If you have any questions or want advice on using the data analysis software, get in touch with our team, who will be happy to help. 

About the author

Arthur Gilmour obtained his BSc Agr from Sydney University majoring in Biometry in 1970 with a NSW Government traineeship. He served as a biometrician until his retirement from NSW Agriculture as a Principal Research Scientist in 2009. The generalised linear mixed model has been the major research interest of Dr Gilmour, motivated by problems arising in research data generated by agricultural scientists.  

From the outset, he was involved in software development to meet the current statistical analysis needs of his clients and colleagues. He obtained his PhD in animal breeding from Massey University in 1983, during which time he came into contact with Robin Thompson. This led to an ongoing collaboration, also involving Brian Cullis, which resulted in the development of ASReml in 1996.