Dr. Salvador A. Gezan · 11 October 2021
I have been a user of ASReml-R for more than 10 years. I started in the pre-genomics era and I recall fitting an animal model during my PhD with more than half a million records in my pedigree file without any issues. This, back then, took a few minutes, and I was always excited to see a large file with all the breeding value estimates. Nowadays, this takes even less time, given that computers have more memory and computing power.
Now with the genomics era here, I have been moving from this pedigree-based animal model to a genomic-based animal model. The theory is simple: just replace your numerator relationship matrix A with a genomic-based relationship matrix G and you are now doing what is known as GBLUP. So, the process seems simple and straightforward, but it is not!
My first surprise was the level of complexity associated with generating the G matrix. First, I have to apply several pre-processing steps to my molecular data (often SNPs) in order to calculate my G matrix. But even after I obtain it, I still need a few more tricks (such as blending or bending) to tune up my G matrix so that it is invertible and well-conditioned.
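To make this pre-processing step concrete, here is a minimal sketch (in Python with numpy rather than R, purely for illustration) of building a VanRaden-style G matrix from a SNP matrix and then 'blending' it with the identity so it becomes invertible. The 0.02 blending weight, and all names, are my own illustrative choices, not a recommendation.

```python
import numpy as np

def vanraden_G(M, blend=0.02):
    """Genomic relationship matrix from an (n x m) SNP matrix coded 0/1/2
    (VanRaden's first method), blended with the identity so the result
    is positive definite and therefore invertible."""
    p = M.mean(axis=0) / 2.0                  # observed allele frequencies
    Z = M - 2.0 * p                           # centre each marker column
    denom = 2.0 * np.sum(p * (1.0 - p))       # VanRaden scaling factor
    G = Z @ Z.T / denom                       # raw genomic relationships
    # 'Blending': shrink slightly towards the identity (in practice,
    # blending with a pedigree-based A22 is also common).
    return (1.0 - blend) * G + blend * np.eye(G.shape[0])

rng = np.random.default_rng(42)
M = rng.integers(0, 3, size=(50, 500)).astype(float)  # 50 genotypes, 500 SNPs
G = vanraden_G(M)
print(np.linalg.eigvalsh(G).min() > 0)  # True: the blended G is invertible
```

Because the raw Z Zᵀ term is only positive semi-definite, the small identity shift guarantees every eigenvalue is strictly positive, which is exactly what the inversion step needs.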
However, the biggest shock comes once I use this ‘tuned-up’ G matrix in my GBLUP analysis with the ASReml-R library (v. 4.1). Typically my analyses consider only two to four thousand genotyped individuals (a much smaller matrix than my old pedigree-based A matrix with half a million individuals). But to my surprise, the analysis often takes hours, or fails because of insufficient RAM. What is going wrong?
In order to understand this, let’s first see why the old A matrix worked so well. ASReml-R (and also ASReml-SA) generates the inverse of the A matrix in sparse form. This means that it is stored (and used) as a half-diagonal matrix, and only the non-zero values are kept and used for calculations. Indeed, ASReml uses sparse matrix methods for most of the calculations. This not only reduces the amount of memory needed for storage but also saves computing time. Interestingly, the A matrix I was fitting in the past was generated from only three generations where many of the individuals are unrelated, and therefore, I had a very empty (or sparse) matrix of additive relationships. This means that even though my matrix was very large, in reality it was full of zeros. I estimate that only 5-10% of the values were non-zero. The limited amount of non-zero data makes the analysis very efficient.
Now with the G matrix, even if only a 2,000 x 2,000 matrix is considered (and stored as half-diagonal), it will be full of non-zero values (probably all of them, roughly 2 million entries in the half-diagonal alone!). And typically, its inverse is also full of non-zero values. This is expected from a genomic matrix, as every genotype is related to every other genotype (even if only with a very low level of relatedness). As you would expect, this matrix is much denser than my old A matrices, so its analysis requires far more memory and cannot exploit the sparse matrix methodology within ASReml, leading to the problems I mentioned above.
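To get a feel for the scale of the problem, here is a back-of-the-envelope calculation (plain Python; the sizes are illustrative) of the memory needed just to store the half-diagonal of a dense relationship matrix in double precision, before any factorization work (which itself grows roughly with n³ for a dense matrix):

```python
def half_store_bytes(n, bytes_per_value=8):
    """Bytes to store the lower triangle (incl. diagonal) of an n x n
    matrix in double precision."""
    return n * (n + 1) // 2 * bytes_per_value

# Dense G matrices of increasing size: storage alone grows quadratically.
for n in (2_000, 20_000, 100_000):
    gb = half_store_bytes(n) / 1e9
    print(f"n = {n:>7,}: {gb:10.2f} GB")
```

A 2,000-genotype dense G is tiny (~0.02 GB), but at 100,000 genotypes the half-diagonal alone is ~40 GB, which is why density, not raw dimension, is the real constraint.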
In addition, if I am using the R environment, then I have extra complications. The ASReml-R library takes the input data frames and matrices and generates potentially large output objects (e.g., solutions). This means R must keep ALL of the pre- and post-analysis objects in memory. As you can imagine, this can consume all of my laptop’s RAM, leaving little room for the statistical analysis itself. In contrast, ASReml-SA reads the phenotypic data and generates the design matrices, but it does not keep the original data in memory. ASReml-SA also writes output (for example the BLUEs and BLUPs) to files as it is generated; hence, memory is used in a more efficient way.
Not all is lost! There are several tricks to fit these models and get your results. Below I am going to list (and explain) some of the options that I have used/discovered to make my GBLUP analyses run with good success. This is not an exhaustive list of tricks, but hopefully they will give you a starting, and ideally finishing, point.
Using alternative software is a reasonable option, and there are several packages (and R libraries) that will likely handle specific analyses more efficiently. However, you have to be sure they support your complex linear mixed models. For example, some routines do not allow for several fixed or random effects, or they are limited in the error structure specification (e.g., it may not be possible to fit a heterogeneous or bivariate error). In my case, this is rarely a viable option, as the type of data that I normally work with requires advanced linear mixed modelling.
If your objective is to fit genomic prediction models, then GBLUP is one of many good options. Alternatively, you may be able to use a Bayesian approach (e.g., Bayes B) or a multiple regression algorithm, fitting ridge-regression or LASSO models. In all of these cases, you will have to rely on a different set of statistical assumptions. This option has worked for me, but only on a limited basis, as it is only possible with relatively simple models.
You can take advantage of the power and flexibility of ASReml within the stand-alone application, ASReml-SA. This executable software is more efficient, as indicated above, in terms of memory and size of the problem. There are some differences in notation and coding compared to ASReml-R but, as expected, the logic is very similar. I have used this strategy before with good success, and you might also notice that the latest ASReml-SA version 4.2 has several enhancements over the older 4.1 version.
Personally, I like the flexibility and control that the R environment provides, so much so that often I do not want to move away from R. But I have discovered some tricks that help me. Most of these rely on removing objects from memory that are no longer needed. For example, the G matrix is no longer required once you have obtained its inverse. Likewise, any intermediate matrix or data frame can be removed (in R, with rm() followed by gc() to release the memory). In addition, manipulation of large datasets can be done more easily and efficiently with purpose-built libraries such as data.table. Encouragingly, over the last few years newer versions of R have improved their memory management, and I suspect this will continue as large datasets become the norm.
On a few occasions I have had analyses that are still too large for my computer, or that I need to complete quickly due to a tight deadline. Here, I have used cloud computing through AWS (but Microsoft Azure is another option). It is possible to install RStudio and ASReml-R on almost unthinkable machine configurations. If you have the budget for these services, this is a good option; however, some effort is needed to set up these systems. Check this link for some good suggestions on how this effort can be minimised: https://www.louisaslett.com/RStudio_AMI/.
ASReml-R (and also ASReml-SA) has several options that are useful when fitting any linear mixed model. To speed up convergence, probably the most important is to provide good initial variance components - ones that will be close to the final estimated values. This has been especially useful when comparing several models. Additionally, you can move some of your fixed effects to the sparse portion of the model (by using sparse), which will reduce the computing load. Finally, if your model is too complex (say, a MET with several spatial components), then a good alternative is a two-stage analysis, where in the first stage you fit each environment separately, and in the second stage the predictions for each environment (and their weights) are used for the GBLUP analysis. This procedure may lose some information, but it is well established and much more likely to converge!
When fitting the linear mixed model, the only genotypes that contribute to estimating the variance components or effects are those that have phenotypic responses. This implies that trimming your G matrix down to only those genotypes with records in your phenotypic data can reduce its size considerably. This has been my approach, particularly in cases where there is a lot of historical genotyping that is not relevant to the analysis of interest.
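The trimming itself is simple index bookkeeping. Here is a hypothetical sketch (plain numpy, not ASReml-R; the identifiers are made up) of keeping only the rows and columns of G for genotypes that appear in the phenotypic data:

```python
import numpy as np

def trim_G(G, g_ids, pheno_ids):
    """Subset G (rows/columns ordered as in g_ids) to the genotypes that
    appear in the phenotypic data, preserving their original order."""
    pheno = set(pheno_ids)
    keep = [i for i, gid in enumerate(g_ids) if gid in pheno]
    return G[np.ix_(keep, keep)], [g_ids[i] for i in keep]

G = np.arange(16, dtype=float).reshape(4, 4)  # toy 4 x 4 'G matrix'
g_ids = ["g1", "g2", "g3", "g4"]
pheno_ids = ["g2", "g4"]                      # only these have records
G_small, kept = trim_G(G, g_ids, pheno_ids)
print(kept, G_small.shape)  # ['g2', 'g4'] (2, 2)
```

Keeping the order of g_ids intact matters: the trimmed matrix must stay aligned with the factor levels in the phenotypic data frame.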
This has been my main ‘statistical’ strategy for when I have used only the portion of the G matrix with phenotypic observations to fit the linear mixed model (as discussed above) but still need genomic predictions for the other genotypes. Here, after your model is fitted, you use the estimated genomic breeding values (GEBVs) together with the full G matrix to obtain what are called conditional predictions. More details are provided under the heading Conditional Prediction at the end of this blog. This is also a useful strategy any time you need predictions for newly genotyped individuals, as it does not require refitting the original genomic prediction model.
As you can see, there are several options to speed-up your analysis, but the “best” strategy will depend on the type of problem. Often, I have dealt with a small number of individuals with both genotypes and phenotypes, so using a portion of the G matrix has been a good trick; however, I expect that in the next few years the size of my G matrix will grow considerably.
Nevertheless, there is progress in other areas too. Current developments within VSNi on the core ASReml algorithms (the computational engine underlying the ASReml family of products, including ASReml-R) are focused on making this type of ‘dense’ model much easier and faster to fit.
Some of these features are already in the latest ASReml-SA version 4.2, but in the next few versions of ASReml-R I expect to see these and many more relevant computational improvements. In the near future, I am hoping to perform my GBLUP analyses with tens of thousands of individuals without any memory problem and at a reasonable speed. And, in the case of the R environment, I am also expecting progress in those aspects too.
The last thought from me relates to future scientific research and development. There is a need for additional statistical, numerical, and computational research in the area of GBLUP. I am sure there will be good ‘sparse’ approximation algorithms to use in place of the G matrix, or some clever matrix algebra that will produce better manipulations of our mixed model equations. We should all keep an eye out for new developments to see what is available and what is new to make these analyses more operationally manageable.
Let’s consider a genomic matrix G that is partitioned in the following way:

G = [ G11  G12 ]
    [ G21  G22 ]

Here, G11 (a matrix of dimension n1 x n1) corresponds to the portion of the genomic matrix for those individuals (group 1) for which we are interested in estimating GEBVs. Similarly, G22 (a matrix of dimension n2 x n2) contains the portion of the genomic matrix for those individuals (group 2) for which we already have GEBVs (these estimates are found in ĝ2, a vector of length n2). They were obtained from fitting a GBLUP for individuals with phenotypic records, but this can also be from any set of individuals with both genomic information and estimated GEBVs.

To obtain the ‘conditional’ estimates of the GEBVs for the individuals in group 1, which we will identify as the vector ĝ1 of length n1, we use the following expression of conditional prediction based on a multivariate normal distribution:

ĝ1 = G12 G22⁻¹ ĝ2

This expression requires the inverse of G22. Note that, in strict statistical rigour, this estimate should be written as ĝ1|ĝ2, corresponding to the estimation of g1 given (or conditional on) ĝ2.

If the variance-covariance matrix of the estimates ĝ2 is available, i.e. Var(ĝ2), then it is possible to estimate the variance-covariance matrix of ĝ1 using the expression:

Var(ĝ1) = G12 G22⁻¹ Var(ĝ2) G22⁻¹ G21
This can be used to calculate prediction error variances (PEVs) or accuracies.
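As a toy numerical sketch of the conditional prediction expression (plain numpy, not ASReml-R; the matrices and GEBVs below are made-up numbers for illustration):

```python
import numpy as np

def conditional_gebv(G12, G22, g2_hat):
    """g1_hat = G12 * inv(G22) * g2_hat, computed via a linear solve
    rather than an explicit inverse for numerical stability."""
    return G12 @ np.linalg.solve(G22, g2_hat)

# Toy example: 2 new genotypes (group 1), 3 genotypes with GEBVs (group 2).
G22 = np.array([[1.00, 0.25, 0.10],
                [0.25, 1.00, 0.30],
                [0.10, 0.30, 1.00]])
G12 = np.array([[0.50, 0.20, 0.05],
                [0.10, 0.40, 0.25]])
g2_hat = np.array([1.2, -0.4, 0.7])

print(conditional_gebv(G12, G22, g2_hat))  # conditional GEBVs for group 1
```

The same linear-transformation logic applies to the variance expression: since ĝ1 is a linear function of ĝ2, its variance-covariance matrix follows directly, and from it the PEVs and accuracies.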
Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world.
Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook “Statistical Methods in Biology: Design and Analysis of Experiments and Regression”.
The VSNi Team · 04 May 2021
What is a p-value?
A way to decide whether to reject the null hypothesis (H0) against our alternative hypothesis (H1) is to determine the probability of obtaining a test statistic at least as extreme as the one observed under the assumption that H0 is true. This probability is referred to as the “p-value”. It plays an important role in statistics and is critical in most biological research.
P-values are a continuum (between 0 and 1) that provide a measure of the strength of evidence against H0. For example, a p-value of 0.066 indicates that, if H0 were true, we would observe a test statistic as extreme as, or more extreme than, the one calculated with a probability of 6.6%. Note that this p-value is NOT the probability that our alternative hypothesis is correct; it is only a measure of how likely or unlikely we are to observe such extreme events, under repeated sampling, in reference to our calculated value. Also note that the p-value is obtained based on an assumed distribution (e.g., the t-distribution for a t-test); hence, the p-value depends strongly on your (correct or incorrect) assumptions.
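To make the definition concrete, here is a small self-contained illustration (plain Python, made-up data) of computing a p-value as the proportion of outcomes "at least as extreme" as the observed one under H0, via a two-sided permutation test on a difference in means:

```python
import random

random.seed(1)
a = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]   # e.g. treated group (made-up data)
b = [4.1, 4.3, 4.0, 4.2, 4.4, 3.9]   # e.g. control group (made-up data)
observed = abs(sum(a) / len(a) - sum(b) / len(b))

# Under H0 the group labels are arbitrary, so shuffle them repeatedly and
# count how often the re-labelled difference is at least as extreme.
pooled = a + b
n_perm, extreme = 10_000, 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:6]) / 6 - sum(pooled[6:]) / 6)
    if diff >= observed:
        extreme += 1

p_value = extreme / n_perm
print(p_value)  # small: strong evidence against H0 for these data
```

The permutation view makes the "under repeated sampling, assuming H0" part of the definition explicit, without relying on any distributional assumption.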
The smaller the p-value, the stronger the evidence for rejecting H0. However, it is difficult to determine what a small value really is. This leads to the typical guidelines of: p < 0.001 indicating very strong evidence against H0, p < 0.01 strong evidence, p < 0.05 moderate evidence, p < 0.1 weak evidence or a trend, and p ≥ 0.1 indicating insufficient evidence. There is strong debate about what this threshold should be, but declaring p-values as either significant or non-significant based on an arbitrary cut-off (e.g., 0.05 or 5%) should be avoided. As Ronald Fisher said:
“No scientific worker has a fixed level of significance at which, from year to year, and in all circumstances he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
A very important aspect of the p-value is that it does not provide any evidence in support of H0 – it only quantifies evidence against H0. That is, a large p-value does not mean we can accept H0. Take care not to fall into the trap of accepting H0! Similarly, a small p-value tells you that rejecting H0 is plausible, and not that H1 is correct!
For useful conclusions to be drawn from a statistical analysis, p-values should be considered alongside the size of the effect. Confidence intervals are commonly used to describe the size of the effect and the precision of its estimate. Crucially, statistical significance does not necessarily imply practical (or biological) significance. Small p-values can come from a large sample and a small effect, or a small sample and a large effect.
It is also important to understand that the size of a p-value depends critically on the sample size (as this affects the shape of our distribution). With a very large sample size, H0 may always be rejected, even with extremely small differences and even if H0 is nearly (i.e., approximately) true. Conversely, with a very small sample size, it may be nearly impossible to reject H0 even if we observe extremely large differences. Hence, p-values also need to be interpreted in relation to the size of the study.
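The sample-size effect is easy to demonstrate. Below is a small sketch (plain Python; the fixed 'true' difference of 0.01 and unit SD are made-up numbers) showing how the two-sided p-value of a z-test for a tiny, constant mean difference shrinks as n grows:

```python
import math

def two_sided_p(diff, sigma, n):
    """Two-sided p-value for a z-test comparing two group means that
    differ by `diff`, with per-group SD `sigma` and n per group."""
    se = sigma * math.sqrt(2.0 / n)          # standard error of the difference
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2.0))     # 2 * P(Z > z) for standard normal

# Same tiny difference (0.01 SD units), ever-larger samples:
for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: p = {two_sided_p(0.01, 1.0, n):.3g}")
```

With n = 10 the p-value is near 1; with a million observations per group the same negligible difference becomes "highly significant", which is exactly why significance must be read alongside effect size.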
Ganesh H. and V. Cave. 2018. P-values, P-values everywhere! New Zealand Veterinary Journal 66(2): 55-56.
Fisher R.A. 1956. Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh, UK.
Dr. Vanessa Cave · 10 May 2022
The essential role of statistical thinking in animal ethics: dealing with reduction
Having spent over 15 years working as an applied statistician in the biosciences, I’ve come across my fair share of animal studies. And one of my greatest bugbears is that the full value is rarely extracted from the experimental data collected. This could be because the best statistical approaches haven’t been employed to analyse the data, the findings are selectively or incorrectly reported, other research programmes that could benefit from the data don’t have access to it, or the data aren’t re-analysed following the advent of new statistical methods or tools that have the potential to draw greater insights from them.
An enormous number of scientific research studies involve animals, and with this come many ethical issues and concerns. To help ensure high standards of animal welfare in scientific research, many governments, universities, R&D companies, and individual scientists have adopted the principles of the 3Rs: Replacement, Reduction and Refinement. Indeed, in many countries the tenets of the 3Rs are enshrined in legislation and regulations around the use of animals in scientific research.
- Replacement: use methods or technologies that replace or avoid the use of animals.
- Reduction: limit the number of animals used.
- Refinement: refine methods in order to minimise or eliminate negative animal welfare impacts.
In this blog, I’ll focus on the second principle, Reduction, and argue that statistical expertise is absolutely crucial for achieving reduction.
The aim of reduction is to minimise the number of animals used in scientific research, balancing this against any additional adverse animal welfare impacts and without compromising the scientific value of the research. This principle demands that before carrying out an experiment (or survey) involving animals, the researchers must consider and implement approaches that both:

- minimise the number of animals used in the current study, and
- minimise future animal use.
Both these considerations involve statistical thinking. Let’s begin by exploring the important role statistics plays in minimising current animal use.
Reduction requires that any experiment (or survey) carried out must use as few animals as possible. However, with too few animals the study will lack the statistical power to draw meaningful conclusions, ultimately wasting animals. But how do we determine how many animals are needed for a sufficiently powered experiment? The necessary starting point is to establish clearly defined, specific research questions. These can then be formulated into appropriate statistical hypotheses, for which an experiment (or survey) can be designed.
Statistical expertise in experimental design plays a pivotal role in ensuring enough of the right type of data are collected to answer the research questions as objectively and as efficiently as possible. For example, sophisticated experimental designs involving blocking can be used to reduce random variation, making the experiment more efficient (i.e., increasing the statistical power achievable with fewer animals) as well as guarding against bias. Once a suitable experimental design has been decided upon, a power analysis can be used to calculate the required number of animals (i.e., determine the sample size). Indeed, a power analysis is typically needed to obtain animal ethics approval - a formal process in which the benefits of the proposed research are weighed up against the likely harm to the animals.
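As a hedged illustration of the kind of calculation a power analysis involves (plain Python, using the classical normal approximation for comparing two group means; the effect size, SD, significance level and power below are made-up numbers, not recommendations, and a real design should be discussed with a statistician):

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Animals per group needed to detect a mean difference `delta`
    between two groups with common SD `sigma`, at two-sided significance
    `alpha` and the requested power (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value
    z_beta = NormalDist().inv_cdf(power)            # power requirement
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# e.g. detect a 5-unit difference, SD of 8, 5% significance, 80% power:
print(n_per_group(delta=5.0, sigma=8.0))  # ≈ 41 animals per group
```

Note how the answer scales with the square of sigma/delta: halving the detectable difference quadruples the animals required, which is why reducing random variation through good design directly serves reduction.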
Researchers also need to investigate whether pre-existing sources of information or data could be integrated into their study, for example by means of a meta-analysis, enabling them to reduce the number of animals required. At the extreme end, data relevant to the research questions may already be available, eradicating the need for a new experiment altogether!
An obvious mechanism for minimising future animal use is to ensure we do it right the first time, avoiding the need for additional experiments. This is easier said than done; there are many statistical and practical considerations at work here. The following paragraphs cover four important steps in experimental research in which statistical expertise plays a major role: data acquisition, data management, data analysis and inference.
Above, I alluded to the validity of the experimental design. If the design is flawed, the data collected will be compromised, if not essentially worthless. Two common mistakes to avoid are pseudo-replication and the lack of (or poor) randomisation. Replication and randomisation are two of the basic principles of good experimental design. Confusing pseudo-replication (either at the design or analysis stage) for genuine replication will lead to invalid statistical inferences. Randomisation is necessary to ensure the statistical inference is valid and for guarding against bias.
Another extremely important consideration when designing an experiment, and setting the sample size, is the risk and impact of missing data due, for example, to animal drop-out or equipment failure. Missing data results in a loss of statistical power, complicates the statistical analysis, and has the potential to cause substantial bias (and potentially invalidate any conclusions). Careful planning and management of an experiment will help minimise the amount of missing data. In addition, safe-guards, controls or contingencies could be built into the experimental design that help mitigate against the impact of missing data. If missing data does result, appropriate statistical methods to account for it must be applied. Failure to do so could invalidate the entire study.
It is also important that the right data are collected to answer the research questions of interest. That is, the right response and explanatory variables measured at the appropriate scale and frequency. There are many statistics-related questions the researchers must answer, including: What population do they want to make inferences about? How generalisable do the findings need to be? What controllable and uncontrollable variables are there? Answers to these questions not only affect enrolment of animals into the study, but also the conditions they are subjected to and the data that should be collected.
It is essential that the data from the experiment (including metadata) are appropriately managed and stored to protect their integrity and ensure their usability. If the data get messed up (e.g., if different variables measured on the same animal cannot be linked), are undecipherable (e.g., if the attributes of the variables are unknown) or are incomplete (e.g., if the observations aren’t linked to the structural variables associated with the experimental design), the data are likely worthless. Statisticians can offer invaluable expertise in good data management practices, helping to ensure the data are accurately recorded, the downstream results from analysing the data are reproducible, and the data themselves are reusable at a later date, possibly by a different group of researchers.
Unsurprisingly, it is also vitally important that the data are analysed correctly, using the methods that draw the most value from it. As expected, statistical expertise plays a huge role here! The results and inference are meaningful only if appropriate statistical methods are used. Moreover, often there is a choice of valid statistical approaches; however, some approaches will be more powerful or more precise than others.
Having analysed the data, it is important that the inference (or conclusions) drawn are sound. Again, statistical thinking is crucial here. For example, in my experience, one all too common mistake in animal studies is to accept the null hypothesis and erroneously claim that a non-significant result means there is no difference (say, between treatment means).
The other important mechanism for minimising future animal use is to share the knowledge and information gleaned. The most basic step here is to ensure that all the results are correctly and non-selectively reported. Reporting all aspects of the trial, including the experimental design and statistical analysis, accurately and completely is crucial for the wider interpretation of the findings, reproducibility and repeatability of the research, and for scientific scrutiny. In addition, all results, including null results, are valuable and should be shared.
Sharing the data (or resources, e.g., animal tissues) also contributes to reduction. The data may be able to be re-used for a different purpose, integrated with other sources of data to provide new insights, or re-analysed in the future using a more advanced statistical technique, or for a different hypothesis.
Another avenue that should be explored is whether additional data or information can be obtained from the experiment, without incurring any further adverse animal welfare impacts, that could benefit other researchers and/or future studies, for example by helping address a different research question now or in the future. At the outset of the study, researchers must consider whether their proposed study could be combined with another one, whether the research animals could be shared with another experiment (e.g., animals euthanized for one experiment may provide suitable tissue for use in another), what additional data could be collected that may be (or is!) of future use, etc.
Statistical thinking clearly plays a fundamental role in reducing the number of animals used in scientific research, and in ensuring the most value is drawn from the resulting data. I strongly believe that, to achieve reduction, statistical expertise must be fully utilised throughout the duration of the project, from design through to analysis and dissemination of results, in all research projects involving animals. In my experience, most researchers strive for very high standards of animal ethics and absolutely do not want to cause unnecessary harm to animals. Unfortunately, the role statistical expertise plays here is not always appreciated or taken advantage of. So next time you’re thinking of undertaking research involving animals, ensure you have expert statistical input!
Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems. Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal, and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.
Kanchana Punyawaew and Dr. Vanessa Cave · 01 March 2021
Mixed models for repeated measures and longitudinal data
The term "repeated measures" refers to experimental designs or observational studies in which each experimental unit (or subject) is measured repeatedly over time or space. "Longitudinal data" is a special case of repeated measures in which variables are measured over time (often for a comparatively long period of time) and duration itself is typically a variable of interest.
In terms of data analysis, it doesn’t really matter what type of data you have, as you can analyze both using mixed models. Remember, the key feature of both types of data is that the response variable is measured more than once on each experimental unit, and these repeated measurements are likely to be correlated.
To illustrate the use of mixed model approaches for analyzing repeated measures, we’ll examine a data set from Landau and Everitt’s 2004 book, “A Handbook of Statistical Analyses using SPSS”. Here, a double-blind, placebo-controlled clinical trial was conducted to determine whether an estrogen treatment reduces post-natal depression. Sixty-three subjects were randomly assigned to one of two treatment groups: placebo (27 subjects) or estrogen treatment (36 subjects). Depression scores were measured on each subject at baseline, i.e. before randomization (predep), and at six two-monthly visits after randomization (postdep at visits 1-6). However, not all the women in the trial had their depression score recorded on all scheduled visits.
In this example, the data were measured at fixed, equally spaced, time points. (Visit is time as a factor and nVisit is time as a continuous variable.) There is one between-subject factor (Group, i.e. the treatment group, either placebo or estrogen treatment), one within-subject factor (Visit or nVisit) and a covariate (predep).
Using the following plots, we can explore the data. In the first plot below, the depression scores for each subject are plotted against time, including the baseline, separately for each treatment group.
In the second plot, the mean depression score for each treatment group is plotted over time. From these plots, we can see variation among subjects within each treatment group, that depression scores generally decrease with time, and that, on average, the depression score at each visit is lower with the estrogen treatment than with the placebo.
The simplest approach for analyzing repeated measures data is to use a random effects model with subject fitted as random. It assumes a constant correlation between all observations on the same subject. The analysis objectives can either be to measure the average treatment effect over time or to assess treatment effects at each time point and to test whether treatment interacts with time.
In this example, the treatment (Group), time (Visit), treatment by time interaction (Group:Visit) and baseline (predep) effects can all be fitted as fixed. The subject effects are fitted as random, allowing for constant correlation between depression scores taken on the same subject over time.
The code and output from fitting this model in ASReml-R 4 follows:
The output from summary() shows that the estimate of subject and residual variance from the model are 15.10 and 11.53, respectively, giving a total variance of 15.10 + 11.53 = 26.63. The Wald test (from the wald.asreml() table) for predep, Group and Visit are significant (probability level (Pr) ≤ 0.01). There appears to be no relationship between treatment group and time (Group:Visit) i.e. the probability level is greater than 0.05 (Pr = 0.8636).
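A quick arithmetic check of what these components imply: under the random effects model, the correlation between any two measurements on the same subject is the subject variance expressed as a proportion of the total (plain Python, using the estimates reported above):

```python
# Variance components reported from the ASReml-R summary() output
s2_subject, s2_residual = 15.10, 11.53

total = s2_subject + s2_residual   # total variance of an observation
icc = s2_subject / total           # constant within-subject correlation

print(round(total, 2))  # 26.63
print(round(icc, 2))    # 0.57
```

This implied correlation of 0.57 is exactly the constant correlation this model assumes between all pairs of visits on the same subject.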
In practice, often the correlation between observations on the same subject is not constant. It is common to expect that the covariances of measurements made closer together in time are more similar than those at more distant times. Mixed models can accommodate many different covariance patterns. The ideal usage is to select the pattern that best reflects the true covariance structure of the data. A typical strategy is to start with a simple pattern, such as compound symmetry or first-order autoregressive, and test if a more complex pattern leads to a significant improvement in the likelihood.
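The first-order autoregressive idea is easy to visualise: correlation decays as rho^|i-j| with the separation between visits, so measurements close in time are more alike. A small sketch (plain numpy, purely illustrative, not ASReml-R; rho = 0.6 is a made-up value):

```python
import numpy as np

def ar1_corr(n_times, rho):
    """n_times x n_times first-order autoregressive correlation matrix:
    entry (i, j) equals rho ** |i - j|."""
    idx = np.arange(n_times)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# 4 equally spaced visits, correlation 0.6 between adjacent visits:
print(ar1_corr(4, 0.6).round(3))  # first row: 1, 0.6, 0.36, 0.216
```

Compound symmetry is the special case where every off-diagonal entry is the same; comparing the likelihoods of nested patterns like these is the model-selection strategy described above.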
Note: using a covariance model with a simple correlation structure (i.e. uniform) will provide the same results as fitting a random effects model with random subject.
In ASReml-R 4 we use the corv() function on time (i.e. Visit) to specify uniform correlation between depression scores taken on the same subject over time.
Here, the estimate of the correlation among times (Visit) is 0.57 and the estimate of the residual variance is 26.63 (identical to the total variance of the random effects model, asr1).
Specifying a heterogeneous first-order autoregressive covariance structure is easily done in ASReml-R 4 by changing the variance-covariance function in the residual term from corv() to ar1h().
When the relationship of a measurement with time is of interest, a random coefficients model is often appropriate. In a random coefficients model, time is considered a continuous variable, and the subject and subject by time interaction (Subject:nVisit) are fitted as random effects. This allows the slopes and intercepts to vary randomly between subjects, resulting in a separate regression line to be fitted for each subject. However, importantly, the slopes and intercepts are correlated.
The str() function within the asreml() call is used to fit a random coefficients model:
The summary table contains the variance parameter for Subject (the set of intercepts, 23.24) and Subject:nVisit (the set of slopes, 0.89), the estimate of correlation between the slopes and intercepts (-0.57) and the estimate of residual variance (8.38).
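As a quick check of what these estimates imply (plain Python arithmetic, using the reported values): under a random coefficients model, the variance of an observation changes with time t as var(y_t) = s2_int + 2t*cov(int, slope) + t^2*s2_slope + s2_resid, so, unlike the earlier models, the variance is not constant over visits.

```python
import math

# Estimates reported in the summary table above: intercept variance,
# slope variance, intercept-slope correlation, residual variance.
s2_int, s2_slope, r, s2_resid = 23.24, 0.89, -0.57, 8.38
cov_is = r * math.sqrt(s2_int * s2_slope)   # intercept-slope covariance

def var_at(t):
    """Implied variance of an observation at (continuous) time t."""
    return s2_int + 2 * t * cov_is + t ** 2 * s2_slope + s2_resid

for t in (1, 3, 6):
    print(f"visit {t}: implied variance {var_at(t):.1f}")
```

The negative intercept-slope correlation pulls the variance down at early visits before the slope variance term dominates at later ones, which is a pattern the simpler constant-correlation models cannot represent.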
Brady T. West, Kathleen B. Welch and Andrzej T. Galecki (2007). Linear Mixed Models: A Practical Guide Using Statistical Software. Chapman & Hall/CRC, Taylor & Francis Group, LLC.
Brown, H. and R. Prescott (2015). Applied Mixed Models in Medicine. Third Edition. John Wiley & Sons Ltd, England.
Sabine Landau and Brian S. Everitt (2004). A Handbook of Statistical Analyses using SPSS. Chapman & Hall/CRC Press LLC.