Dr. Salvador A. Gezan2 months ago
I have been a user of ASReml-R for more than 10 years. I started in the pre-genomics era and I recall fitting an animal model during my PhD with more than half a million records in my pedigree file without any issues. This, back then, took a few minutes, and I was always excited to see a large file with all the breeding value estimates. Nowadays, this takes even less time, given that computers have more memory and computing power.
Now with the genomics era here, I have been moving from this pedigree-based animal model to a genomic-based animal model. The theory is simple: just replace your numerator relationship matrix A with a genomic-based relationship matrix G and you are now doing what is known as GBLUP. So, the process seems simple and straightforward, but it is not!
My first surprise was the level of complexity associated with the generation of the G matrix. First, I have to deal with many tricks on pre-processing my molecular data (often SNPs) in order to calculate my G matrix. But even after I obtain it, I still need to use more tricks (such as blending, or bending) in order to tune-up my G matrix to be invertible and well-conditioned.
However, the biggest shock comes once I use this ‘tuned-up’ G matrix in my GBLUP analysis using the library ASReml-R v. 4.1. Typically my analyses consider only two to four thousand genotyped individuals (a much smaller matrix than my old pedigree-based A matrix with half a million genotypes). But to my surprise, the analysis often takes hours, or does not work because of insufficient RAM memory. What is going wrong?
In order to understand this, let’s first see why the old A matrix worked so well. ASReml-R (and also ASReml-SA) generates the inverse of the A matrix in sparse form. This means that it is stored (and used) as a half-diagonal matrix, and only the non-zero values are kept and used for calculations. Indeed, ASReml uses sparse matrix methods for most of the calculations. This not only reduces the amount of memory needed for storage but also saves computing time. Interestingly, the A matrix I was fitting in the past was generated from only three generations where many of the individuals are unrelated, and therefore, I had a very empty (or sparse) matrix of additive relationships. This means that even though my matrix was very large, in reality it was full of zeros. I estimate that only 5-10% of the values were non-zero. The limited amount of non-zero data makes the analysis very efficient.
Now with the G matrix, even if only a 2,000 x 2,000 matrix is considered (and stored as half-diagonal), it will be full of non-zeros values (probably all 4 million of them!). And typically, its inverse is also full of non-zero values. This is expected from a genomic matrix as any, and all, genotypes are related to every other genotype (even if only with a very low level of relatedness). As you would expect, this matrix is much denser than my old A matrices, thus its analysis requires a lot more memory and it can’t make use of the sparse matrix methodology within ASReml, leading to the problems I mentioned above.
In addition, if I am using the R environment, then I have extra complications. The library ASReml-R calls the data frame objects (phenotypic and matrices) and generates as output potentially large objects (e.g., solutions). This implies that R needs to keep ALL the pre- and post-analysis objects in memory. As you can imagine this can use all of my laptop’s RAM memory and leaves me with little space for statistical analyses. In contrast, ASReml-SA reads the phenotypic data and generates the design matrices, but it does not keep the original data. ASReml-SA also writes the output into files as they are generated (for example the BLUEs and BLUPs); hence, memory is used in a more efficient way.
Not all is lost! There are several tricks to fit these models and get your results. Below I am going to list (and explain) some of the options that I have used/discovered to make my GBLUP analyses run with good success. This is not an exhaustive list of tricks, but hopefully they will give you a starting, and ideally finishing, point.
This is a reasonable option, and there are several software packages (and R libraries) that will likely handle specific analyses more efficiently. However, you have to be sure your complex linear mixed models are considered. For example, some routines do not allow for several fixed or random effects, or they are limited in the error structure specification (e.g., it may not be possible to fit a heterogeneous or bivariate error). In my case, this is rarely a viable option as the type of data that I normally work with requires advanced linear mixed modelling.
If your objective is to fit genomic prediction models, then GBLUP is one of many good options. Alternatively, you may be able to use a Bayesian approach (e.g., Bayes B) or a multiple regression algorithm by fitting ridge-regression or LASSO models. In all of these cases, you will have to rely on a different set of statistical assumptions. This option has worked for me, but on a limited based, as it is only possible with relatively simple models.
You can take advantage of the power and flexibility of ASReml within the stand-alone application, ASReml-SA. This executable software is more efficient, as indicated above, in terms of memory and size of the problem. There are some differences in notation and coding compared to ASReml-R but, as expected, the logic is very similar. I have used this strategy before with good success, and you might also notice that the latest ASReml-SA version 4.2 has several enhancements over the older 4.1 version.
Personally, I like the flexibility and control that the R environment provides, so much so that often I do not want to move away from R. But I have discovered some tricks that help me. Most of these rely on removing objects from memory that are no longer needed. For example, the G matrix is not required once you obtain the G inverse with the ASReml-R command ainverse(). Also, any intermediate matrix, or data frame can be eliminated. In addition, manipulations of large datasets can be more easily and efficiently done with designed for purpose libraries such as data.table. However, over the last few years newer versions of R have improved memory management, and I suspect this will continue as large datasets become the norm.
On a few occasions I have had analyses that are still too large for my computer, or that I need to complete quickly due to a tight deadline. Here, I have used cloud computing through AWS (but Microsoft Azure is another option). It is possible to install RStudio and ASReml-R on almost unthinkable machine configurations. If you have the budget for these services this is a good option. However, some effort is needed to set up these systems. Check this link for some good suggestion on how this can be minimized https://www.louisaslett.com/RStudio_AMI/.
ASReml-R (and also ASReml-SA) has several options that are useful when fitting any linear mixed model. To speed up convergency, probably most important is to provide some good initial variance components - ones that will be close to the final estimated values. This has been especially useful when comparing several models. Additionally, you can also move some of your fixed effects to the sparse portion (by using sparse), which will reduce the computing load. Finally, if your model is too complex (say MET with several spatial components), then a good alternative is to use a two-stage analyses where in the first step you fit each environment separately, and in the second step the predictions for each environment (and weights) are used for the GBLUP analyses. This procedure might have some loss of information, but the procedure is well established and it is more likely to converge!
When fitting the linear mixed model, the only genotypes that contribute to estimating the variance components or effects are those that have phenotypic responses. This implies that trimming down your G matrix to only those genotypes that have records in your phenotypic data will potentially reduce the size of the G matrix considerably. This has been my approach, particularly in cases where there is lots of historical genotyping that is not relevant for the given analysis of interest.
This has been my main ‘statistical’ strategy when I’ve only used the portion of the G matrix that has phenotypic observations to fit the linear mixed model (as discussed above) but I still need genomic predictions for the other genotypes. Here, after your model is fitted you use the estimated genomic breeding values (GEBVs) together with the full G matrix to obtain what is called conditional predictions. More details are provided under the heading Conditional Prediction at the end of this blog. This is also a useful strategy any time you need predictions for future genotyped individuals as it does not require refitting the original genomic prediction model.
As you can see, there are several options to speed-up your analysis, but the “best” strategy will depend on the type of problem. Often, I have dealt with a small number of individuals with both genotypes and phenotypes, so using a portion of the G matrix has been a good trick; however, I expect that in the next few years the size of my G matrix will grow considerably.
Nevertheless, there is progress in other areas too. Current developments within VSNi in relation to the core ASReml algorithms (the computational engine underlying the ASReml family of products, including ASReml-R) are focused on making this type of ‘dense’ model much easier and faster to fit. Some of the current work includes:
Some of these features are already in the latest ASReml-SA version 4.2, but in the next few versions of ASReml-R I expect to see these and many more relevant computational improvements. In the near future, I am hoping to perform my GBLUP analyses with tens of thousands of individuals without any memory problem and at a reasonable speed. And, in the case of the R environment, I am also expecting progress in those aspects too.
The last thought from me is related to future scientific research and development. There is need to perform additional statistical, numerical or even computational research in the area of GBLUP. I am sure, there will be good ‘sparse’ approximation algorithms to be used in place of the G matrix, or some clever matrix algebra that will produce better manipulations of our mixed model equations. We should all keep an eye out for new developments to see what is available and what is new to make these analyses more operationally manageable.
Let’s consider a matrix that is partitioned in the following way:
Here, the (a matrix of dimension x ) corresponds to the portion of the genomic matrix for those individuals (group 1) for which we are interested in estimating GEBVs. Similarly, (a matrix of dimension x ) contains the portion of the genomic matrix with those individuals (group 2) for which we already have GEBVs (these estimates are found in , a vector of length ). They were obtained from fitting a GBLUP for individuals with phenotypic records, but this can also be from any set of individuals with both genomic information and estimated GEBVs.
To obtain the ‘conditional’ estimates of the GEBVs for the individuals in group 1, which we will identify as the vector of length , we use the following expression of conditional prediction based on a multivariate normal distribution:
This expression requires the inverse of . Note, that in strict statistical rigor this estimation should be identified as |, corresponding to the estimation of given (or conditional on) .
If the variance-covariance matrix of the estimations is available, i.e. , then it is possible to estimate the variance covariance of using the expression:
This can be used to calculate prediction error variances (PEVs) or accuracies.
Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world.
Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook “Statistical Methods in Biology: Design and Analysis of Experiments and Regression”.
The VSNi Team7 months ago
Outliers are sample observations that are either much larger or much smaller than the other observations in a dataset. Outliers can skew your dataset, so how should you deal with them?
Imagine Jane, the general manager of a chain of computer stores, has asked a statistician, Vanessa, to assist her with the analysis of data on the daily sales at the stores she manages. Vanessa takes a look at the data, and produces a boxplot for each of the stores as shown below.
Vanessa pointed out to Jane the presence of outliers in the data from Store 2 on days 10 and 22. Vanessa recommended that Jane checks the accuracy of the data. Are the outliers due to recording or measurement error? If the outliers can’t be attributed to errors in the data, Jane should investigate what might have caused the increased sales on these two particular days. Always investigate outliers - this will help you better understand the data, how it was generated and how to analyse it.
Vanessa explained to Jane that we should never drop a data value just because it is an outlier. The nature of the outlier should be investigated before deciding what to do.
Whenever there are outliers in the data, we should look for possible causes of error in the data. If you find an error but cannot recover the correct data value, then you should replace the incorrect data value with a missing value.
However, outliers can also be real observations, and sometimes these are the most interesting ones! If your outlier can’t be attributed to an error, you shouldn’t remove it from the dataset. Removing data values unnecessarily, just because they are outliers, introduces bias and may lead you to draw the wrong conclusions from your study.
In our example, Vanessa suggested that since the mean for Store 2 is highly influenced by the outliers, the median, another measure of central tendency, seems more appropriate for summarizing the daily sales at each store. Using the statistical software Genstat, Vanessa can easily calculate both the mean and median number of sales per store for Jane.
Vanessa also analyses the data assuming the daily sales have Poisson distributions, by fitting a log-linear model.
Notice that Vanessa has included “Day” as a blocking factor in the model to allow for variability due to temporal effects.
From this analysis, Vanessa and Jane conclude that the means (of the Poisson distributions) differ between the stores (p-value < 0.001). Store 3, on average, has the most computer sales per day, whereas Stores 1 and 4, on average, have the least.
There are other statistical approaches Vanessa might have used to analyse Jane’s sales data, including a one-way ANOVA blocked by Day on the log-transformed sales data and Friedman’s non-parametric ANOVA. Both approaches are available in Genstat’s comprehensive menu system.
There are many ways to deal with outliers, but no single method will work in every situation. As we have learnt, we can remove an observation if we have evidence it is an error. But, if that is not the case, we can always use alternative summary statistics, or even different statistical approaches, that accommodate them.
Dr. John Rogers9 months ago
Earlier this year I had an enquiry from Carey Langley of VSNi as to why I had not renewed my Genstat licence. The truth was simple – I have decided to fully retire after 50 years as an agricultural entomologist / applied biologist / consultant. This prompted some reflections about the evolution of bioscience data analysis that I have experienced over that half century, a period during which most of my focus was the interaction between insects and their plant hosts; both how insect feeding impacts on plant growth and crop yield, and how plants impact on the development of the insects that feed on them and on their natural enemies.
My journey into bioscience data analysis started with undergraduate courses in biometry – yes, it was an agriculture faculty, so it was biometry not statistics. We started doing statistical analyses using full keyboard Monroe calculators (for those of you who don’t know what I am talking about, you can find them here). It was a simpler time and as undergraduates we thought it was hugely funny to divide 1 by 0 until the blue smoke came out…
After leaving university in the early 1970s, I started working for the Agriculture Department of an Australian state government, at a small country research station. Statistical analysis was rudimentary to say the least. If you were motivated, there was always the option of running analyses yourself by hand, given the appearance of the first scientific calculators in the early 1970s. If you wanted a formal statistical analysis of your data, you would mail off a paper copy of the raw data to Biometry Branch… and wait. Some months later, you would get back your ANOVA, regression, or whatever the biometrician thought appropriate to do, on paper with some indication of what treatments were different from what other treatments. Dose-mortality data was dealt with by manually plotting data onto probit paper.
In-house ANOVA programs running on central mainframes were a step forward some years later as it at least enabled us to run our own analyses, as long as you wanted to do an ANOVA…. However, it also required a 2 hours’ drive to the nearest card reader, with the actual computer a further 1000 kilometres away.… The first desktop computer I used for statistical analysis was in the early 1980s and was a CP/M machine with two 8-inch floppy discs with, I think, 256k of memory, and booting it required turning a key and pressing the blue button - yes, really! And about the same time, the local agricultural economist drove us crazy extolling the virtues of a program called Lotus 1-2-3!
Having been brought up on a solid diet of the classic texts such as Steele and Torrie, Cochran and Cox and Sokal and Rohlf, the primary frustration during this period was not having ready access to the statistical analyses you knew were appropriate for your data. Typical modes of operating for agricultural scientists in that era were randomised blocks of various degrees of complexity, thus the emphasis on ANOVA in the software that was available in-house. Those of us who also had less-structured ecological data were less well catered for.
My first access to a comprehensive statistics package was during the early to mid-1980s at one of the American Land Grant universities. It was a revelation to be able to run virtually whatever statistical test deemed necessary. Access to non-linear regression was a definite plus, given the non-linear nature of many biological responses. As well, being able to run a series of models to test specific hypotheses opened up new options for more elegant and insightful analyses. Looking back from 2021, such things look very trivial, but compared to where we came from in the 1970s, they were significant steps forward.
My first exposure to Genstat, VSNi’s stalwart statistical software package, was Genstat for Windows, Third Edition (1997). Simple things like the availability of residual plots made a difference for us entomologists, given that much of our data had non-normal errors; it took the guesswork out of whether and what transformations to use. The availability of regressions with grouped data also opened some previously closed doors.
After a deviation away from hands-on research, I came back to biological-data analysis in the mid-2000s and found myself working with repeated-measures and survival / mortality data, so ventured into repeated-measures restricted maximum likelihood analyses and generalised linear mixed models for the first time (with assistance from a couple of Roger Payne’s training courses in Hobart and Queenstown). Looking back, it is interesting how quickly I became blasé about such computationally intensive analyses that would run in seconds on my laptop or desktop, forgetting that I was doing ANOVAs by hand 40 years earlier when John Nelder was developing generalised linear models. How the world has changed!
Of importance to my Genstat experience was the level of support that was available to me as a Genstat licensee. Over the last 15 years or so, as I attempted some of these more complex analyses, my aspirations were somewhat ahead of my abilities, and it was always reassuring to know that Genstat Support was only ever an email away. A couple of examples will flesh this out.
Back in 2008, I was working on the relationship between insect-pest density and crop yield using R2LINES, but had extra linear X’s related to plant vigour in addition to the measure of pest infestation. A support-enquiry email produced an overnight response from Roger Payne that basically said, “Try this”. While I slept, Roger had written an extension to R2LINES to incorporate extra linear X’s. This was later incorporated into the regular releases of Genstat. This work led to the clearer specification of the pest densities that warranted chemical control in soybeans and dry beans (https://doi.org/10.1016/j.cropro.2009.08.016 and https://doi.org/10.1016/j.cropro.2009.08.015).
More recently, I was attempting to disentangle the effects on caterpillar mortality of the two Cry insecticidal proteins in transgenic cotton and, while I got close, I would not have got the analysis to run properly without Roger’s support. The data was scant in the bottom half of the overall dose-response curves for both Cry proteins, but it was possible to fit asymptotic exponentials that modelled the upper half of each curve. The final double-exponential response surface I fitted with Roger’s assistance showed clearly that the dose-mortality response was stronger for one of the Cry proteins than the other, and that there was no synergistic action between the two proteins (https://doi.org/10.1016/j.cropro.2015.10.013)
One thing that I especially appreciate about having access to a comprehensive statistics package such as Genstat is having the capacity to tease apart biological data to get at the underlying relationships. About 10 years ago, I was asked to look at some data on the impact of cold stress on the expression of the Cry2Ab insecticidal protein in transgenic cotton. The data set was seemingly simple - two years of pot-trial data where groups of pots were either left out overnight or protected from low overnight temperatures by being moved into a glasshouse, plus temperature data and Cry2Ab protein levels. A REML analysis, and some correlations and regressions enabled me to show that cold overnight temperatures did reduce Cry2Ab protein levels, that the effects occurred for up to 6 days after the cold period and that the threshold for these effects was approximately 14 Cº (https://doi.org/10.1603/EC09369). What I took from this piece of work is how powerful a comprehensive statistics package can be in teasing apart important biological insights from what was seemingly very simple data. Note that I did not use any statistics that were cutting edge, just a combination of REML, correlation and regression analyses, but used these techniques to guide the dissection of the relationships in the data to end up with an elegant and insightful outcome.
Looking back over 50 years of work, one thing stands out for me: the huge advances that have occurred in the statistical analysis of biological data has allowed much more insightful statistical analyses that has, in turn, allowed biological scientists to more elegantly pull apart the interactions between insects and their plant hosts.
For me, Genstat has played a pivotal role in that process. I shall miss it.
Dr John Rogers
Research Connections and Consulting
St Lucia, Queensland 4067, Australia
Phone/Fax: +61 (0)7 3720 9065
Mobile: 0409 200 701
Alternate email: D.John.Rogers@gmail.com
Kanchana Punyawaew9 months ago
This blog illustrates how to analyze data from a field experiment with a balanced lattice square design using linear mixed models. We’ll consider two models: the balanced lattice square model and a spatial model.
The example data are from a field experiment conducted at Slate Hall Farm, UK, in 1976 (Gilmour et al., 1995). The experiment was set up to compare the performance of 25 varieties of barley and was designed as a balanced lattice square with six replicates laid out in a 10 x 15 rectangular grid. Each replicate contained exactly one plot for every variety. The variety grown in each plot, and the coding of the replicates and lattice blocks, is shown in the field layout below:
There are seven columns in the data frame: five blocking factors (Rep, RowRep, ColRep, Row, Column), one treatment factor, Variety, and the response variate, yield.
The six replicates are numbered from 1 to 6 (Rep). The lattice block numbering is coded within replicates. That is, within each replicates the lattice rows (RowRep) and lattice columns (ColRep) are both numbered from 1 to 5. The Row and Column factors define the row and column positions within the field (rather than within each replicate).
To analyze the response variable, yield, we need to identify the two basic components of the experiment: the treatment structure and the blocking (or design) structure. The treatment structure consists of the set of treatments, or treatment combinations, selected to study or to compare. In our example, there is one treatment factor with 25 levels, Variety (i.e. the 25 different varieties of barley). The blocking structure of replicates (Rep), lattice rows within replicates (Rep:RowRep), and lattice columns within replicates (Rep:ColRep) reflects the balanced lattice square design. In a mixed model analysis, the treatment factors are (usually) fitted as fixed effects and the blocking factors as random.
The balanced lattice square model is fitted in ASReml-R4 using the following code:
> lattice.asr <- asreml(fixed = yield ~ Variety, random = ~ Rep + Rep:RowRep + Rep:ColRep, data=data1)
The REML log-likelihood is -707.786.
The model’s BIC is:
The estimated variance components are:
The table above contains the estimated variance components for all terms in the random model. The variance component measures the inherent variability of the term, over and above the variability of the sub-units of which it is composed. The variance components for Rep, Rep:RowRep and Rep:ColRep are estimated as 4263, 15596, and 14813, respectively. As is typical, the largest unit (replicate) is more variable than its sub-units (lattice rows and columns within replicates). The "units!R" component is the residual variance.
By default, fixed effects in ASReml-R4 are tested using sequential Wald tests:
In this example, there are two terms in the summary table: the overall mean, (Intercept), and Variety. As the tests are sequential, the effect of the Variety is assessed by calculating the change in sums of squares between the two models (Intercept)+Variety and (Intercept). The p-value (Pr(Chisq)) of < 2.2 x 10-16 indicates that Variety is a highly significant.
The predicted means for the Variety can be obtained using the predict() function. The standard error of the difference between any pair of variety means is 62. Note: all variety means have the same standard error as the design is balanced.
Note: the same analysis is obtained when the random model is redefined as replicates (Rep), rows within replicates (Rep:Row) and columns within replicates (Rep:Column).
As the plots are laid out in a grid, the data can also be analyzed using a spatial model. We’ll illustrate spatial analysis by fitting a model with a separable first order autoregressive process in the field row (Row) and field column (Column) directions. This is often a useful model to start the spatial modeling process.
The separable first order autoregressive spatial model is fitted in ASReml-R4 using the following code:
> spatial.asr <- asreml(fixed = yield ~ Variety, residual = ~ar1(Row):ar1(Column), data = data1)
The BIC for this spatial model is:
The estimated variance components and sequential Wald tests are:
The residual variance is 38713, the estimated row correlation is 0.458, and the estimated column correlation is 0.684. As for the balanced lattice square model, there is strong evidence of a Variety effect (p-value < 2.2 x 10-16).
A log-likelihood ratio test cannot be used to compare the balanced lattice square model with the spatial models, as the variance models are not nested. However, the two models can be compared using BIC. As the spatial model has a smaller BIC (1415) than the balanced lattice square model (1435), of the two models explored in this blog, it is chosen as the preferred model. However, selecting the optimal spatial model can be difficult. The current spatial model can be extended by including measurement error (or nugget effect) or revised by selecting a different variance model for the spatial effects.
Butler, D.G., Cullis, B.R., Gilmour, A. R., Gogel, B.G. and Thompson, R. (2017). ASReml-R Reference Manual Version 4. VSN International Ltd, Hemel Hempstead, HP2 4TP UK.
Gilmour, A. R., Anderson, R. D. and Rae, A. L. (1995). The analysis of binomial data by a generalised linear mixed model, Biometrika 72: 593-599..