Dr. Salvador A. Gezan, 05 July 2022
It is common for breeding analyses to have a phenotypic response that is not Normally distributed. For example, for each genotype we may record survival, or a count of flowers or eggs produced, and these typically follow a Binomial or Poisson distribution, respectively. The issue is then: how do we deal with the quantitative genetics analyses of binary or count data, given that our linear mixed model (LMM) framework is based on Normality assumptions and these responses are clearly non-Normal?
Most statistical software, including ASReml and Genstat, deals with LMMs for Normal data in a very efficient and flexible way, providing strong statistical inference. Under an LMM, ASReml and Genstat can deal with heterogeneous errors, correlated structures (in time and space) and even correlated random effects (as when we use a relationship matrix). However, under non-Normal responses this becomes increasingly difficult. An alternative is to fit a Generalised Linear Mixed Model (GLMM) or a Hierarchical Generalised Linear Model (HGLM). These model types consider the underlying probability distribution and allow us to estimate variance components for relevant random effects. GLMMs are implemented in ASReml-SA and ASReml-R with an array of available distributions and link functions. GLMMs can incorporate pedigree, some level of imbalance, and complex variance structures. However, they are not free of complications; for example, they might not converge when complex correlated structures are needed, such as for multi-environment trial and multi-trait analyses, as they do not rely on a multivariate Normal distribution.
Given the complexity of this data, what are our analytical options? In this blog we will talk about binary data and provide you with a few options for analysing it within the context of an operational breeding program. Some of the options are more practical (and often approximate). However, they can provide the information you need, with reasonable accuracy, to select outstanding genotypes.
For the options below, we will consider that we have a dataset in which each of the n individuals was recorded for a binary response: a 0/1 outcome for a given trait of interest. For example, whether they show resistance (1 for success) or not (0 for failure) to a given viral infection. Note that the dataset can have additional factors and covariates, but it will have a total of n observations.
There are several options to deal with this type of data to get exact or approximated results. Below we are going to list (and explain) some of the options we have found useful, with pros and cons. As you might expect, this is not exhaustive, but these options will give you a starting point at the very least.
1. Use Generalized Linear Mixed Models (GLMMs)
GLMMs are an appropriate statistical tool for this type of data. As indicated above, they consider the underlying distribution and allow you to estimate the variance components of interest. However, one aspect to consider is that these models are more prone to convergence issues, and there is a high chance of getting stuck at a local optimum, as the algorithms optimize a more complex likelihood function than REML does for LMMs. This problem is reduced if you use simpler models with little or no imbalance, but they still require careful checking (including whether they make sense biologically).
Also, with GLMMs there are additional elements to deal with, such as which link function to use and how to manage overdispersion. You also have to consider that most of their statistical inference is based on asymptotic assumptions, so hypothesis testing and confidence intervals assume infinite degrees of freedom. GLMM parameter estimates can have some bias, particularly if your counts are small. In addition, because you are dealing with a Binomial distribution, you will not be able to use the same definition of heritability (h2 = VA/VP) as we use under Normally distributed data, and there is no explicit error variance!
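There are, however, common approximations. One convention from the GLMM literature (an aside here, not part of the original text) is to report heritability on the underlying latent (link) scale, where the link function contributes a fixed implicit residual variance:

```latex
% Latent-scale heritability for a binary GLMM (a common convention,
% not the document's own formula). The link function contributes an
% implicit residual variance: pi^2/3 for the logit link, 1 for probit.
h^2_{\text{latent}} = \frac{V_A}{V_A + V_{\text{other}} + \pi^2/3}
```

Here VA is the additive variance and V_other collects the remaining random-effect variances; the result is a heritability on the liability scale, not on the observed 0/1 scale.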
Nevertheless, GLMMs, if they fit properly, will weight the observations appropriately, provide predictions within the natural range of the distribution and, depending on the link used, have some reasonable underlying biological assumptions (as with a Probit link function). Hence, they represent the natural choice for ‘good’ data and ‘simple’ models, and they are widely accepted in the community.
A close alternative to GLMMs is the use of HGLMs, which also allow for the incorporation of random effects, with the increased flexibility that the random terms can have several distributions, not only the Normal distribution as with GLMMs. In addition, this class of models tends to present less bias in parameter estimates. We do not discuss HGLMs further as they have other limitations and are currently not implemented in ASReml.
2. Ignore completely the issue of non-Normality
A different option is to ignore that you have binary data and fit an LMM using REML, assuming a Normal distribution for your residuals. Of course, this will create issues. You will see that your residual plots look very strange (two groups of points) and inferences will be unreliable. You will also likely have increased bias in the variance component estimates. However, if you use large quantities of data, these problems are less likely to arise. This is because the distribution of your random effects is still Normal, and therefore, for example, parental effects are estimated based on some form of ‘average’ over all offspring and relatives. If you have a moderate to large sample size, this average, under the central limit theorem (CLT), will be well approximated by a Normal distribution.
There are several disadvantages to this option. The main one is that there is no boundary associated with the natural range of the response due to the assumption of Normality. For example, you might obtain a prediction for a given family outside the range of 0 to 1 (for example, 1.03). This does not make sense and makes interpretation difficult (like saying that 103% of the individuals in this family will be susceptible!). Other disadvantages are associated with unreliable statistical inference in terms of p-values and confidence intervals.
But there are advantages to using this approach. The main one is that you will be able to estimate a heritability using your traditional equation (h2 = VA/VP). In addition, this option allows you to fit more complex models, such as spatial models with correlated errors, and multi-trait models using a multivariate Normal distribution with correlated errors.
Our experience is that this approach often works well, and it is good to use in combination with the previous GLMM option. Comparing the predictions, or BLUP estimates (EBVs or genetic effects), from the GLMM and LMM models is good practice, and for us has resulted in high correlations (> 0.95) among the top (and bottom) individuals. Hence, the ‘ignore non-Normality’ approach seems reasonable, particularly for operational breeding decisions, with little loss of information.
3. Implement a transformation of your response variable
Historically, the main recommendation for dealing with non-Normal data has been to transform the response and then treat the new variable as approximately Normally distributed. This has been the case for Binomial, binary and Poisson data, with widely accepted transformations. For binary data aggregated into proportions we can use a logit transformation, or an arcsine-square-root transformation (also known as the angular transformation). Both will provide reasonable approximations to a Normal distribution. Most importantly, they ensure that your back-transformed predictions (means, etc.) are contained within the natural range of your original data (i.e., 0 to 1). In addition, the calculation of the heritability is possible using our traditional expression.
This approach is easy to implement and widely used. But its main disadvantage is that BLUPs and/or predictions are on the transformed scale and will need to be back-transformed. However, for standard errors (SEs) there is no direct back-transformation (note: one alternative is to back-transform confidence limits). In addition, you might find that, even after trying several transformations, your residuals still do not look as Normally distributed as you would like.
In our experience, this is a very good practical alternative for binary data. And, as indicated before, comparing it with the previous two model options (GLMM and LMM) will provide some confidence in its application and the resulting genotype selection. However, be aware that this approach is frowned upon in some academic circles!
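As an illustration, the angular transformation and its back-transformation are simple to compute. The sketch below (with made-up plot-level proportions, assuming the binary data have already been aggregated into proportions) shows the round trip:

```python
import math

# Hypothetical plot-level proportions of resistant plants (0-1 scale).
props = [0.10, 0.25, 0.40, 0.80]

# Angular (arcsine-square-root) transformation, in radians.
def angular(p):
    return math.asin(math.sqrt(p))

# Back-transformation to the original proportion scale.
def back_angular(y):
    return math.sin(y) ** 2

transformed = [angular(p) for p in props]

# A mean (or BLUP) estimated on the transformed scale must be
# back-transformed for reporting; note this is not the same as the
# arithmetic mean of the raw proportions.
mean_t = sum(transformed) / len(transformed)
print(round(back_angular(mean_t), 3))
```

The same pattern applies to a logit transformation, with the caveat that proportions of exactly 0 or 1 need an adjustment (e.g., an empirical logit) before transforming.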
4. Collapse data into a higher stratum
One good alternative is to simplify the data. This can be done by collapsing levels (or experimental strata) in your dataset. For example, if you have 30 plants per plot for which a binary response has been recorded, it is possible to move from the level of the individual plant to the level of the plot by taking the average of these 30 plants. This average is effectively a proportion bounded between 0 and 1.
The above approach has several advantages. Firstly, the complexity of the model is reduced, and hence it is more likely to converge properly. Secondly, given that the response is now an average, it will have an approximately Normal distribution, again based on the CLT, and this can be a good approximation for moderate sample sizes used to calculate these averages. In our example, a mean of 30 plants provides a reasonable approximation. Thirdly, and most importantly, with these averages we can more ‘safely’ rely on the assumption of Normally distributed residuals, and this opens the door to fitting complex LMMs (e.g., with temporal correlation, or multi-trait models).
There are disadvantages though, the main one being that our response variable represents a different stratum; hence, heritability must be interpreted at that stratum (e.g., heritability of the plot average, not of the individual plant). Another important statistical disadvantage is that the weight given to each observation differs from that under a GLMM, and hence the inference will change, with its quality depending on the quality of the approximation. Also, as the LMM analysis has no boundary associated with the response, predictions outside the natural range of a proportion are still possible, although less likely to occur. Lastly, some conditions for a good CLT approximation need to be observed; for example, when observations are at the extremes (all 0s or all 1s) the distribution of residuals will not be reasonable.
In our experience, this is a very good option that does not rely on ignoring the problem (option 2) or manipulating the response with a transformation (option 3). Instead, it relies purely on the beauty of the central limit theorem. If we have plot averages based on 50 plants, then we expect almost no difference from a GLMM with its appropriate distribution (you can check this, and you will see that p-values, predictions and SEs are very similar!).
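Collapsing plant-level 0/1 records into plot-level proportions is straightforward; the sketch below uses hypothetical plot identifiers and only the Python standard library:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical plant-level records: (plot_id, binary outcome 0/1).
plants = [
    ("plot1", 1), ("plot1", 0), ("plot1", 1),
    ("plot2", 0), ("plot2", 0), ("plot2", 1),
]

# Collapse to the plot stratum: gather the 0/1 outcomes per plot...
by_plot = defaultdict(list)
for plot, y in plants:
    by_plot[plot].append(y)

# ...and average them into one proportion per plot, bounded in [0, 1].
plot_props = {plot: mean(ys) for plot, ys in by_plot.items()}
print(plot_props)
```

These plot-level proportions would then be the response in the LMM, with heritability interpreted at the plot stratum as discussed above.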
5. Consider Bayesian approaches to fit GLMM
There are a few sophisticated Bayesian packages that can fit an array of our common genetic models; Stan, WinBUGS and the R libraries MCMCglmm and brms are good examples. Under flexible assumptions, Bayesian software will allow you to specify almost any distribution for the response variable and for the ‘random’ and ‘fixed’ effects included in the model. One important advantage of these models is that, in comparison with the other options, they tend to present less bias in the parameter estimates. Hence, Bayesian models constitute a good option for fitting simple and complex models to our binary response (and almost any other response).
The main disadvantage is the additional level of complexity in fitting these models. This includes, among other things, selecting and clearly defining a prior distribution for each model parameter, and determining the number of iterations, the length of the burn-in and the degree of thinning. There is also extra work in using different diagnostic tools to assess the quality of the fitted Bayesian model.
Our experience is that a Bayesian model constitutes a great alternative to the methods above if time and effort are invested in understanding the different software and the logic of a Bayesian versus a frequentist approach. These models are likely to be more stable, particularly under unbalanced data, and may allow you to fit complex models that are not possible under either LMM or GLMM. Comparison of these models with the more traditional approaches has resulted in high correlations in the breeding values (> 0.95). And something not to forget is that a good Bayesian model allows for more in-depth statistical inference; for example, we will have posterior distributions of heritability estimates!
In this blog we have presented five options to consider when dealing with binary data, and these can be evaluated empirically to decide which is best for your breeding program. However, there are some general aspects of a binary response associated with the amount of information this type of data contains. Binary responses contain only 0s and 1s. Thus, a binary variable is a less informative measure than a continuous variable (think about standard deviations, modes, etc.). Ideally, a binary trait should have a proportion of 1s or 0s close to 50%, as at this level the information is maximized. For example, if the proportion of successes is 2%, then there are very few data points that actually contribute to differentiating effects (say, parents) and to calculating variance components associated with a given factor. At such a low level of success you can almost think of the measurements on your response variable as random!
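This point can be made concrete with the Bernoulli variance p(1 - p), which is maximized at p = 0.5; the small sketch below is illustrative only:

```python
# Variance of a 0/1 (Bernoulli) response as a function of the
# probability of success p: var = p * (1 - p).
def bernoulli_var(p):
    return p * (1 - p)

# Variance, and hence the information available to separate genotypes,
# is maximal at p = 0.5 and nearly vanishes at p = 0.02.
for p in (0.02, 0.25, 0.50):
    print(p, bernoulli_var(p))
```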
Another aspect to consider is associated with the strata of the experiment, as mentioned above. It is recommended to plan the design of the experiment carefully, for example to ensure a reasonable number of individuals per plot. This is important if we intend to collapse the data, but it also helps to ensure we have sufficient information for the estimation of variance components at each stratum. If you plan for sufficient degrees of freedom per stratum, it is more likely that the model will fit successfully under either LMM or GLMM, and in addition provide reasonable accuracy for your effects and good statistical power for inference.
A more practical aspect is associated with disease or challenge experiments; for example, when a spore is used to infect plants. Recall that almost no information is obtained if ~0% or ~100% of the genotypes end up infected. This implies that the load of spores (or any other infection approach) has to be well calibrated so that it neither a) kills all plants, nor b) kills no plants at all. Hence, we recommend planning an experiment to aim at around 50% success. In addition, as with many other similar studies, it is critical to have positive and negative controls to assess the effectiveness of the application.
Finally, it is important that, if binary (or Binomial) data is going to be the focus of the analysis, the study should be planned using the simplest experimental design possible (for example, favour randomized complete blocks over incomplete block designs) and minimize the level of complexity in the design (i.e., avoid split-plot designs). Always be aware that for fitting GLMMs it is better to have simple and balanced experiments. And remember, binary responses are information-poor, so keep in mind that you might not get results as exciting as you might expect!
Dr. Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world.
Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook Statistical Methods in Biology: Design and Analysis of Experiments and Regression.
Dr. Andrew Illius and Dr. Nick Savill, 20 July 2022
Quantification holds the key to controlling disease
Andrew Illius and Nick Savill have, since 2018, studied the epidemiology and control of maedi-visna virus (MV) in sheep, looking at understanding and finding ways of controlling this incurable disease. Drawing on published data and using Genstat, they aimed to find ways of controlling MV.
When one of your sheep gets diagnosed with an incurable disease, you have to worry. How quickly will symptoms develop, and welfare and productivity suffer? And how soon will it spread throughout the rest of the flock? The disease in question is maedi-visna (MV, see Box 1), notorious for its impact in Iceland, where the disease was first described; extreme measures over 20 years were required before it was finally eliminated. Culling seropositive animals is the main means of control. For the farmer, the crucial question is whether living with the disease would be more expensive than trying to eradicate it. We are addressing such questions by analysing data from long-term experiments.
1 MV – the tip of an iceberg?
Putting aside for a moment MV’s fearsome reputation, the way the pathogen works is fascinating. The small ruminant lentiviruses (SRLV, family Retroviridae) are recognised as a heterogeneous group of viruses that infect sheep, goats and wild ruminants. Lentiviruses target the immune system, but SRLV does not target T-cells in the manner of immune deficiency lentiviruses such as HIV. Instead, SRLV infects monocytes (a type of white blood cell) which infiltrate the interstitial spaces of target organs (such as the lungs, mammary glands, or the synovial tissue of joints) carrying proviral DNA integrated into the host cell genome and hence invisible to the immune system. Virus replication commences following maturation of monocytes into macrophages, and the ensuing immune response eventually shows up as circulating antibodies (termed seroconversion). But it also causes inflammation that attracts further macrophages, slowly and progressively building into chronic inflammatory lesions and gross pathology. These take years to present clinical symptoms, hence the name lentivirus (from the Latin lentus, slow). By the time clinical signs become evident in a flock, the disease will have become well-established, with perhaps 30-70% of the flock infected. That is why MV is called one of the iceberg diseases of sheep – for every obviously affected individual, there are many others infected, but without apparent symptoms.
A large body of research into the pathology, immunology and molecular biology of small ruminant lentiviruses (SRLV) exists, as might be expected given its economic significance, welfare implications and its interest as a model for HIV. The main route of transmission of the virus is thought to be horizontal, via exhaled droplets of the highly infectious fluid from deep in the lungs of infected animals, suggesting a risk from prolonged close contact, for example in a sheep shed. But despite all the research into disease mechanisms, we were surprised to find that there has been almost no quantitative analysis of SRLV epidemiology, nor even an estimation of the rate of SRLV transmission under any management regime. So, our first foray into the data aimed to rectify this.
We found an experiment published in 1987 with excellent detail on a five-year timecourse of seroconversions in a small infected sheep flock, and a further trawl of the Internet netted a PhD thesis that built on this with a focussed experiment. Karianne Lievaart-Peterson, its author, runs the Dutch sheep health scheme and became a collaborator. We also worked with Tom McNeilly, an immunologist at the Moredun Research Institute.
Nick Savill, a mathematical epidemiologist at Edinburgh University, did the hard work of developing and parameterising a mathematical model based on infectious disease epidemiology and a priori known and unknown aspects of SRLV biology. The model determines the probability of a susceptible ewe seroconverting when it did, and of a susceptible ewe not seroconverting before it was removed from the flock or the experiment ended. The product of these probabilities gives the likelihood of the data given the model. The model was prototyped in Python and then written in C for speed.
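The published model itself is not reproduced here, but its likelihood structure can be sketched under a deliberately simplified assumption: a constant hazard of seroconversion per month at risk (an exponential model). All values and names below are hypothetical:

```python
import math

# Minimal sketch of the likelihood structure described above. Each
# susceptible ewe either seroconverts at an observed time, or is
# censored (removed from the flock, or the experiment ends) without
# seroconverting. `lam` is a constant monthly hazard: an illustrative
# simplification, not the model actually fitted.
def log_likelihood(lam, records):
    # records: iterable of (months_at_risk, seroconverted: bool)
    ll = 0.0
    for t, seroconverted in records:
        if seroconverted:
            # density of seroconverting at time t under constant hazard
            ll += math.log(lam) - lam * t
        else:
            # censored: probability of no seroconversion by time t
            ll += -lam * t
    return ll

# Four hypothetical ewes: two seroconverted, two censored.
records = [(6, True), (12, False), (3, True), (24, False)]

# Crude grid search for the maximum-likelihood hazard.
lam_hat = max((l / 1000 for l in range(1, 1000)),
              key=lambda l: log_likelihood(l, records))
print(round(lam_hat, 3))
```

For this exponential model the maximum-likelihood hazard is just events divided by total time at risk, which the grid search recovers; the real model replaces the constant hazard with parameterised housing periods, latency and other aspects of SRLV biology.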
The striking result of this research is that MV is a disease of housing. Even brief periods of housing allow the virus to spread rapidly, but transmission is negligible between sheep kept on pasture. So, although individual sheep never recover from the disease, it could be eliminated from flocks over time by exploiting the fact that transmission of the virus is too slow between grazing sheep to sustain the disease.
Our second striking result suggests the disease is unlikely to be spread by newly-infected animals, contrary to general expectation. We estimated that the time between an animal being infected and becoming infectious is about a year. This delay, termed epidemiological latency, is actually longer than the estimated time delay between infection and seroconversion.
We can now begin to see more clearly how disease processes occurring in the infected individual shape what happens at the flock, or epidemiological, level. It seems that, after a sheep becomes infected, the disease slowly progresses to the point when there is sufficient free virus to be recognised by the immune system, but then further development of inflammatory lesions in the lungs has to take place before there are sufficient quantities of infective alveolar macrophages and free virus for transmission by the respiratory route. There follows a further delay, perhaps of some years, before the disease has advanced to the stage at which symptoms such as chronic pneumonia and hardening of the udder become apparent.
Infectiousness is expected to be a function of viral load, and although we do not know the timecourse of viral load, it seems most likely that it continues to increase throughout the development of chronic disease. This suggests to us that the infectiousness of an individual is not constant, but is likely to increase as the disease progresses and symptoms emerge.
We are interested in learning how infectiousness changes over the course of an individual’s infection because of the implications at the epidemiological level. Time delays in seroconversion merely make the disease more difficult to detect and control, but the epidemiological significance of a time delay in the development of infectiousness is that it acts to slow the spread of the virus. And if ewes with long-standing infections are the most infectious, they pose the greatest risk to uninfected sheep. This would present an opportunity for the management of exposure to slow the spread of disease. For example, if ewes in their last year of production are the most infectious, then young ewes should be kept away from them when housed – an idea supported by preliminary analysis using individual-based modelling (IBM – see Box 2). Separation of younger animals from older ones may reduce the prevalence of infection to the point where the costs of disease, in terms of lost production and poor welfare, are not overwhelming or at least are less than the cost of attempting to eliminate the disease – we discuss this later in this blog.
So far, there is only very limited and tentative evidence of increasing infectiousness in the literature, and direct experimental evidence would be very hard to come by. But it is plausible that disease severity, viral load and impaired productivity are all related to the extent of inflammatory lesions in the lungs. This suggests that measurably-impaired productivity in infected sheep could be used as a proxy for viral load, and hence infectiousness. And that brings us to our current project.
2 Individual-based modelling
This is a technique to explore the consequences of probabilistic events, such as becoming infected by SRLV. The flock of ewes is modelled as individuals, and their progress through life is followed. Flock replacements are taken from the ewe lambs born to the flock; all other lambs are sold.
The figure shows the mean results (green line) of 1000 iterations of a stochastic simulation of SRLV prevalence in a flock of 400 ewes housed in groups of 100 for one month per year. The probability that an infected ewe will transmit the virus is modelled as rising exponentially with time since infection. The management regime we modelled was to segregate ewes during housing into each of their four age groups (2, 3, 4 and 5 years old) in separate pens, and to sell all the lambs of the oldest ewes, rather than retain any as flock replacements. From an initial prevalence of 275 infected ewes, the virus is virtually eliminated from the flock.
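To give a flavour of the technique (not the actual model, whose parameters and management rules are richer than shown here), a minimal individual-based sketch with per-infective transmission pressure rising with time since infection might look like this; all parameter values are hypothetical:

```python
import random

# Minimal individual-based sketch: transmission occurs only while ewes
# are housed, and an infected ewe's contribution to the force of
# infection grows with time since infection. Purely illustrative.
def simulate(n_ewes=100, n_infected=20, years=10,
             housed_months=1, base_p=0.002, growth=2.0, seed=1):
    rng = random.Random(seed)
    # Each ewe is time-since-infection in years, or None if susceptible.
    ewes = [0 if i < n_infected else None for i in range(n_ewes)]
    prevalence = []
    for _ in range(years):
        # Snapshot of infectives before housing: newly infected ewes do
        # not transmit within the same year (a crude stand-in for the
        # latency discussed in the text).
        infected = [t for t in ewes if t is not None]
        for _month in range(housed_months):
            pressure = sum(base_p * growth ** t for t in infected)
            p_infect = min(pressure, 1.0)
            ewes = [0 if t is None and rng.random() < p_infect else t
                    for t in ewes]
        # Age all infections by one year.
        ewes = [t + 1 if t is not None else None for t in ewes]
        prevalence.append(sum(t is not None for t in ewes) / n_ewes)
    return prevalence

print(simulate())
```

Without the age segregation and culling rules of the modelled regime, prevalence in this sketch only rises; adding those rules is what turns the curve downwards in the figure.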
Eliminating SRLV from an infected flock involves either repeated testing of the whole flock, culling reactors and perhaps also artificially rearing lambs removed at birth, or entirely replacing the flock with accredited disease-free stock. So, the cost of eliminating the virus from a flock can be huge. But what about the costs of living with it? These costs arise from poor welfare leading to lost production: lactation failure, reduced lamb growth and excess lamb and ewe mortality. But under what conditions are they so substantial as to warrant an elimination strategy? That depends, again, on answers at two levels: what are the production losses for each infected ewe, and how prevalent is the disease in the flock?
We have a number of reasons to want to quantify how the costs of being SRLV+ vary over the time-course of the disease. First, it is reasonable to assume that production losses will be related to the emergence of symptoms in affected sheep, but this has never been adequately quantified. Second, if production losses are a function of the duration of infection, and we can regard them as a proxy for viral load, then it would support the idea that infectiousness also increases as the disease progresses. And third, if production losses are only apparent in sheep with long-standing infections, which is therefore restricted to older sheep, then management could focus on early detection of symptoms and culling of older ewes.
We are quantifying these processes using a large dataset from colleagues at the Lublin University of Life Sciences. Their six-year study was designed to assess the response of production parameters to SRLV infection in a flock of breeding sheep kept under standard Polish husbandry conditions. They published results suggesting that infection with SRLV was associated with higher rates of age-specific ewe mortality, especially in older ewes.
The data comprise lambing records for about 800 ewes from three breeds, with over 300 ewes being present each year and a few being present every year. There are also records from about 2800 lambs born during the trial. Ewes were blood-tested in November and June each year, and all SRLV+ ewes were housed together following the November test until the lambs were weaned in April. SRLV- ewes were housed in the same shed, but segregated from the SRLV+ group. We were able to group the ewes on the basis of the series of blood test results as: (1) seroconverted before the trial began, (2) had not seroconverted by the end, and (3) seroconverted during the trial and for whom a time since seroconversion can be estimated to within about six months.
Given the nature of the data – unbalanced design, multiple observations from individual ewes and rams over several years, and different breeds – we used Genstat to fit mixed models to distinguish random and fixed effects. We were given access to Genstat release 22 beta, which adds greater functionality for displaying and saving output, producing predictions and visualising the fit of the model.
The example below addresses pre-weaning lamb mortality (mort, either 0 or 1). We are using a generalized linear mixed model where significant fixed terms were added stepwise. The ewes and rams used to produce these lambs are obvious random terms because they can be regarded as being drawn at random from a large population. There also appears to be a strong ewe.ram interaction, with some combinations faring differently from others. We included ‘year’ as a random term because, over the six years in which data were collected, factors such as flock size and forage quality varied somewhat randomly.
The fixed terms show that the probability of mortality is strongly affected by lamb birthweight (lambbirthwt). A quadratic term (lb2) models the well-known reduction in survival of very large lambs, a consequence of birth difficulties. The age of the ewe, fitted as a factor (eweageF), is the next most significant fixed effect, followed by the SRLV status of the ewe tested in November prior to lambing (ewetestNov). The interaction term of ewe age with SRLV status is highly significant, showing that the way the ageing process in ewes affects the probability of their lambs’ mortality differs according to SRLV status. From the table of back-transformed means, we see that the probability of lamb mortality ranges from about 0.02 to 0.04 in SRLV- ewes aged from 2 to 5 years, perhaps declining in older ewes. SRLV+ ewes show similar lamb mortality at ages 2-4, but a progressive increase as ewes age further, almost doubling each year.
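For readers unfamiliar with back-transformed means: for a binomial GLMM with a logit link, predictions on the link scale are mapped back to probabilities with the inverse logit (Genstat does this internally when it prints such tables). The logit-scale values below are made up for illustration, not the fitted ones:

```python
import math

# Inverse-logit back-transformation from the link scale to the
# probability scale, as used for tables of predicted means from a
# binomial GLMM. The eta values are hypothetical.
def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

for eta in (-3.9, -3.2, -2.5):
    print(round(inv_logit(eta), 3))
```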
This preliminary analysis provides some evidence that the costs of being infected by SRLV are, indeed, progressive with age. There is some way to go yet to show whether sheep with longer-standing SRLV infection have higher viral loads and are more infectious, but our current research does point to a way towards better disease control by targeting older animals. Maedi-visna isn’t called a progressive disease for nothing, and we should be able to exploit that.
We finally submitted our paper for publication in November 2019, just before the Covid-19 pandemic. One might have thought that a paper on the epidemiology and control of a respiratory virus spread by aerosol, especially indoors in close proximity and with a recommendation for social distancing, would have seemed quite topical. Ironically, early 2020 saw the subeditors and reviewers of such work all being urgently re-allocated to analysing information on the burgeoning pandemic. But we got there eventually and by October we were proudly writing a press release recommending social distancing … in sheep.
Andrew Illius writes, “My experience of Genstat dates back to the early 1980s when I think Release 3 was current. It was hard going, and we queued up to have our fault codes diagnosed at the Genstat Clinic. But having learnt Fortran programming in the punched cards era, I was used to it taking several days to get a job to run. Genstat’s exacting requirements were reassuring and it became indispensable over the following years of agricultural and ecological research. By the 1990s we had learnt that mixed models were required to account separately for random and fixed effects in unbalanced data, and I’d been on a REML course. I was especially proud to use REML as my main analytical procedure thereafter because Robin Thompson invented it just down the corridor where we work in the Zoology building at Edinburgh University, and where he worked with the Animal Breeding group. It’s been a tremendous pleasure to get back to Genstat recently after many years away – like greeting an old friend. In the past, I wrote and submitted batch jobs on a UNIX mainframe before collecting some line-printer output on my way home. Now things have really speeded up, with the menu-driven environment of the Windows version. It’s a fantastic improvement, and a pleasure to use.”
Andrew Illius is Emeritus Prof of Animal Ecology in the Institute of Evolutionary Biology, University of Edinburgh, where he taught animal production and animal ecology from 1978 to 2008 and was latterly Head of the School of Biological Sciences. Most of his work has been on the ecology and management of grazing systems and the ecophysiology and behaviour of grazing animals. He retired in 2008 to spend more time with his sheep, keeping about 400 breeding ewes. Familiarity with sheep diseases led to collaboration with Nick Savill since 2018 on the epidemiology and control of MV.
Nick Savill is a Senior Lecturer at the Institute of Immunology and Infection Research, University of Edinburgh. He teaches a range of quantitative skills to undergraduate biological science students including maths, stats, data analysis and coding. His research interests are in mathematical modelling of infectious disease epidemiology. He has worked on foot and mouth disease, avian influenza, malaria, trypanosomiasis and, most recently, maedi-visna with Andrew Illius.
Illius AW, Lievaart-Peterson K, McNeilly TN, Savill NJ (2020) Epidemiology and control of maedi-visna virus: Curing the flock. PLoS ONE 15 (9): e0238781. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238781
Kanchana Punyawaew, 01 March 2021
Linear mixed models: a balanced lattice square
This blog illustrates how to analyze data from a field experiment with a balanced lattice square design using linear mixed models. We’ll consider two models: the balanced lattice square model and a spatial model.
The example data are from a field experiment conducted at Slate Hall Farm, UK, in 1976 (Gilmour et al., 1995). The experiment was set up to compare the performance of 25 varieties of barley and was designed as a balanced lattice square with six replicates laid out in a 10 x 15 rectangular grid. Each replicate contained exactly one plot for every variety. The variety grown in each plot, and the coding of the replicates and lattice blocks, is shown in the field layout below:
There are seven columns in the data frame: five blocking factors (Rep, RowRep, ColRep, Row, Column), one treatment factor, Variety, and the response variate, yield.
The six replicates are numbered from 1 to 6 (Rep). The lattice block numbering is coded within replicates. That is, within each replicate the lattice rows (RowRep) and lattice columns (ColRep) are both numbered from 1 to 5. The Row and Column factors define the row and column positions within the field (rather than within each replicate).
To analyze the response variable, yield, we need to identify the two basic components of the experiment: the treatment structure and the blocking (or design) structure. The treatment structure consists of the set of treatments, or treatment combinations, selected to study or to compare. In our example, there is one treatment factor with 25 levels, Variety (i.e. the 25 different varieties of barley). The blocking structure of replicates (Rep), lattice rows within replicates (Rep:RowRep), and lattice columns within replicates (Rep:ColRep) reflects the balanced lattice square design. In a mixed model analysis, the treatment factors are (usually) fitted as fixed effects and the blocking factors as random.
The balanced lattice square model is fitted in ASReml-R4 using the following code:
> lattice.asr <- asreml(fixed = yield ~ Variety, random = ~ Rep + Rep:RowRep + Rep:ColRep, data=data1)
The REML log-likelihood of this model is -707.786 and its BIC is 1435.
The estimated variance components for the terms in the random model are 4263 for Rep, 15596 for Rep:RowRep and 14813 for Rep:ColRep. A variance component measures the inherent variability of its term, over and above the variability of the sub-units of which it is composed. As is typical, the largest unit (replicate) is more variable than its sub-units (lattice rows and columns within replicates). The remaining "units!R" component is the residual variance.
By default, fixed effects in ASReml-R4 are tested using sequential Wald tests. In this example, there are two terms in the summary table: the overall mean, (Intercept), and Variety. As the tests are sequential, the effect of Variety is assessed by calculating the change in sums of squares between the two models (Intercept) + Variety and (Intercept). The p-value (Pr(Chisq)) of < 2.2 × 10⁻¹⁶ indicates that Variety is highly significant.
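The idea behind the sequential test, namely the change in sums of squares between the nested models (Intercept) and (Intercept) + Variety, can be illustrated with a small numpy sketch. The data and design matrices below are hypothetical (three "varieties" with four plots each), not the Slate Hall data, and the fit is ordinary least squares rather than ASReml's REML machinery:

```python
import numpy as np

def extra_ss(y, X_small, X_full):
    """Extra sum of squares gained by moving from the smaller
    (nested) model to the full model, via ordinary least squares."""
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)
    return rss(X_small) - rss(X_full)

# Illustrative data: 3 hypothetical "varieties", 4 plots each
variety = np.repeat([0, 1, 2], 4)
y = np.array([10, 11, 9, 10, 14, 15, 13, 14, 20, 21, 19, 20], dtype=float)

X_intercept = np.ones((12, 1))                     # (Intercept) only
X_variety = np.column_stack([np.ones(12),          # (Intercept) + Variety
                             variety == 1,
                             variety == 2]).astype(float)

ss_change = extra_ss(y, X_intercept, X_variety)
print(round(ss_change, 2))  # 202.67
```

A large change in sums of squares relative to the residual variability is what drives the large Wald statistic, and hence the tiny p-value, reported for Variety.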
The predicted means for Variety can be obtained using the predict() function. The standard error of the difference (SED) between any pair of variety means is 62. Note: all variety means have the same standard error because the design is balanced.
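Why the SEDs are all equal can be seen from the textbook formula for a balanced design, SED = sqrt(2σ²/r), where σ² is an effective error variance and r the number of replicates per variety. The sketch below is only a back-of-envelope check (the effective variance of 11532 is a hypothetical value chosen to reproduce the reported SED of 62 with 6 replicates; ASReml's actual SED also reflects the recovery of inter-block information):

```python
from math import sqrt

def sed(sigma2, n_reps):
    """Standard error of the difference between two treatment means,
    each estimated from n_reps plots with effective error variance sigma2."""
    return sqrt(2 * sigma2 / n_reps)

# Hypothetical effective variance: 62**2 * 6 / 2 = 11532
print(round(sed(11532, 6), 1))  # 62.0
```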
Note: the same analysis is obtained when the random model is redefined as replicates (Rep), rows within replicates (Rep:Row) and columns within replicates (Rep:Column).
As the plots are laid out in a grid, the data can also be analyzed using a spatial model. We’ll illustrate spatial analysis by fitting a model with a separable first order autoregressive process in the field row (Row) and field column (Column) directions. This is often a useful model to start the spatial modeling process.
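The separable first-order autoregressive structure can be written as the Kronecker product of a row AR1 correlation matrix and a column AR1 correlation matrix, so that the correlation between two plots is ρ_row^|row distance| × ρ_col^|column distance|. A small numpy sketch (the correlation parameters below are the estimates reported later for this model, used here purely for illustration):

```python
import numpy as np

def ar1_corr(n, rho):
    """n x n first-order autoregressive correlation matrix:
    corr(i, j) = rho ** |i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# Separable structure: Kronecker product of row and column AR1 matrices
rho_row, rho_col = 0.458, 0.684
R = np.kron(ar1_corr(10, rho_row), ar1_corr(15, rho_col))  # 150 x 150

# Correlation between the plot in (row 1, col 1) and the plot in
# (row 2, col 3): rho_row**1 * rho_col**2
print(round(R[0, 1 * 15 + 2], 4))  # 0.2143
```

The key property of separability is that the full 150 × 150 correlation matrix never needs to be specified directly; it is determined by just the two AR1 parameters.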
The separable first order autoregressive spatial model is fitted in ASReml-R4 using the following code:
> spatial.asr <- asreml(fixed = yield ~ Variety, residual = ~ar1(Row):ar1(Column), data = data1)
The BIC for this spatial model is 1415.

The estimated variance components and sequential Wald tests show a residual variance of 38713, an estimated row correlation of 0.458 and an estimated column correlation of 0.684. As for the balanced lattice square model, there is strong evidence of a Variety effect (p-value < 2.2 × 10⁻¹⁶).
A log-likelihood ratio test cannot be used to compare the balanced lattice square model with the spatial model, as their variance models are not nested. However, the two models can be compared using BIC. As the spatial model has a smaller BIC (1415) than the balanced lattice square model (1435), it is the preferred of the two models explored in this blog. Selecting the optimal spatial model can nevertheless be difficult: the current spatial model could be extended by including measurement error (or nugget effect), or revised by selecting a different variance model for the spatial effects.
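The BIC used above follows the usual Schwarz form, BIC = -2·logL + p·log(n), with smaller values preferred. A hedged sketch (the parameter count and sample size in the example call are illustrative only; for a REML analysis the penalty is conventionally based on the number of variance parameters and the residual degrees of freedom, and ASReml's exact convention may differ):

```python
from math import log

def bic(loglik, n_params, n_obs):
    """Schwarz Bayesian Information Criterion: smaller is better."""
    return -2 * loglik + n_params * log(n_obs)

# Example call with illustrative counts (4 variance parameters, 150 obs)
print(round(bic(-707.786, 4, 150), 1))

# Comparing the two models by the BIC values reported in the text
models = {"lattice": 1435, "spatial": 1415}
best = min(models, key=models.get)
print(best)  # spatial
```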
Butler, D.G., Cullis, B.R., Gilmour, A. R., Gogel, B.G. and Thompson, R. (2017). ASReml-R Reference Manual Version 4. VSN International Ltd, Hemel Hempstead, HP2 4TP UK.
Gilmour, A.R., Thompson, R. & Cullis, B.R. (1995). Average Information REML, an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51, 1440-1450.
Dr. Vanessa Cave, 13 December 2021
ANOVA, LM, LMM, GLM, GLMM, HGLM? Which statistical method should I use?
Unsure which statistical method is appropriate for your data set? Want to know how the different methods relate to one another?
The simple diagram below may help you.
Treatment factor: Categorical explanatory variable defining the treatment groups. In an experiment, the experimental units are randomly assigned to the different treatment groups (i.e., the levels of the treatment factor).

Blocking variable: Factor created during the design of the experiment whereby the experimental units are arranged in groups (i.e., blocks) that are similar to one another. You can learn more about blocking in the blog Using blocking to improve precision and avoid bias.

Continuous predictor: A numeric explanatory variable (x) used to predict changes in a response variable (y). Check out the blog Pearson correlation vs simple linear regression to learn more.

Unbalanced design: An experimental design is unbalanced if there are unequal sample sizes for the different treatments. Genstat provides users with a tool to automatically determine whether ANOVA, LM (i.e., regression) or LMM (i.e., a REML analysis) is most appropriate for a given data set. Watch this YouTube video to learn more.

Temporal correlation: Occurs when repeated measurements have been taken on the same experimental unit over time, and thus measurements closer in time are more similar to one another than those further apart. To learn more, check out our blog A brief introduction to modelling the correlation structure of repeated measures data.

Spatial correlation: Occurs when experimental units are laid out in a grid, for example in a field trial or greenhouse, and experimental units that are closer together experience more similar environmental conditions than those which are further apart. For more information, read our blog A brief look at spatial modelling.

Random effects: Represent the effect of a sample of conditions observed from some wider population, where it is the variability of the population that is of interest. The blog FAQ: Is it a fixed or random effect? can help you understand the difference between fixed and random effects.
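The broad decision captured by the diagram can be sketched as a toy rule of thumb. This hypothetical helper is deliberately simplified: it ignores the ANOVA-vs-LM distinction (treatment factors vs continuous predictors), design balance, and the differences between GLMM and HGLM fitting approaches:

```python
def choose_method(normal_response, has_random_effects):
    """Toy decision rule: which class of model is a reasonable
    starting point for a given data set?"""
    if normal_response:
        # Normal response: classical methods, or REML if random
        # effects (e.g. blocks treated as random) are needed
        return "LMM" if has_random_effects else "ANOVA / LM"
    # Non-Normal response (e.g. binary or count data)
    return "GLMM / HGLM" if has_random_effects else "GLM"

print(choose_method(normal_response=True, has_random_effects=True))    # LMM
print(choose_method(normal_response=False, has_random_effects=False))  # GLM
```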
Dr Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems. Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.