The VSNi Team, 16 February 2022
In a completely randomized design (CRD), treatments are randomly allocated to the experimental units. The randomization ensures that each treatment is equally likely to be assigned to any given experimental unit. For a balanced design, the replication (i.e., the number of experimental units) is the same for each treatment. However, a CRD does not need to be balanced to be analysed.
|When there are known differences between the experimental units, you can improve precision and avoid bias by blocking. To find out more, check out the blog Using blocking to improve precision and avoid bias.|
Randomization helps us guard against bias. In the figure below, five treatments have been randomly assigned to 50 experimental units using a balanced design (i.e., 10 experimental units per treatment). The orange and green outline colours represent some inherent, but unknown, differences between the experimental units: for example, a difference in plant vigour in a plant trial, or a physiological difference between patients in a medical trial. Randomization reduces bias by spreading these differences evenly, on average, across the treatment groups.
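The allocation described above can be sketched in a few lines of Python. This is only an illustration: the treatment labels T1 to T5, the function name and the seed are assumptions, not part of the original experiment. Each label is repeated 10 times and the resulting list is shuffled over the 50 units, so every unit is equally likely to receive any treatment.

```python
import random
from collections import Counter

def randomize_crd(treatments, reps, seed=None):
    """Randomly allocate treatments to units for a balanced CRD."""
    rng = random.Random(seed)
    # One label per experimental unit: each treatment repeated `reps` times.
    allocation = [t for t in treatments for _ in range(reps)]
    rng.shuffle(allocation)  # random order = random assignment to units 1..N
    return allocation

treatments = ["T1", "T2", "T3", "T4", "T5"]          # hypothetical labels
allocation = randomize_crd(treatments, reps=10, seed=2022)
counts = Counter(allocation)

print(allocation[:10])   # treatments assigned to the first 10 units
print(counts)            # replication per treatment
```

Position `i` of `allocation` is the treatment for unit `i + 1`; the counts confirm that the design is balanced.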
Let’s find out how to analyse data from a CRD study.
Our example data are from an experiment in plant physiology, published by Sokal and Rohlf (1995). The lengths of pea sections (the dependent, or response, variable) grown in a tissue culture were recorded. The purpose of the experiment was to test the effects of various sugar media (the independent, or explanatory, variable) on mean pea section length. A balanced CRD was used with 10 replicates per treatment level.
The data are given in the table below:
Note: Prior to analysis we’ll need to arrange the data in long-format, with Treatment a factor with 5 levels (the 4 sugar treatments plus a control) and Length a variate.
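As a rough sketch of the long-format arrangement, the Python snippet below turns one list of lengths per treatment into one (Treatment, Length) row per experimental unit. The data values are our own transcription of the Sokal and Rohlf (1995) pea data and should be treated as an assumption to be checked against the table; they do reproduce the treatment means quoted later in this blog.

```python
# Wide layout: one list of 10 pea section lengths (ocular units) per treatment.
# Values are a transcription to be verified against the published table.
wide = {
    "Control": [75, 67, 70, 75, 65, 71, 67, 67, 76, 68],
    "2% Glucose": [57, 58, 60, 59, 62, 60, 60, 57, 59, 61],
    "2% Fructose": [58, 61, 56, 58, 57, 56, 61, 60, 57, 58],
    "1% Glucose + 1% Fructose": [58, 59, 58, 61, 57, 56, 58, 57, 57, 59],
    "2% Sucrose": [62, 66, 65, 63, 64, 62, 65, 65, 62, 67],
}

# Long layout: one row per experimental unit, with Treatment as the factor
# and Length as the variate.
long_rows = [
    (treatment, length)
    for treatment, lengths in wide.items()
    for length in lengths
]

for row in long_rows[:3]:
    print(row)
print(len(long_rows), "rows")
```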
Dot histograms of the data from each treatment indicate that:
➣ Mean pea section length is greatest for the Control treatment.
➣ Mean pea section length for the Sucrose treatment is greater than for the 3 other sugar media.
➣ The Control data may be more variable than data from the sugar media treatments.
We’ll consider two different models for this CRD data set:
Model 1: Constant variance for all treatment groups.
This simple model comprises a fixed term for the treatments and a single residual (or error) variance.
Model 2: Unequal variance for the treatment groups.
This more complex model comprises a fixed term for the treatments and a separate residual variance for each treatment group.
# Model 1
REML [PRINT=model,components,means,waldTests; PSE=alldifferences] Length
# Model 2
VCOMPONENTS [FIXED=Treatment; EXPERIMENTS=Treatment]
REML [PRINT=model,components,means,waldTests; PSE=alldifferences] Length
|The estimated variance component (i.e., the residual variance) and its standard error are 5.46 ± 1.15.|
|The Wald test provides strong statistical evidence of a Treatment effect (P < 0.001), that is, the mean pea section length differs between the treatments.|
|The estimated means are 70.1, 59.3, 58.2, 58.0, and 64.1 in ocular units for Control, 2% Glucose, 2% Fructose, 1% Glucose + 1% Fructose, and 2% Sucrose, respectively. The standard error of the difference between treatment means is 1.045.|
|For our inference to be valid, it is very important that we check the residuals for evidence of departures from the residual assumptions of normality and constant variance.|
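The two standard errors quoted for Model 1 can be reproduced from the residual variance alone. The sketch below assumes the usual large-sample formulas for a balanced CRD: the standard error of a variance estimate on ν degrees of freedom is s²·√(2/ν), and the standard error of a difference between two means, each based on r replicates, is √(2s²/r). Here ν = 50 − 5 = 45 and r = 10 follow from the design.

```python
import math

residual_variance = 5.46                 # Model 1 estimate quoted above
n_units, n_treatments, reps = 50, 5, 10
residual_df = n_units - n_treatments     # 45 residual degrees of freedom

# Approximate standard error of the residual variance estimate.
se_variance = residual_variance * math.sqrt(2 / residual_df)

# Standard error of the difference between two treatment means.
sed = math.sqrt(2 * residual_variance / reps)

print(round(se_variance, 2))
print(round(sed, 3))
```

Both values agree, after rounding, with the 1.15 and 1.045 reported in the output.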
For normality:
➣ The histogram should have a reasonably symmetric bell-shape.
➣ The normal plot should form approximately a straight line.
For constant variance:
➣ The spread of the residuals in the scatterplots against fitted values and units should be roughly equal over the range of the data.
Here, there is some evidence of non-constant variance; in particular, the residual variance appears to be greater for the Control treatment. Model 2 will allow for this unequal residual variance. Let’s look at the results from fitting Model 2:
Indicated by the yellow flag are the estimated residual variances for each treatment. We can see that the estimated variance for the Control treatment group is much higher than for the other treatments (albeit with a large standard error). Consequently, the standard error of the mean for the Control group is higher than for the other treatments, as are the standard errors of differences involving the Control. A formal comparison between Models 1 and 2 can be performed using a likelihood ratio test, but this is the topic of another blog!
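To see where the separate variances in Model 2 come from, here is a hedged Python sketch. It assumes that, in this simple one-way layout with only a Treatment fixed effect, the REML estimate of each group's residual variance coincides with that group's sample variance (divisor n − 1). The data values are our transcription of the Sokal and Rohlf (1995) pea data and should be verified against the table before being relied on.

```python
import math
from statistics import variance

# Transcribed pea section lengths (ocular units); verify against the source.
data = {
    "Control": [75, 67, 70, 75, 65, 71, 67, 67, 76, 68],
    "2% Glucose": [57, 58, 60, 59, 62, 60, 60, 57, 59, 61],
    "2% Fructose": [58, 61, 56, 58, 57, 56, 61, 60, 57, 58],
    "1% Glucose + 1% Fructose": [58, 59, 58, 61, 57, 56, 58, 57, 57, 59],
    "2% Sucrose": [62, 66, 65, 63, 64, 62, 65, 65, 62, 67],
}

# Separate residual variance per treatment group (Model 2 style).
group_var = {t: variance(lengths) for t, lengths in data.items()}

# Standard error of each group mean, using that group's own variance.
group_sem = {t: math.sqrt(group_var[t] / len(data[t])) for t in data}

# Pooled variance (Model 1 style): weight each group by its degrees of freedom.
total_df = sum(len(v) - 1 for v in data.values())
pooled_var = sum((len(data[t]) - 1) * group_var[t] for t in data) / total_df

for t in data:
    print(f"{t}: variance = {group_var[t]:.2f}, SE of mean = {group_sem[t]:.2f}")
print(f"pooled variance = {pooled_var:.2f}")
```

With these values the Control group has by far the largest variance, and the pooled variance rounds to the 5.46 reported for Model 1.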
Sokal, R. R. and Rohlf, F. J. (1995). Biometry: the principles and practice of statistics in biological research. 3rd Edition, W.H. Freeman, New York, USA.
Dr. Andrew Illius and Dr. Nick Savill, 20 July 2022
Quantification holds the key to controlling disease
Since 2018, Andrew Illius and Nick Savill have studied the epidemiology and control of maedi-visna virus (MV) in sheep, using published data and Genstat to find ways of controlling this incurable disease.
When one of your sheep gets diagnosed with an incurable disease, you have to worry. How quickly will symptoms develop, and welfare and productivity suffer? And how soon will it spread throughout the rest of the flock? The disease in question is maedi-visna (MV, see Box 1), notorious for its impact in Iceland, where the disease was first described; extreme measures over 20 years were required before it was finally eliminated. Culling seropositive animals is the main means of control. For the farmer, the crucial question is whether living with the disease would be more expensive than trying to eradicate it. We are addressing such questions by analysing data from long-term experiments.
Box 1: MV – the tip of an iceberg?
Putting aside for a moment MV’s fearsome reputation, the way the pathogen works is fascinating. The small ruminant lentiviruses (SRLV, family retroviridae) are recognised as a heterogeneous group of viruses that infect sheep, goats and wild ruminants. Lentiviruses target the immune system, but SRLV does not target T-cells in the manner of immune deficiency lentiviruses such as HIV. Instead, SRLV infects monocytes (a type of white blood cell) which infiltrate the interstitial spaces of target organs (such as the lungs, mammary glands, or the synovial tissue of joints) carrying proviral DNA integrated into the host cell genome and hence invisible to the immune system. Virus replication commences following maturation of monocytes into macrophages, and the ensuing immune response eventually shows up as circulating antibodies (termed seroconversion). But it also causes inflammation that attracts further macrophages, slowly and progressively building into chronic inflammatory lesions and gross pathology. These take years to present clinical symptoms, hence the name lentivirus (from the Latin lentus, slow). By the time clinical signs become evident in a flock, the disease will have become well-established, with perhaps 30-70% of the flock infected. That is why MV is called one of the iceberg diseases of sheep – for every obviously affected individual, there are many others infected, but without apparent symptoms.
A large body of research into the pathology, immunology and molecular biology of small ruminant lentiviruses (SRLV) exists, as might be expected given its economic significance, welfare implications and its interest as a model for HIV. The main route of transmission of the virus is thought to be horizontal, via exhaled droplets of the highly infectious fluid from deep in the lungs of infected animals, suggesting a risk from prolonged close contact, for example in a sheep shed. But despite all the research into disease mechanisms, we were surprised to find that there has been almost no quantitative analysis of SRLV epidemiology, nor even an estimation of the rate of SRLV transmission under any management regime. So, our first foray into the data aimed to rectify this.
We found an experiment published in 1987 with excellent detail on a five-year timecourse of seroconversions in a small infected sheep flock, and a further trawl of the Internet netted a PhD thesis that built on this with a focussed experiment. Karianne Lievaart-Peterson, its author, runs the Dutch sheep health scheme and became a collaborator. We also worked with Tom McNeilly, an immunologist at the Moredun Research Institute.
Nick Savill, a mathematical epidemiologist at Edinburgh University, did the hard work of developing and parameterising a mathematical model based on infectious disease epidemiology and a priori known and unknown aspects of SRLV biology. The model determines the probability of a susceptible ewe seroconverting when it did, and of a susceptible ewe not seroconverting before it was removed from the flock or the experiment ended. The product of these probabilities gives the likelihood of the data given the model. The model was prototyped in Python and then written in C for speed.
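The way a likelihood is built from "seroconverted when it did" and "had not seroconverted before leaving" probabilities can be illustrated with a deliberately simplified Python sketch. It is not the authors' model: it assumes discrete time periods, a single hypothetical transmission rate `beta`, a known number of infectious ewes in each period, and no delay between infection and seroconversion.

```python
import math

def log_likelihood(beta, ewes, infectious_per_period):
    """Log-likelihood of seroconversion records for a transmission rate beta.

    ewes: list of (last_period, seroconverted) tuples. A ewe is observed in
    periods 0..last_period and either seroconverts in last_period or leaves
    (removed, or experiment ended) after it without seroconverting.
    """
    total = 0.0
    for last_period, seroconverted in ewes:
        for t in range(last_period + 1):
            # Probability of escaping infection during period t.
            escape = math.exp(-beta * infectious_per_period[t])
            if seroconverted and t == last_period:
                total += math.log(1.0 - escape)   # infected in this period
            else:
                total += math.log(escape)         # stayed susceptible
    return total

# Hypothetical records: numbers of infectious ewes in periods 0, 1, 2.
infectious = [2, 3, 5]
records = [(1, True), (2, False), (0, True)]

# Crude grid search for the rate that maximises the likelihood.
grid = [i / 1000 for i in range(1, 1000)]
best_beta = max(grid, key=lambda b: log_likelihood(b, records, infectious))
print(best_beta, log_likelihood(best_beta, records, infectious))
```

Summing log-probabilities is equivalent to taking the product of the probabilities described above, and avoids numerical underflow when there are many ewes.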
The striking result of this research is that MV is a disease of housing. Even brief periods of housing allow the virus to spread rapidly, but transmission is negligible between sheep kept on pasture. So, although individual sheep never recover from the disease, it could be eliminated from flocks over time by exploiting the fact that transmission of the virus is too slow between grazing sheep to sustain the disease.
Our second striking result suggests the disease is unlikely to be spread by newly-infected animals, contrary to general expectation. We estimated that the time between an animal being infected and becoming infectious is about a year. This delay, termed epidemiological latency, is actually longer than the estimated time delay between infection and seroconversion.
We can now begin to see more clearly how disease processes occurring in the infected individual shape what happens at the flock, or epidemiological, level. It seems that, after a sheep becomes infected, the disease slowly progresses to the point when there is sufficient free virus to be recognised by the immune system, but then further development of inflammatory lesions in the lungs has to take place before there are sufficient quantities of infective alveolar macrophages and free virus for transmission by the respiratory route. There follows a further delay, perhaps of some years, before the disease has advanced to the stage at which symptoms such as chronic pneumonia and hardening of the udder become apparent.
Infectiousness is expected to be a function of viral load, and although we do not know the timecourse of viral load, it seems most likely that it continues to increase throughout the development of chronic disease. This suggests to us that the infectiousness of an individual is not constant, but is likely to increase as the disease progresses and symptoms emerge.
We are interested in learning how infectiousness changes over the course of an individual’s infection because of the implications at the epidemiological level. Time delays in seroconversion merely make the disease more difficult to detect and control, but the epidemiological significance of a time delay in the development of infectiousness is that it acts to slow the spread of the virus. And if ewes with long-standing infections are the most infectious, they pose the greatest risk to uninfected sheep. This would present an opportunity for the management of exposure to slow the spread of disease. For example, if ewes in their last year of production are the most infectious, then young ewes should be kept away from them when housed – an idea supported by preliminary analysis using individual-based modelling (IBM – see Box 2). Separation of younger animals from older ones may reduce the prevalence of infection to the point where the costs of disease, in terms of lost production and poor welfare, are not overwhelming or at least are less than the cost of attempting to eliminate the disease – we discuss this later in this blog.
So far, there is only very limited and tentative evidence of increasing infectiousness in the literature, and direct experimental evidence would be very hard to come by. But it is plausible that disease severity, viral load and impaired productivity are all related to the extent of inflammatory lesions in the lungs. This suggests that measurably-impaired productivity in infected sheep could be used as a proxy for viral load, and hence infectiousness. And that brings us to our current project.
Box 2: Individual-based modelling
This is a technique to explore the consequences of probabilistic events, such as becoming infected by SRLV. The flock of ewes is modelled as individuals, and their progress through life is followed. Flock replacements are taken from the ewe lambs born to the flock; all other lambs are sold.
The figure shows the mean results (green line) of 1000 iterations of a stochastic simulation of SRLV prevalence in a flock of 400 ewes housed in groups of 100 for one month per year. The probability that an infected ewe will transmit the virus is modelled as rising exponentially with time since infection. The management regime we modelled was to segregate ewes during housing into each of their four age groups (2, 3, 4 and 5 years old) in separate pens, and to sell all the lambs of the oldest ewes, rather than retain any as flock replacements. From an initial starting prevalence of 275 infected ewes, the virus is virtually eliminated from the flock.
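A toy version of such an individual-based simulation is sketched below in Python. It is not the model behind the figure. The flock size, pen size and starting prevalence come from the text, but the transmission parameters (`base_rate`, `growth`), the dam-to-daughter infection probability (`p_dam_to_lamb`), the initial times since infection and the lack of any delay before replacements join the flock are all hypothetical simplifications.

```python
import math
import random

def simulate_flock(years=15, pen_size=100, initial_infected=275,
                   base_rate=0.0005, growth=0.5, p_dam_to_lamb=0.1, seed=1):
    """Return the number of infected ewes at the start and after each year.

    Each ewe is stored as its years since infection (None if uninfected).
    All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    ages = [2, 3, 4, 5]
    pens = {age: [None] * pen_size for age in ages}

    # Place the initial infections at random across the whole flock.
    slots = [(age, i) for age in ages for i in range(pen_size)]
    for age, i in rng.sample(slots, initial_infected):
        pens[age][i] = rng.randint(0, 2)

    def count_infected():
        return sum(t is not None for pen in pens.values() for t in pen)

    prevalence = [count_infected()]
    for _year in range(years):
        # Housing: transmission only within an age-group pen; infectiousness
        # rises exponentially with time since infection.
        for age in ages:
            force = sum(base_rate * math.exp(growth * t)
                        for t in pens[age] if t is not None)
            p_infection = 1.0 - math.exp(-force)
            pens[age] = [0 if t is None and rng.random() < p_infection else t
                         for t in pens[age]]

        # Replacements are daughters of ewes aged 2 to 4 (all lambs of the
        # oldest ewes are sold); a daughter of an infected dam is assumed to
        # be infected with probability p_dam_to_lamb.
        dams = pens[2] + pens[3] + pens[4]
        replacements = []
        for _lamb in range(pen_size):
            dam = rng.choice(dams)
            infected = dam is not None and rng.random() < p_dam_to_lamb
            replacements.append(0 if infected else None)

        # Ageing: the oldest ewes leave, the other groups move up a pen and
        # infected ewes gain a year since infection.
        for age in (5, 4, 3):
            pens[age] = [None if t is None else t + 1 for t in pens[age - 1]]
        pens[2] = replacements

        prevalence.append(count_infected())
    return prevalence

# Average over repeated stochastic runs, as in the figure (fewer runs here).
runs = [simulate_flock(seed=s) for s in range(50)]
mean_prevalence = [sum(year) / len(runs) for year in zip(*runs)]
print([round(p) for p in mean_prevalence])
```

Because the pens never mix and the daughters of the oldest ewes are not retained, prevalence in this sketch falls as the original cohorts leave the flock; how fast it falls depends entirely on the assumed parameters.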
Eliminating SRLV from an infected flock involves either repeated testing of the whole flock, culling reactors and perhaps also artificially rearing lambs removed at birth, or entirely replacing the flock with accredited disease-free stock. So, the cost of eliminating the virus from a flock can be huge. But what about the costs of living with it? These costs arise from poor welfare leading to lost production: lactation failure, reduced lamb growth and excess lamb and ewe mortality. But under what conditions are they so substantial as to warrant an elimination strategy? That depends, again, on answers at two levels: what are the production losses for each infected ewe, and how prevalent is the disease in the flock?
We have a number of reasons to want to quantify how the costs of being SRLV+ vary over the time-course of the disease. First, it is reasonable to assume that production losses will be related to the emergence of symptoms in affected sheep, but this has never been adequately quantified. Second, if production losses are a function of the duration of infection, and we can regard them as a proxy for viral load, then it would support the idea that infectiousness also increases as the disease progresses. And third, if production losses are only apparent in sheep with long-standing infections, which is therefore restricted to older sheep, then management could focus on early detection of symptoms and culling of older ewes.
We are quantifying these processes using a large dataset from colleagues at the Lublin University of Life Sciences. Their six-year study was designed to assess the response of production parameters to SRLV infection in a flock of breeding sheep kept under standard Polish husbandry conditions. They published results suggesting that infection with SRLV was associated with higher rates of age-specific ewe mortality, especially in older ewes.
The data comprise lambing records for about 800 ewes from three breeds, with over 300 ewes being present each year and a few being present every year. There are also records from about 2800 lambs born during the trial. Ewes were blood-tested in November and June each year, and all SRLV+ ewes were housed together following the November test until the lambs were weaned in April. SRLV- ewes were housed in the same shed, but segregated from the SRLV+ group. We were able to group the ewes on the basis of the series of blood test results as: (1) seroconverted before the trial began, (2) had not seroconverted by the end, and (3) seroconverted during the trial and for whom a time since seroconversion can be estimated to within about six months.
Given the nature of the data (unbalanced design, multiple observations from individual ewes and rams over several years, and different breeds), we used Genstat to fit mixed models to distinguish random and fixed effects. We were given access to Genstat release 22 beta, which adds greater functionality for displaying and saving output, producing predictions and visualising the fit of the model.
The example below addresses pre-weaning lamb mortality (mort, either 0 or 1). We are using a generalized linear mixed model where significant fixed terms were added stepwise. The ewes and rams used to produce these lambs are obvious random terms because they can be regarded as being drawn at random from a large population. There also appears to be a strong ewe.ram interaction, with some combinations faring differently from others. We included ‘year’ as a random term because, over the six years in which data were collected, factors such as flock size and forage quality varied somewhat randomly.
The fixed terms show that the probability of mortality is strongly affected by lamb birthweight (lambbirthwt). A quadratic term (lb2) models the well-known reduction in lamb survival in very large lambs, a consequence of birth difficulties. The age of the ewe, fitted as a factor (eweageF), is the next most significant fixed effect, followed by the SRLV status of the ewe tested in November prior to lambing (ewetestNov). The interaction term of ewe age with SRLV status is highly significant, showing that the way the ageing process in ewes affects the probability of their lambs’ mortality differs according to SRLV status. From the table of back-transformed means, we see that the probability of lamb mortality ranges between about 0.02 and 0.04 in SRLV- ewes aged from 2 to 5 years, perhaps declining in older ewes. SRLV+ ewes show similar lamb mortality at ages 2-4, but a progressive increase as ewes age further, almost doubling each year.
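Two ideas in this paragraph, back-transformation and the quadratic birthweight effect, can be illustrated with a small Python sketch. It assumes a logit link (the link function is not stated here) and uses made-up coefficients `b0`, `b1` and `b2` purely for illustration; they are not estimates from the study.

```python
import math

def inverse_logit(eta):
    """Back-transform a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients: intercept, lambbirthwt and lb2 (birthweight squared).
b0, b1, b2 = 6.0, -4.0, 0.4

def mortality_probability(birthweight_kg):
    """Predicted probability of pre-weaning mortality for a given birthweight."""
    eta = b0 + b1 * birthweight_kg + b2 * birthweight_kg ** 2
    return inverse_logit(eta)

# With b2 > 0 the linear predictor is U-shaped, so mortality is lowest at the
# vertex of the quadratic and rises for both very small and very large lambs.
safest_weight = -b1 / (2 * b2)

for weight in (2.0, safest_weight, 8.0):
    print(f"{weight:.1f} kg: P(mortality) = {mortality_probability(weight):.3f}")
```

The same inverse-logit step is what turns predicted means on the model scale into back-transformed probabilities such as those in the table.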
This preliminary analysis provides some evidence that the costs of being infected by SRLV are, indeed, progressive with age. There is some way to go yet to show whether sheep with longer-standing SRLV infection have higher viral loads and are more infectious, but our current research does point to a potentially better way of controlling the disease by targeting older animals. Maedi-visna isn’t called a progressive disease for nothing, and we should be able to exploit that.
We finally submitted our paper for publication in November 2019, just before the Covid 19 pandemic. One might have thought that a paper on the epidemiology and control of a respiratory virus spread by aerosol, especially indoors in close proximity and with a recommendation for social distancing, would have seemed quite topical. Ironically, early 2020 saw the subeditors and reviewers of such work all being urgently re-allocated to analysing information on the burgeoning pandemic. But we got there eventually and by October we were proudly writing a press release recommending social distancing … in sheep.
Andrew Illius writes, “My experience of Genstat dates back to the early 1980s when I think Release 3 was current. It was hard going, and we queued up to have our fault codes diagnosed at the Genstat Clinic. But having learnt Fortran programming in the punched cards era, I was used to it taking several days to get a job to run. Genstat’s exacting requirements were reassuring and it became indispensable over the following years of agricultural and ecological research. By the 1990s we had learnt that mixed models were required to account separately for random and fixed effects in unbalanced data, and I’d been on a REML course. I was especially proud to use REML as my main analytical procedure thereafter because Robin Thompson invented it just down the corridor where we work in the Zoology building at Edinburgh University, and where he worked with the Animal Breeding group. It’s been a tremendous pleasure to get back to Genstat recently after many years away – like greeting an old friend. In the past, I wrote and submitted batch jobs on a UNIX mainframe before collecting some line-printer output on my way home. Now things have really speeded up, with the menu-driven environment of the Windows version. It’s a fantastic improvement, and a pleasure to use.”
Andrew Illius is Emeritus Prof of Animal Ecology in the Institute of Evolutionary Biology, University of Edinburgh, where he taught animal production and animal ecology from 1978 to 2008 and was latterly Head of the School of Biological Sciences. Most of his work has been on the ecology and management of grazing systems and the ecophysiology and behaviour of grazing animals. He retired in 2008 to spend more time with his sheep, keeping about 400 breeding ewes. Familiarity with sheep diseases led to collaboration with Nick Savill since 2018 on the epidemiology and control of MV.
Nick Savill is a Senior Lecturer at the Institute of Immunology and Infection Research, University of Edinburgh. He teaches a range of quantitative skills to undergraduate biological science students including maths, stats, data analysis and coding. His research interests are in mathematical modelling of infectious disease epidemiology. He has worked on foot and mouth disease, avian influenza, malaria, trypanosomiasis and, most recently, maedi-visna with Andrew Illius.
Illius AW, Lievaart-Peterson K, McNeilly TN, Savill NJ (2020) Epidemiology and control of maedi-visna virus: Curing the flock. PLoS ONE 15 (9): e0238781. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238781
Kanchana Punyawaew, 01 March 2021
Linear mixed models: a balanced lattice square
This blog illustrates how to analyze data from a field experiment with a balanced lattice square design using linear mixed models. We’ll consider two models: the balanced lattice square model and a spatial model.
The example data are from a field experiment conducted at Slate Hall Farm, UK, in 1976 (Gilmour et al., 1995). The experiment was set up to compare the performance of 25 varieties of barley and was designed as a balanced lattice square with six replicates laid out in a 10 x 15 rectangular grid. Each replicate contained exactly one plot for every variety. The variety grown in each plot, and the coding of the replicates and lattice blocks, is shown in the field layout below:
There are seven columns in the data frame: five blocking factors (Rep, RowRep, ColRep, Row, Column), one treatment factor, Variety, and the response variate, yield.
The six replicates are numbered from 1 to 6 (Rep). The lattice block numbering is coded within replicates. That is, within each replicate the lattice rows (RowRep) and lattice columns (ColRep) are both numbered from 1 to 5. The Row and Column factors define the row and column positions within the field (rather than within each replicate).
To analyze the response variable, yield, we need to identify the two basic components of the experiment: the treatment structure and the blocking (or design) structure. The treatment structure consists of the set of treatments, or treatment combinations, selected to study or to compare. In our example, there is one treatment factor with 25 levels, Variety (i.e. the 25 different varieties of barley). The blocking structure of replicates (Rep), lattice rows within replicates (Rep:RowRep), and lattice columns within replicates (Rep:ColRep) reflects the balanced lattice square design. In a mixed model analysis, the treatment factors are (usually) fitted as fixed effects and the blocking factors as random.
The balanced lattice square model is fitted in ASReml-R4 using the following code:
> lattice.asr <- asreml(fixed = yield ~ Variety, random = ~ Rep + Rep:RowRep + Rep:ColRep, data=data1)
The REML log-likelihood is -707.786.
The model’s BIC is:
The estimated variance components are:
The table above contains the estimated variance components for all terms in the random model. The variance component measures the inherent variability of the term, over and above the variability of the sub-units of which it is composed. The variance components for Rep, Rep:RowRep and Rep:ColRep are estimated as 4263, 15596, and 14813, respectively. As is typical, the largest unit (replicate) is more variable than its sub-units (lattice rows and columns within replicates). The "units!R" component is the residual variance.
By default, fixed effects in ASReml-R4 are tested using sequential Wald tests:
In this example, there are two terms in the summary table: the overall mean, (Intercept), and Variety. As the tests are sequential, the effect of Variety is assessed by calculating the change in sums of squares between the two models (Intercept)+Variety and (Intercept). The p-value (Pr(Chisq)) of < 2.2 × 10^-16 indicates that the Variety effect is highly significant.
The predicted means for Variety can be obtained using the predict() function. The standard error of the difference between any pair of variety means is 62. Note: all variety means have the same standard error as the design is balanced.
Note: the same analysis is obtained when the random model is redefined as replicates (Rep), rows within replicates (Rep:Row) and columns within replicates (Rep:Column).
As the plots are laid out in a grid, the data can also be analyzed using a spatial model. We’ll illustrate spatial analysis by fitting a model with a separable first order autoregressive process in the field row (Row) and field column (Column) directions. This is often a useful model to start the spatial modeling process.
The separable first order autoregressive spatial model is fitted in ASReml-R4 using the following code:
> spatial.asr <- asreml(fixed = yield ~ Variety, residual = ~ar1(Row):ar1(Column), data = data1)
The BIC for this spatial model is:
The estimated variance components and sequential Wald tests are:
The residual variance is 38713, the estimated row correlation is 0.458, and the estimated column correlation is 0.684. As for the balanced lattice square model, there is strong evidence of a Variety effect (p-value < 2.2 × 10^-16).
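To get a feel for what these estimates imply, the Python sketch below computes the correlation between the residuals of two plots under a separable AR1 × AR1 structure. It assumes that the quoted row and column correlations are the lag-one autocorrelations between adjacent field rows and adjacent field columns; the function name is our own.

```python
def separable_ar1_correlation(row_lag, col_lag, rho_row=0.458, rho_col=0.684):
    """Correlation between two plots separated by the given row and column lags.

    Under a separable AR1 x AR1 model the correlation is the product of the
    row autocorrelation raised to the row lag and the column autocorrelation
    raised to the column lag.
    """
    return rho_row ** abs(row_lag) * rho_col ** abs(col_lag)

for row_lag, col_lag in [(1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]:
    corr = separable_ar1_correlation(row_lag, col_lag)
    print(f"row lag {row_lag}, column lag {col_lag}: correlation = {corr:.3f}")
```

Correlation decays geometrically with distance in each direction, and more slowly along the direction with the larger estimate (here, between columns).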
A log-likelihood ratio test cannot be used to compare the balanced lattice square model with the spatial model, as the variance models are not nested. However, the two models can be compared using BIC. As the spatial model has a smaller BIC (1415) than the balanced lattice square model (1435), of the two models explored in this blog, it is chosen as the preferred model. However, selecting the optimal spatial model can be difficult. The current spatial model can be extended by including measurement error (or nugget effect) or revised by selecting a different variance model for the spatial effects.
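As a rough check on the BIC quoted for the lattice model, the Python sketch below recomputes it from the REML log-likelihood. It assumes the REML-based definition BIC = −2 logL + t·log(ν), where t is the number of estimated variance parameters and ν the residual degrees of freedom, and that the lattice model has four variance parameters (Rep, Rep:RowRep, Rep:ColRep and the residual). Both assumptions should be checked against the ASReml-R documentation.

```python
import math

def reml_bic(log_likelihood, n_variance_params, residual_df):
    """BIC from a REML log-likelihood (assumed definition, see text)."""
    return -2.0 * log_likelihood + n_variance_params * math.log(residual_df)

n_plots, n_varieties = 150, 25           # 10 x 15 grid, 25 varieties
residual_df = n_plots - n_varieties      # 125

lattice_bic = reml_bic(-707.786, n_variance_params=4, residual_df=residual_df)
print(round(lattice_bic))
```

Under these assumptions the result rounds to the 1435 reported for the balanced lattice square model. Because both models share the same fixed effects, their REML-based criteria are comparable.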
Butler, D.G., Cullis, B.R., Gilmour, A. R., Gogel, B.G. and Thompson, R. (2017). ASReml-R Reference Manual Version 4. VSN International Ltd, Hemel Hempstead, HP2 4TP UK.
Gilmour, A.R., Thompson, R. & Cullis, B.R. (1995). Average Information REML, an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51, 1440-1450.
Dr. Vanessa Cave, 13 December 2021
ANOVA, LM, LMM, GLM, GLMM, HGLM? Which statistical method should I use?
Unsure which statistical method is appropriate for your data set? Want to know how the different methods relate to one another?
The simple diagram below may help you.
| Term | Definition |
|---|---|
| Treatment factor | Categorical explanatory variable defining the treatment groups. In an experiment, the experimental units are randomly assigned to the different treatment groups (i.e., the levels of the treatment factor). |
| Blocking variable | Factor created during the design of the experiment whereby the experimental units are arranged in groups (i.e., blocks) that are similar to one another. You can learn more about blocking in the blog Using blocking to improve precision and avoid bias. |
| Continuous predictor | A numeric explanatory variable (x) used to predict changes in a response variable (y). Check out the blog Pearson correlation vs simple linear regression to learn more. |
| Unbalanced design | An experimental design is unbalanced if there are unequal sample sizes for the different treatments. Genstat provides users with a tool to automatically determine whether ANOVA, LM (i.e., regression) or LMM (i.e., a REML analysis) is most appropriate for a given data set. Watch this YouTube video to learn more. |
| Temporal correlation | Occurs when repeated measurements have been taken on the same experimental unit over time, and thus measurements closer in time are more similar to one another than those further apart. To learn more, check out our blog A brief introduction to modelling the correlation structure of repeated measures data. |
| Spatial correlation | Occurs when experimental units are laid out in a grid, for example in a field trial or greenhouse, and experimental units that are closer together experience more similar environmental conditions than those which are further apart. For more information, read our blog A brief look at spatial modelling. |
| Random effects | Represents the effect of a sample of conditions observed from some wider population, and it is the variability of the population that is of interest. The blog FAQ: Is it a fixed or random effect? can help you understand the difference between fixed and random effects. |
Dr Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems. Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.