The VSNi Team
29 June 2022Most breeding programs are supported by an array of genomic information that will provide new options to increase the rates of genetic gain. However, performing statistical analyses with molecular data can be a difficult task. This type of data has to communicate properly with available phenotypic and pedigree data. The overall success of this integration depends on a set of checks, verifications, filters, and careful preparation of all these datasets in order to be able to fit genetic models successfully and to obtain the required output to make correct decisions.
The workflow of molecular data-driven analysis varies based on the source of the datasets and of course, on personal preferences. Nevertheless, regardless of these aspects, an efficient genomics pipeline should rely on answering some of the following questions:
We developed ASRgenomics to help deal with the above questions. This is a free to use R library which can be downloaded from the ASReml knowledgbase. It is a compilation of proven routines developed over several years of study and hands-on experience in the field. ASRgenomics was built with advanced statistical modeling in mind and it fills a gap by helping you make sure your analyses are as efficient and accurate as they can be with several explicit diagnostic tools.
The package is aimed at geneticists and breeders with the purpose of improving their experience with genomic analyses, such as Genomic Selection (GS) and Genome Wide Association Studies (GWAS), in a straightforward and efficient manner. The main capabilities of the package include:
The functions included within ASRgenomics are very flexible and can be used for a tailored workflow from raw molecular data to well-behaved model-ready matrices. Additionally, ASRgenomics is capable of seamlessly preparing genomic datasets for integration with ASReml-R to fit linear mixed models (LMMs; e.g., GBLUP or ssGBLUP).
Please try this free library and check out the user guide included withinin the doc folder inside the download package for a walk-though of the features along with details of the methods.
Related Reads
Dr. Andrew Illius and Dr. Nick Savill
20 July 2022Quantification holds the key to controlling disease
Background
Andrew Illius with Nick Savill have, since 2018, studied the epidemiology and control of maedi-visna virus (MV) in sheep and have been looking at understanding and finding ways of controlling this incurable disease. Accessing published data and with the use of Genstat, they aimed to find ways of controlling MV.
When one of your sheep gets diagnosed with an incurable disease, you have to worry. How quickly will symptoms develop, and welfare and productivity suffer? And how soon will it spread throughout the rest of the flock? The disease in question is maedi-visna (MV, see Box 1), notorious for its impact in Iceland, where the disease was first described; extreme measures over 20 years were required before it was finally eliminated. Culling seropositive animals is the main means of control. For the farmer, the crucial question is whether living with the disease would be more expensive than trying to eradicate it. We are addressing such questions by analysing data from long-term experiments.
1 MV – the tip of an iceberg?Putting aside for a moment MV’s fearsome reputation, the way the pathogen works is fascinating. The small ruminant lentiviruses (SRLV, family retroviridae) are recognised as a heterogeneous group of viruses that infect sheep, goats and wild ruminants. Lentiviruses target the immune system, but SRLV does not target T-cells in the manner of immune deficiency lentiviruses such as HIV. Instead, SRLV infects monocytes (a type of white blood cell) which infiltrate the interstitial spaces of target organs (such as the lungs, mammary glands, or the synovial tissue of joints) carrying proviral DNA integrated into the host cell genome and hence invisible to the immune system. Virus replication commences following maturation of monocytes into macrophages, and the ensuing immune response eventually shows up as circulating antibodies (termed seroconversion). But it also causes inflammation that attracts further macrophages, slowly and progressively building into chronic inflammatory lesions and gross pathology. These take years to present clinical symptoms, hence the name lentivirus (from the Latin lentus, slow). By the time clinical signs become evident in a flock, the disease will have become well-established, with perhaps 30-70% of the flock infected. That is why MV is called one of the iceberg diseases of sheep – for every obviously affected individual, there are many others infected, but without apparent symptoms. |
A large body of research into the pathology, immunology and molecular biology of small ruminant lentiviruses (SRLV) exists, as might be expected given its economic significance, welfare implications and its interest as a model for HIV. The main route of transmission of the virus is thought to be horizontal, via exhaled droplets of the highly infectious fluid from deep in the lungs of infected animals, suggesting a risk from prolonged close contact, for example in a sheep shed. But despite all the research into disease mechanisms, we were surprised to find that there has been almost no quantitative analysis of SRLV epidemiology, nor even an estimation of the rate of SRLV transmission under any management regime. So, our first foray into the data aimed to rectify this
We found an experiment published in 1987 with excellent detail on a five-year timecourse of seroconversions in a small infected sheep flock, and a further trawl of the Internet netted a PhD thesis that built on this with a focussed experiment. Karianne Lievaart-Peterson, its author, runs the Dutch sheep health scheme and became a collaborator. We also worked with Tom McNeilly, an immunologist at the Moredun Research Institute.
Nick Savill, a mathematical epidemiologist at Edinburgh University, did the hard work of developing and parameterising a mathematical model based on infectious disease epidemiology and a priori known and unknown aspects of SRLV biology. The model determines the probability of a susceptible ewe seroconverting when it did, and of a susceptible ewe not seroconverting before it was removed from the flock or the experiment ended. The product of these probabilities gives the likelihood of the data given the model. The model was prototyped in Python and then written in C for speed.
The striking result of this research is that MV is a disease of housing. Even brief periods of housing allow the virus to spread rapidly, but transmission is negligible between sheep kept on pasture So, although individual sheep never recover from the disease, it could be eliminated from flocks over time by exploiting the fact that transmission of the virus is too slow between grazing sheep to sustain the disease.
Our second striking result suggests the disease is unlikely to be spread by newly-infected animals, contrary to general expectation. We estimated that the time between an animal being infected and becoming infectious is about a year. This delay, termed epidemiological latency, is actually longer than the estimated time delay between infection and seroconversion.
We can now begin to see more clearly how disease processes occurring in the infected individual shape what happens at the flock, or epidemiological, level. It seems that, after a sheep becomes infected, the disease slowly progresses to the point when there is sufficient free virus to be recognised by the immune system, but then further development of inflammatory lesions in the lungs has to take place before there are sufficient quantities of infective alveolar macrophages and free virus for transmission by the respiratory route. There follows a further delay, perhaps of some years, before the disease has advanced to the stage at which symptoms such as chronic pneumonia and hardening of the udder become apparent.
Infectiousness is expected to be a function of viral load, and although we do not know the timecourse of viral load, it seems most likely that it continues to increase throughout the development of chronic disease. This suggests to us that the infectiousness of an individual is not constant, but is likely to increase as the disease progresses and symptoms emerge.
We are interested in learning how infectiousness changes over the course of an individual’s infection because of the implications at the epidemiological level. Time delays in seroconversion merely make the disease more difficult to detect and control, but the epidemiological significance of a time delay in the development of infectiousness is that it acts to slow the spread of the virus. And if ewes with long-standing infections are the most infectious, they pose the greatest risk to uninfected sheep. This would present an opportunity for the management of exposure to slow the spread of disease. For example, if ewes in their last year of production are the most infectious, then young ewes should be kept away from them when housed – an idea supported by preliminary analysis using individual-based modelling (IBM – see Box 2). Separation of younger animals from older ones may reduce the prevalence of infection to the point where the costs of disease, in terms of lost production and poor welfare, are not overwhelming or at least are less than the cost of attempting to eliminate the disease – we discuss this later in this blog.
So far, there is only very limited and tentative evidence of increasing infectiousness in the literature, and direct experimental evidence would be very hard to come by. But it is plausible that disease severity, viral load and impaired productivity are all related to the extent of inflammatory lesions in the lungs. This suggests that measurably-impaired productivity in infected sheep could be used as a proxy for viral load, and hence infectiousness. And that brings us to our current project.
2 Individual-based modellingThis is a technique to explore the consequences of probabilistic events, such as becoming infected by SRLV. The flock of ewes is modelled as individuals, and their progress through life is followed. Flock replacements are taken from the ewe lambs born to the flock; all other lambs being sold. The figure shows the mean results (green line) of 1000 iterations of a stochastic simulation of SRLV prevalence in a flock of 400 ewes housed in groups of 100 for one month per year. The probability that an infected ewe will transmit the virus is modelled as rising exponentially with time since infection. The management regime we modelled was to segregate ewes during housing into each of their four age groups (2, 3, 4 and 5 years old) in separate pens, and to sell all the lambs of the oldest ewes, rather than retain any as flock replacements. From an initial starting prevalence of 275 infected ewes, the virus is virtually eliminated from the flock. |
Eliminating SRLV from an infected flock involves either repeated testing of the whole flock, culling reactors and perhaps also artificially rearing lambs removed at birth, or entirely replacing the flock with accredited disease-free stock. So, the cost of eliminating the virus from a flock can be huge. But what about the costs of living with it? These costs arise from poor welfare leading to lost production: lactation failure, reduced lamb growth and excess lamb and ewe mortality. But under what conditions are they so substantial as to warrant an elimination strategy? That depends, again, on answers at two levels: what are the production losses for each infected ewe, and how prevalent is the disease in the flock?
We have a number of reasons to want to quantify how the costs of being SRLV+ vary over the time-course of the disease. First, it is reasonable to assume that production losses will be related to the emergence of symptoms in affected sheep, but this has never been adequately quantified. Second, if production losses are a function of the duration of infection, and we can regard them as a proxy for viral load, then it would support the idea that infectiousness also increases as the disease progresses. And third, if production losses are only apparent in sheep with long-standing infections, which is therefore restricted to older sheep, then management could focus on early detection of symptoms and culling of older ewes.
We are quantifying these processes using a large dataset from colleagues at the Lublin University of Life Sciences. Their six-year study was designed to assess the response of production parameters to SRLV infection in a flock of breeding sheep kept under standard Polish husbandry conditions. They published results suggesting that infection with SRLV was associated with higher rates of age-specific ewe mortality, especially in older ewes.
The data comprise lambing records for about 800 ewes from three breeds, with over 300 ewes being present each year and a few being present every year. There are also records from about 2800 lambs born during the trial. Ewes were blood-tested in November and June each year, and all SRLV+ ewes were housed together following the November test until the lambs were weaned in April. SRLV- ewes were housed in the same shed, but segregated from the SRLV+ group. We were able to group the ewes on the basis of the series of blood test results as: (1) seroconverted before the trial began, (2) had not seroconverted by the end, and (3) seroconverted during the trial and for whom a time since seroconversion can be estimated to within about six months.
Given the nature of the data - unbalanced design, multiple observations from individual ewes and rams over several years, and different breeds – we used Genstat to fit mixed models to distinguish random and fixed effects. We were given access to Genstat release 22 beta, which adds greater functionality for displaying and saving output, producing predictions and visualising the fit of the model.
The example below addresses pre-weaning lamb mortality (mort, either 0 or 1). We are using a generalized linear mixed model where significant fixed terms were added stepwise. The ewes and rams used to produce these lambs are obvious random terms because they can be regarded as being drawn at random from a large population. There also appears to be a strong ewe.ram interaction, with some combinations faring differently from others. We included ‘year’ as a random term because, over the six years in which data were collected, factors such as flock size and forage quality varied somewhat randomly.
The fixed terms show that the probability of mortality is strongly affected by lamb birthweight (lambbirthwt). A quadratic term (lb2) models the well-known reduction in lamb survival in very large lambs - a consequence of birth difficulties. The age of the ewe, fitted as a factor (eweageF), is the next most significant fixed effect, followed by the SRLV status of the ewe tested in November prior to lambing (ewetestNov). The interaction term of ewe age with SRLV status is highly significant, showing that the way the ageing process in ewes affects the probability of their lambs’ mortality differs according to SRLV status. From the table of back-transformed means, we see that the probability of lamb mortality ranges between about 0.02 to 0.04 in SRLV- ewes aged from 2 to 5 years, perhaps declining in older ewes. SRLV+ ewes show similar lamb mortality in ages 2-4, but a progressive increase as ewes age further, almost doubling each year.
This preliminary analysis provides some evidence that the costs of being infected by SRLV are, indeed, progressive with age. There is some way to go yet to show whether sheep with longer-standing SRLV infection have higher viral loads and are more infectious, but our current research does point to a way to potential better disease control by targeting older animals. Maedi-visna isn’t called a progressive disease for anything, and we should be able to exploit that.
We finally submitted our paper for publication in November 2019, just before the Covid 19 pandemic. One might have thought that a paper on the epidemiology and control of a respiratory virus spread by aerosol, especially indoors in close proximity and with a recommendation for social distancing, would have seemed quite topical. Ironically, early 2020 saw the subeditors and reviewers of such work all being urgently re-allocated to analysing information on the burgeoning pandemic. But we got there eventually and by October we were proudly writing a press release recommending social distancing … in sheep.
Andrew Illius writes, “My experience of Genstat dates back to the early 1980s when I think Release 3 was current. It was hard going, and we queued up to have our fault codes diagnosed at the Genstat Clinic. But having learnt Fortran programming in the punched cards era, I was used to it taking several days to get a job to run. Genstat’s exacting requirements were reassuring and it became indispensable over the following years of agricultural and ecological research. By the 1990s we had learnt that mixed models were required to account separately for random and fixed effects in unbalanced data, and I’d been on a REML course. I was especially proud to use REML as my main analytical procedure thereafter because Robin Thompson invented it just down the corridor where we work in the Zoology building at Edinburgh University, and where he worked with the Animal Breeding group. It’s been a tremendous pleasure to get back to Genstat recently after many years away – like greeting an old friend. In the past, I wrote and submitted batch jobs on a UNIX mainframe before collecting some line-printer output on my way home. Now things have really speeded up, with the menu-driven environment of the Windows version. It’s a fantastic improvement, and a pleasure to use.”
Andrew Illius is Emeritus Prof of Animal Ecology in the Institute of Evolutionary Biology, University of Edinburgh, where he taught animal production and animal ecology from 1978 to 2008 and was latterly Head of the School of Biological Sciences. Most of his work has been on the ecology and management of grazing systems and the ecophysiology and behaviour of grazing animals. He retired in 2008 to spend more time with his sheep, keeping about 400 breeding ewes. Familiarity with sheep diseases led to collaboration with Nick Savill since 2018 on the epidemiology and control of MV.
Nick Savill is a Senior Lecturer at the Institute of Immunology and Infection Research, University of Edinburgh. He teaches a range of quantitative skills to undergraduate biological science students including maths, stats, data analysis and coding. His research interests are in mathematical modelling of infectious disease epidemiology. He has worked on foot and mouth disease, avian influenza, malaria, trypanosomiasis and, most recently, maedi-visna with Andrew Illius.
Illius AW, Lievaart-Peterson K, McNeilly TN, Savill NJ (2020) Epidemiology and control of maedi-visna virus: Curing the flock. PLoS ONE 15 (9): e0238781. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238781
Kanchana Punyawaew
01 March 2021Linear mixed models: a balanced lattice square
This blog illustrates how to analyze data from a field experiment with a balanced lattice square design using linear mixed models. We’ll consider two models: the balanced lattice square model and a spatial model.
The example data are from a field experiment conducted at Slate Hall Farm, UK, in 1976 (Gilmour et al., 1995). The experiment was set up to compare the performance of 25 varieties of barley and was designed as a balanced lattice square with six replicates laid out in a 10 x 15 rectangular grid. Each replicate contained exactly one plot for every variety. The variety grown in each plot, and the coding of the replicates and lattice blocks, is shown in the field layout below:
There are seven columns in the data frame: five blocking factors (Rep, RowRep, ColRep, Row, Column), one treatment factor, Variety, and the response variate, yield.
The six replicates are numbered from 1 to 6 (Rep). The lattice block numbering is coded within replicates. That is, within each replicates the lattice rows (RowRep) and lattice columns (ColRep) are both numbered from 1 to 5. The Row and Column factors define the row and column positions within the field (rather than within each replicate).
To analyze the response variable, yield, we need to identify the two basic components of the experiment: the treatment structure and the blocking (or design) structure. The treatment structure consists of the set of treatments, or treatment combinations, selected to study or to compare. In our example, there is one treatment factor with 25 levels, Variety (i.e. the 25 different varieties of barley). The blocking structure of replicates (Rep), lattice rows within replicates (Rep:RowRep), and lattice columns within replicates (Rep:ColRep) reflects the balanced lattice square design. In a mixed model analysis, the treatment factors are (usually) fitted as fixed effects and the blocking factors as random.
The balanced lattice square model is fitted in ASReml-R4 using the following code:
> lattice.asr <- asreml(fixed = yield ~ Variety,
random = ~ Rep + Rep:RowRep + Rep:ColRep,
data=data1)
The REML log-likelihood is -707.786.
The model’s BIC is:
The estimated variance components are:
The table above contains the estimated variance components for all terms in the random model. The variance component measures the inherent variability of the term, over and above the variability of the sub-units of which it is composed. The variance components for Rep, Rep:RowRep and Rep:ColRep are estimated as 4263, 15596, and 14813, respectively. As is typical, the largest unit (replicate) is more variable than its sub-units (lattice rows and columns within replicates). The "units!R" component is the residual variance.
By default, fixed effects in ASReml-R4 are tested using sequential Wald tests:
In this example, there are two terms in the summary table: the overall mean, (Intercept), and Variety. As the tests are sequential, the effect of the Variety is assessed by calculating the change in sums of squares between the two models (Intercept)+Variety and (Intercept). The p-value (Pr(Chisq)) of < 2.2 x 10-16 indicates that Variety is a highly significant.
The predicted means for the Variety can be obtained using the predict() function. The standard error of the difference between any pair of variety means is 62. Note: all variety means have the same standard error as the design is balanced.
Note: the same analysis is obtained when the random model is redefined as replicates (Rep), rows within replicates (Rep:Row) and columns within replicates (Rep:Column).
As the plots are laid out in a grid, the data can also be analyzed using a spatial model. We’ll illustrate spatial analysis by fitting a model with a separable first order autoregressive process in the field row (Row) and field column (Column) directions. This is often a useful model to start the spatial modeling process.
The separable first order autoregressive spatial model is fitted in ASReml-R4 using the following code:
> spatial.asr <- asreml(fixed = yield ~ Variety,
residual = ~ar1(Row):ar1(Column),
data = data1)
The BIC for this spatial model is:
The estimated variance components and sequential Wald tests are:
The residual variance is 38713, the estimated row correlation is 0.458, and the estimated column correlation is 0.684. As for the balanced lattice square model, there is strong evidence of a Variety effect (p-value < 2.2 x 10-16).
A log-likelihood ratio test cannot be used to compare the balanced lattice square model with the spatial models, as the variance models are not nested. However, the two models can be compared using BIC. As the spatial model has a smaller BIC (1415) than the balanced lattice square model (1435), of the two models explored in this blog, it is chosen as the preferred model. However, selecting the optimal spatial model can be difficult. The current spatial model can be extended by including measurement error (or nugget effect) or revised by selecting a different variance model for the spatial effects.
Butler, D.G., Cullis, B.R., Gilmour, A. R., Gogel, B.G. and Thompson, R. (2017). ASReml-R Reference Manual Version 4. VSN International Ltd, Hemel Hempstead, HP2 4TP UK.
Gilmour, A.R., Thompson, R. & Cullis, B.R. (1995). Average Information REML, an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51, 1440-1450.