Dr. Vanessa Cave
27 July 2022

Don’t add more data to make your results significant!
Recently I was reviewing a scientific publication and a certain segment in the Methods section caused me to sweat. In brief, the authors had performed an analysis on a data set with n observations. After discovering an interesting but non-significant effect, they had added a further m observations to their data set and re-analysed it to “increase the power”.
Before I delve into why adding more data to achieve significance is an issue, I would first like to commend the authors on providing a complete description of their methods, and not glossing over important details such as the one above. Only by reporting our methodology accurately and completely can our research be properly scrutinized and independently reproduced, and our findings appropriately interpreted.
So, what’s the problem with adding more data to achieve statistical significance?
Well… in doing so you increase the probability of obtaining a significant effect by random chance. In other words, you increase the chance of a type I error (i.e., rejecting the null hypothesis when in fact it is true).
At the nub of the issue is the decision to add more data based on a known outcome. In particular, the addition of the m extra observations is conditional on the results from the analysis of the first n observations. You can read more about this in Professor James Curran’s excellent StatsChat blog entitled “Just one more…”. This blog illustrates the issue very nicely using a coin toss example and a horse racing analogy.
Say we’re interested in comparing two means, μ₁ and μ₂. We construct an appropriate statistical test based on the following null (H₀) and alternative (H₁) hypotheses:

H₀: μ₁ = μ₂
H₁: μ₁ ≠ μ₂
and assess the strength of evidence against the null hypothesis, as provided by our observed data, using a p-value. For simplicity, let’s assume a significance level (or type I error rate) of 5%. Our p-value is either less than 5%, in which case we reject the null hypothesis, or at least 5%, in which case we fail to reject it.
Under such a testing scenario, the probability that we reject the null hypothesis when in fact it is true is 5%. That is, there is a 5% probability of accepting the alternative hypothesis when the results can be attributed to chance.
As the authors I mention above have done, let’s “change the rules” of our test by adding some more data if the difference in means is promisingly large (or analogously, the p-value is small, but not quite small enough), and then having another go at rejecting the null hypothesis. Crucially, in doing so the probability that we erroneously reject the null hypothesis is no longer 5%; in fact, it is larger! Intuitively, we can think of this as accumulating debt from our earlier test. Regardless of what happens next, there was a 5% probability of rejecting the null hypothesis by chance in the first analysis. And conversely, a 95% probability that the null hypothesis wasn’t rejected, given that it is true. Adding more data and re-analysing doesn’t change these probabilities; we have already incurred a 5% type I error rate from the first analysis, and this debt can’t be forgotten.
It’s perhaps easy to appreciate why sequentially adding more data until we finally obtain a data set that allows us to reject the null hypothesis is such a bad thing to do. Why adding just a little more data to “give our analysis more power” is also wrong can be harder to wrap our heads around. But ultimately, it boils down to adding more data because we haven’t got the result we want. Hardly a fair approach!
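The inflation of the type I error rate is easy to demonstrate by simulation. The Python sketch below (assuming scipy is available) generates data under a true null hypothesis, then “adds more data” whenever the first p-value looks promising; the 0.30 “promising” threshold and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def experiment(n=30, m=30, peek=True):
    # Both samples come from the same distribution, so H0 is true.
    a, b = rng.normal(size=n), rng.normal(size=n)
    p = stats.ttest_ind(a, b).pvalue
    if p < 0.05:
        return True            # rejected at the first analysis
    if peek and p < 0.30:      # "promising": collect m more observations
        a = np.concatenate([a, rng.normal(size=m)])
        b = np.concatenate([b, rng.normal(size=m)])
        return stats.ttest_ind(a, b).pvalue < 0.05
    return False

reps = 10000
fixed = sum(experiment(peek=False) for _ in range(reps)) / reps
peeked = sum(experiment(peek=True) for _ in range(reps)) / reps
print(f"Type I error, fixed n:     {fixed:.3f}")   # close to 0.05
print(f"Type I error, adding data: {peeked:.3f}")  # clearly above 0.05
```

Even though every data point is pure noise, the “add more data if promising” rule rejects the null hypothesis noticeably more often than the nominal 5%.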
Using the information from the initial analysis, the authors can conduct a power analysis to determine the sample size required to detect the effect of interest. Based on this sample size, the authors should then collect independent data (i.e., perform another experiment) and use this new data to test their hypothesis independently of the first data set.
Alternatively, the authors could have designed a priori a sequential analysis trial. Such an approach involves adjusting the significance level at each round of testing so that the overall type I error rate remains at the desired level. For example, for an overall type I error rate of 5% and a maximum of two testing rounds, each test should be conducted at the 2.94% significance level (according to Pocock’s boundary). Sequential analysis is popular in clinical trials where, for ethical reasons, researchers want the option of stopping the trial early if the results from an initial group of patients provide the evidence they seek.
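A quick simulation illustrates why the adjusted level works (a sketch assuming scipy; the per-stage sample size of 50 is arbitrary): testing twice at the 2.94% level, with the second test on the accumulated data, holds the overall type I error rate near 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def two_stage_reject(alpha, n=50):
    # Stage 1: test after n observations per group (H0 is true here).
    a, b = rng.normal(size=n), rng.normal(size=n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        return True
    # Stage 2: add n more observations per group and test the full data.
    a = np.concatenate([a, rng.normal(size=n)])
    b = np.concatenate([b, rng.normal(size=n)])
    return stats.ttest_ind(a, b).pvalue < alpha

reps = 10000
overall = sum(two_stage_reject(0.0294) for _ in range(reps)) / reps
print(f"Overall type I error with two looks at 2.94%: {overall:.3f}")
```

Running the same two-stage design with both looks at the naive 5% level would, by contrast, push the overall error rate well above 5%.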
Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems. Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.
Dr. Andrew Illius and Dr. Nick Savill
20 July 2022

Quantification holds the key to controlling disease
Background
Since 2018, Andrew Illius and Nick Savill have studied the epidemiology and control of maedi-visna virus (MV) in sheep, an incurable disease. Using published data and Genstat, they have been seeking ways to understand and control MV.
When one of your sheep gets diagnosed with an incurable disease, you have to worry. How quickly will symptoms develop, and welfare and productivity suffer? And how soon will it spread throughout the rest of the flock? The disease in question is maedi-visna (MV, see Box 1), notorious for its impact in Iceland, where the disease was first described; extreme measures over 20 years were required before it was finally eliminated. Culling seropositive animals is the main means of control. For the farmer, the crucial question is whether living with the disease would be more expensive than trying to eradicate it. We are addressing such questions by analysing data from long-term experiments.
Box 1: MV – the tip of an iceberg?

Putting aside for a moment MV’s fearsome reputation, the way the pathogen works is fascinating. The small ruminant lentiviruses (SRLV, family Retroviridae) are recognised as a heterogeneous group of viruses that infect sheep, goats and wild ruminants. Lentiviruses target the immune system, but SRLV does not target T-cells in the manner of immune deficiency lentiviruses such as HIV. Instead, SRLV infects monocytes (a type of white blood cell) which infiltrate the interstitial spaces of target organs (such as the lungs, mammary glands, or the synovial tissue of joints) carrying proviral DNA integrated into the host cell genome and hence invisible to the immune system. Virus replication commences following maturation of monocytes into macrophages, and the ensuing immune response eventually shows up as circulating antibodies (termed seroconversion). But it also causes inflammation that attracts further macrophages, slowly and progressively building into chronic inflammatory lesions and gross pathology. These take years to present clinical symptoms, hence the name lentivirus (from the Latin lentus, slow). By the time clinical signs become evident in a flock, the disease will have become well-established, with perhaps 30-70% of the flock infected. That is why MV is called one of the iceberg diseases of sheep – for every obviously affected individual, there are many others infected, but without apparent symptoms.
A large body of research into the pathology, immunology and molecular biology of small ruminant lentiviruses (SRLV) exists, as might be expected given its economic significance, welfare implications and its interest as a model for HIV. The main route of transmission of the virus is thought to be horizontal, via exhaled droplets of the highly infectious fluid from deep in the lungs of infected animals, suggesting a risk from prolonged close contact, for example in a sheep shed. But despite all the research into disease mechanisms, we were surprised to find that there has been almost no quantitative analysis of SRLV epidemiology, nor even an estimation of the rate of SRLV transmission under any management regime. So, our first foray into the data aimed to rectify this.
We found an experiment published in 1987 with excellent detail on a five-year timecourse of seroconversions in a small infected sheep flock, and a further trawl of the Internet netted a PhD thesis that built on this with a focussed experiment. Karianne Lievaart-Peterson, its author, runs the Dutch sheep health scheme and became a collaborator. We also worked with Tom McNeilly, an immunologist at the Moredun Research Institute.
Nick Savill, a mathematical epidemiologist at Edinburgh University, did the hard work of developing and parameterising a mathematical model based on infectious disease epidemiology and a priori known and unknown aspects of SRLV biology. The model determines the probability of a susceptible ewe seroconverting when it did, and of a susceptible ewe not seroconverting before it was removed from the flock or the experiment ended. The product of these probabilities gives the likelihood of the data given the model. The model was prototyped in Python and then written in C for speed.
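The published model itself is considerably more elaborate, but the likelihood principle it rests on can be sketched simply. Assuming, purely for illustration, a constant seroconversion hazard β, the likelihood multiplies a density term for each ewe that seroconverted when it did with a survival term for each ewe removed before seroconverting (all times below are hypothetical):

```python
import numpy as np

def log_likelihood(beta, event_times, censor_times):
    """Constant-hazard sketch: density beta*exp(-beta*t) for observed
    seroconversions; survival exp(-beta*t) for right-censored ewes."""
    event_times = np.asarray(event_times, float)
    censor_times = np.asarray(censor_times, float)
    ll = np.sum(np.log(beta) - beta * event_times)  # ewes that seroconverted
    ll += np.sum(-beta * censor_times)              # removed before seroconverting
    return ll

# hypothetical times (years) from exposure to seroconversion / removal
events, censored = [0.8, 1.3, 2.1], [3.0, 3.0]
betas = np.linspace(0.01, 2.0, 2000)
beta_hat = betas[np.argmax([log_likelihood(b, events, censored) for b in betas])]
print(f"ML estimate of the hazard: {beta_hat:.2f} per year")
```

Maximising this likelihood over the unknown parameters is exactly what the C implementation does, only with a far richer hazard model reflecting housing periods and latency.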
The striking result of this research is that MV is a disease of housing. Even brief periods of housing allow the virus to spread rapidly, but transmission is negligible between sheep kept on pasture. So, although individual sheep never recover from the disease, it could be eliminated from flocks over time by exploiting the fact that transmission of the virus between grazing sheep is too slow to sustain the disease.
Our second striking result suggests the disease is unlikely to be spread by newly-infected animals, contrary to general expectation. We estimated that the time between an animal being infected and becoming infectious is about a year. This delay, termed epidemiological latency, is actually longer than the estimated time delay between infection and seroconversion.
We can now begin to see more clearly how disease processes occurring in the infected individual shape what happens at the flock, or epidemiological, level. It seems that, after a sheep becomes infected, the disease slowly progresses to the point when there is sufficient free virus to be recognised by the immune system, but then further development of inflammatory lesions in the lungs has to take place before there are sufficient quantities of infective alveolar macrophages and free virus for transmission by the respiratory route. There follows a further delay, perhaps of some years, before the disease has advanced to the stage at which symptoms such as chronic pneumonia and hardening of the udder become apparent.
Infectiousness is expected to be a function of viral load, and although we do not know the timecourse of viral load, it seems most likely that it continues to increase throughout the development of chronic disease. This suggests to us that the infectiousness of an individual is not constant, but is likely to increase as the disease progresses and symptoms emerge.
We are interested in learning how infectiousness changes over the course of an individual’s infection because of the implications at the epidemiological level. Time delays in seroconversion merely make the disease more difficult to detect and control, but the epidemiological significance of a time delay in the development of infectiousness is that it acts to slow the spread of the virus. And if ewes with long-standing infections are the most infectious, they pose the greatest risk to uninfected sheep. This would present an opportunity for the management of exposure to slow the spread of disease. For example, if ewes in their last year of production are the most infectious, then young ewes should be kept away from them when housed – an idea supported by preliminary analysis using individual-based modelling (IBM – see Box 2). Separation of younger animals from older ones may reduce the prevalence of infection to the point where the costs of disease, in terms of lost production and poor welfare, are not overwhelming or at least are less than the cost of attempting to eliminate the disease – we discuss this later in this blog.
So far, there is only very limited and tentative evidence of increasing infectiousness in the literature, and direct experimental evidence would be very hard to come by. But it is plausible that disease severity, viral load and impaired productivity are all related to the extent of inflammatory lesions in the lungs. This suggests that measurably-impaired productivity in infected sheep could be used as a proxy for viral load, and hence infectiousness. And that brings us to our current project.
Box 2: Individual-based modelling

This is a technique to explore the consequences of probabilistic events, such as becoming infected by SRLV. The flock of ewes is modelled as individuals, and their progress through life is followed. Flock replacements are taken from the ewe lambs born to the flock, with all other lambs being sold. The figure shows the mean results (green line) of 1000 iterations of a stochastic simulation of SRLV prevalence in a flock of 400 ewes housed in groups of 100 for one month per year. The probability that an infected ewe will transmit the virus is modelled as rising exponentially with time since infection. The management regime we modelled was to segregate ewes during housing into each of their four age groups (2, 3, 4 and 5 years old) in separate pens, and to sell all the lambs of the oldest ewes, rather than retain any as flock replacements. From an initial prevalence of 275 infected ewes, the virus is virtually eliminated from the flock.
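A minimal individual-based simulation in the same spirit might look like the sketch below. All parameter values are assumptions for illustration, not those used in the study: infectiousness rises exponentially with time since infection, transmission occurs only during the annual housing period, and ewes are culled at five years of age and replaced with uninfected home-bred lambs.

```python
import numpy as np

rng = np.random.default_rng(7)

N = 400       # flock size (as in Box 2)
YEARS = 15
BETA = 0.02   # transmission scale per housing period (assumed)
RISE = 0.5    # exponential rise in infectiousness per year infected (assumed)

def simulate(init_infected=275):
    age = rng.integers(2, 6, N).astype(float)   # ewes aged 2-5 years
    t_inf = np.full(N, -1.0)                    # -1 means susceptible
    t_inf[:init_infected] = rng.uniform(0.0, 3.0, init_infected)
    prevalence = []
    for _ in range(YEARS):
        infected = t_inf >= 0
        # force of infection during housing, summed over infected ewes
        foi = BETA * np.sum(np.exp(RISE * t_inf[infected]))
        p_infect = 1.0 - np.exp(-foi / N)
        newly = ~infected & (rng.random(N) < p_infect)
        t_inf[newly] = 0.0
        t_inf[t_inf >= 0] += 1.0                # infections age by a year
        age += 1.0
        old = age > 5                           # cull at age 5; replace with
        age[old], t_inf[old] = 2.0, -1.0        # uninfected home-bred lambs
        prevalence.append(np.mean(t_inf >= 0))
    return prevalence

prev = simulate()
print(f"prevalence: start 275/400, after {YEARS} years {prev[-1]:.2f}")
```

Because long-infected (and hence most infectious) ewes are removed before their infectiousness peaks, prevalence declines steadily, echoing the elimination result shown in the figure.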
Eliminating SRLV from an infected flock involves either repeated testing of the whole flock, culling reactors and perhaps also artificially rearing lambs removed at birth, or entirely replacing the flock with accredited disease-free stock. So, the cost of eliminating the virus from a flock can be huge. But what about the costs of living with it? These costs arise from poor welfare leading to lost production: lactation failure, reduced lamb growth and excess lamb and ewe mortality. But under what conditions are they so substantial as to warrant an elimination strategy? That depends, again, on answers at two levels: what are the production losses for each infected ewe, and how prevalent is the disease in the flock?
We have a number of reasons to want to quantify how the costs of being SRLV+ vary over the time-course of the disease. First, it is reasonable to assume that production losses will be related to the emergence of symptoms in affected sheep, but this has never been adequately quantified. Second, if production losses are a function of the duration of infection, and we can regard them as a proxy for viral load, then it would support the idea that infectiousness also increases as the disease progresses. And third, if production losses are only apparent in sheep with long-standing infections, which is therefore restricted to older sheep, then management could focus on early detection of symptoms and culling of older ewes.
We are quantifying these processes using a large dataset from colleagues at the Lublin University of Life Sciences. Their six-year study was designed to assess the response of production parameters to SRLV infection in a flock of breeding sheep kept under standard Polish husbandry conditions. They published results suggesting that infection with SRLV was associated with higher rates of age-specific ewe mortality, especially in older ewes.
The data comprise lambing records for about 800 ewes from three breeds, with over 300 ewes being present each year and a few being present every year. There are also records from about 2800 lambs born during the trial. Ewes were blood-tested in November and June each year, and all SRLV+ ewes were housed together following the November test until the lambs were weaned in April. SRLV- ewes were housed in the same shed, but segregated from the SRLV+ group. We were able to group the ewes on the basis of the series of blood test results as: (1) seroconverted before the trial began, (2) had not seroconverted by the end, and (3) seroconverted during the trial and for whom a time since seroconversion can be estimated to within about six months.
Given the nature of the data - unbalanced design, multiple observations from individual ewes and rams over several years, and different breeds – we used Genstat to fit mixed models to distinguish random and fixed effects. We were given access to Genstat release 22 beta, which adds greater functionality for displaying and saving output, producing predictions and visualising the fit of the model.
The example below addresses pre-weaning lamb mortality (mort, either 0 or 1). We are using a generalized linear mixed model where significant fixed terms were added stepwise. The ewes and rams used to produce these lambs are obvious random terms because they can be regarded as being drawn at random from a large population. There also appears to be a strong ewe.ram interaction, with some combinations faring differently from others. We included ‘year’ as a random term because, over the six years in which data were collected, factors such as flock size and forage quality varied somewhat randomly.
The fixed terms show that the probability of mortality is strongly affected by lamb birthweight (lambbirthwt). A quadratic term (lb2) models the well-known reduction in lamb survival in very large lambs - a consequence of birth difficulties. The age of the ewe, fitted as a factor (eweageF), is the next most significant fixed effect, followed by the SRLV status of the ewe tested in November prior to lambing (ewetestNov). The interaction term of ewe age with SRLV status is highly significant, showing that the way the ageing process in ewes affects the probability of their lambs’ mortality differs according to SRLV status. From the table of back-transformed means, we see that the probability of lamb mortality ranges from about 0.02 to 0.04 in SRLV- ewes aged from 2 to 5 years, perhaps declining in older ewes. SRLV+ ewes show similar lamb mortality at ages 2-4, but a progressive increase as ewes age further, almost doubling each year.
This preliminary analysis provides some evidence that the costs of being infected by SRLV are, indeed, progressive with age. There is some way to go yet to show whether sheep with longer-standing SRLV infection have higher viral loads and are more infectious, but our current research does point a way to potentially better disease control by targeting older animals. Maedi-visna isn’t called a progressive disease for nothing, and we should be able to exploit that.
We finally submitted our paper for publication in November 2019, just before the Covid-19 pandemic. One might have thought that a paper on the epidemiology and control of a respiratory virus spread by aerosol, especially indoors in close proximity and with a recommendation for social distancing, would have seemed quite topical. Ironically, early 2020 saw the subeditors and reviewers of such work all being urgently re-allocated to analysing information on the burgeoning pandemic. But we got there eventually and by October we were proudly writing a press release recommending social distancing … in sheep.
Andrew Illius writes, “My experience of Genstat dates back to the early 1980s when I think Release 3 was current. It was hard going, and we queued up to have our fault codes diagnosed at the Genstat Clinic. But having learnt Fortran programming in the punched cards era, I was used to it taking several days to get a job to run. Genstat’s exacting requirements were reassuring and it became indispensable over the following years of agricultural and ecological research. By the 1990s we had learnt that mixed models were required to account separately for random and fixed effects in unbalanced data, and I’d been on a REML course. I was especially proud to use REML as my main analytical procedure thereafter because Robin Thompson invented it just down the corridor where we work in the Zoology building at Edinburgh University, and where he worked with the Animal Breeding group. It’s been a tremendous pleasure to get back to Genstat recently after many years away – like greeting an old friend. In the past, I wrote and submitted batch jobs on a UNIX mainframe before collecting some line-printer output on my way home. Now things have really speeded up, with the menu-driven environment of the Windows version. It’s a fantastic improvement, and a pleasure to use.”
Andrew Illius is Emeritus Professor of Animal Ecology in the Institute of Evolutionary Biology, University of Edinburgh, where he taught animal production and animal ecology from 1978 to 2008 and was latterly Head of the School of Biological Sciences. Most of his work has been on the ecology and management of grazing systems and the ecophysiology and behaviour of grazing animals. He retired in 2008 to spend more time with his sheep, keeping about 400 breeding ewes. Familiarity with sheep diseases led to collaboration with Nick Savill since 2018 on the epidemiology and control of MV.
Nick Savill is a Senior Lecturer at the Institute of Immunology and Infection Research, University of Edinburgh. He teaches a range of quantitative skills to undergraduate biological science students including maths, stats, data analysis and coding. His research interests are in mathematical modelling of infectious disease epidemiology. He has worked on foot and mouth disease, avian influenza, malaria, trypanosomiasis and, most recently, maedi-visna with Andrew Illius.
Illius AW, Lievaart-Peterson K, McNeilly TN, Savill NJ (2020) Epidemiology and control of maedi-visna virus: Curing the flock. PLoS ONE 15 (9): e0238781. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238781
Dr. David Baird
13 July 2022

Genstat 22 – What’s new for you
Equivalence tests, Linear mixed models for censored data, permutation tests for fixed effects in a GLMM and greater functionality: Genstat 22 brings a wealth of new features and benefits.
Genstat 22 has made it easy to perform equivalence, non-inferiority and non-superiority tests: extremely useful tests when the aim of the study is to demonstrate that two treatments are effectively the same, or that one treatment is neither inferior nor superior to another. For example, in a medical trial the aim may be to prove that a new drug treatment is just as effective as the standard drug, or in the plant breeding context to show that a new cultivar is at least as resistant to disease as the standard industry variety.
Why are these tests so important? In the typical statistical hypothesis testing scenario, the null hypothesis is that of no effect, or of no difference between two treatment means. The p-value assesses the weight of evidence against the null. However, a frighteningly common mistake is to misinterpret a non-significant p-value as evidence that the null hypothesis is true, leading to incorrect and potentially dangerous conclusions. So how do we demonstrate that two treatment means are essentially the same? Genstat’s new, easy-to-use equivalence test facilities solve this problem. Similarly, Genstat’s new non-superiority test facilities enable us to demonstrate that one treatment is not superior to another, and the new non-inferiority test facilities that one treatment is not inferior to another.
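Genstat runs these tests through its menus; to show the logic, here is a hedged Python sketch of the two one-sided tests (TOST) procedure that underlies equivalence testing (scipy assumed; the ±0.5 margin, the data, and the simple pooled degrees of freedom are illustrative choices, not Genstat’s exact implementation):

```python
import numpy as np
from scipy import stats

def tost(a, b, margin):
    """Two one-sided tests for equivalence of two means: declare
    equivalence when the difference is significantly above -margin
    AND significantly below +margin."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    df = len(a) + len(b) - 2                         # simple pooled approximation
    p_lower = stats.t.sf((diff + margin) / se, df)   # H0: diff <= -margin
    p_upper = stats.t.cdf((diff - margin) / se, df)  # H0: diff >= +margin
    return diff, max(p_lower, p_upper)

rng = np.random.default_rng(0)
drug_a = rng.normal(10.0, 1.0, 200)  # hypothetical responses, standard drug
drug_b = rng.normal(10.1, 1.0, 200)  # hypothetical responses, new drug
diff, p = tost(drug_a, drug_b, margin=0.5)
print(f"difference = {diff:.3f}, TOST p-value = {p:.4f}")
```

A TOST p-value below the significance level indicates the two means are equivalent within the stated margin; note that a non-significant ordinary t-test would not justify the same conclusion.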
These tests are available in both the T-test and Analysis of Variance menus.
Watch the Equivalence test video.
Genstat provides everything users need to design experiments, analyse data, and generate insights - all in one easy-to-use and reliable statistical software tool. One of its many strengths is its powerful set of linear mixed models tools and its efficient REML algorithm for fitting these models. New to Genstat 22, users can now analyse censored data using a linear mixed model, and obtain well-behaved and reliable estimates of the treatment means. During data collection, censoring occurs when measurements cannot be taken above or below a bound. For example, in chemical assays where small amounts cannot be detected. Observations that fall below this minimum level of detection are therefore censored. In Genstat 22, users can now easily and quickly fit linear mixed models to such censored data.
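Genstat’s implementation sits inside its REML machinery, but the core idea of a censoring-aware likelihood can be sketched in a few lines (scipy assumed; the assay data and detection limit below are hypothetical):

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical assay: values below the limit of detection (LOD) are
# recorded at the LOD itself, i.e. left-censored.
LOD = 1.0
obs = np.array([2.3, 1.7, 1.0, 3.1, 1.0, 2.0, 1.4, 1.0, 2.6, 1.9])
censored = obs <= LOD

def neg_log_lik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # keeps sigma positive during optimisation
    ll = stats.norm.logpdf(obs[~censored], mu, sigma).sum()
    # a censored value contributes P(value <= LOD), not a density
    ll += censored.sum() * stats.norm.logcdf(LOD, mu, sigma)
    return -ll

res = optimize.minimize(neg_log_lik, x0=[obs.mean(), 0.0])
mu_hat = res.x[0]
print(f"censoring-aware mean: {mu_hat:.2f}  (naive mean: {obs.mean():.2f})")
```

Treating the censored values as if they were exact observations at the detection limit biases the mean upwards; the likelihood above corrects for this, which is the same principle the new mixed-model facility applies with random effects included.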
Watch the Linear mixed models with censoring video.
Genstat 22 offers a raft of new facilities for generalized linear mixed model (GLMM) analysis, including the ability to assess the importance of the fixed effects using permutation tests. A well-known problem with GLMMs is that the estimates of the variance components are generally smaller than the true values. The Wald tests for the fixed effects also suffer from this bias in that their p-values may be too small. Therefore, caution is needed when interpreting the Wald test p-values for fixed effects. The new GLMM permutation tests in Genstat 22 help overcome this problem by providing an alternative, more reliable, way to assess the fixed effects.
The Generalized linear mixed model (GLMM) menus have been updated to allow nearly all the same options as the Generalized linear model (GLM) menu. A GLMM extends the GLM for non-normal data to allow for multi-level variation, as in REML.
Watch the Generalized linear mixed model video.
The new Confidence region plot menu allows you to draw 2-D scatter plots with confidence, prediction and/or equal-frequency ellipses superimposed. This lets you examine the relationship between two variables together with its confidence regions.
Watch the Confidence region plot video.
Genstat has long been able to read in data from Excel using menus or the IMPORT command. However, cell colours could not be imported. The 22nd edition can now import the cell foreground and background colours for display in the Genstat spreadsheet. It can also import cell formulae, and formatting information such as colours, fonts and styles, as data. This may be useful where users highlight or format cells as a way of conveying supplementary information, such as the cell being an outlier.
Watch the Improved importation options for Excel files video.
The conventional way to assess fixed terms is to use either Wald or F tests. However, the Wald test generates significant results too frequently, leading to misleading conclusions, and whilst the F test is more reliable, the denominator degrees of freedom need to be estimated, and this is not always possible. Our new procedure, VPERMTEST, provides an alternative method for assessing fixed terms.
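VPERMTEST operates on the fitted mixed model, but the underlying logic of a permutation test is simple to illustrate. The Python sketch below applies it to a plain two-group comparison with made-up data: the observed effect is compared with its distribution under random relabelling of the groups, so no distributional assumptions or denominator degrees of freedom are needed.

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_pvalue(y, group, n_perm=5000):
    """Compare the observed difference in group means with its
    distribution under random relabelling of group membership."""
    y, group = np.asarray(y, float), np.asarray(group)
    observed = y[group == 1].mean() - y[group == 0].mean()
    hits = 0
    for _ in range(n_perm):
        g = rng.permutation(group)
        if abs(y[g == 1].mean() - y[g == 0].mean()) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one avoids a p-value of zero

control = rng.normal(0.0, 1.0, 40)
treated = rng.normal(1.2, 1.0, 40)   # a genuine treatment effect
y = np.concatenate([control, treated])
group = np.repeat([0, 1], 40)
p = permutation_pvalue(y, group)
print(f"permutation p-value: {p:.4f}")
```

In the mixed-model setting the permuted quantity and test statistic are more involved, but the principle of building the null distribution by permutation is the same.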
Watch the VPERMTEST video.
There are many more new features in Genstat 22 to help users achieve the most value from their data. Chosen by thousands of scientists and researchers across the globe, Genstat’s user-friendly menu interface provides all the standard analyses, guiding non-technical users through the correct and efficient use of statistics. And for more advanced users, Genstat offers a powerful programming language that can be used to perform complex analyses, implement a new or non-standard statistical technique, or automate a task.
For a complete description of all new features and enhancements please visit What’s new in Genstat 22nd edition.
If you would like to take a look into Genstat further or obtain a quote, please email us: info@vsni.co.uk
Dr. David Baird is an experienced biometrician with 40 years’ experience in applying statistics to agriculture, horticulture, biosecurity, entomology, ecology, plant breeding and finance. He has been an author of Genstat for 27 years developing the spreadsheet, statistical menus, client interface and graphics and has contributed over 80 Genstat procedures in areas of data manipulation, experimental design, probability distributions, model fitting, data mining and microarrays.
Dr. Salvador A. Gezan
05 July 2022

A binary phenotypic response in breeding: What do I do?
It is common for breeding analyses to have a phenotypic response that is not Normally distributed. For example, for each genotype we may record survival, or a count of flowers or eggs produced, and these will typically follow a Binomial or Poisson distribution, respectively. The issue, then, is how we deal with the quantitative genetics analysis of binary or count data, as our linear mixed model (LMM) framework is based on Normality assumptions and these responses are clearly non-Normal.
Most statistical software, including ASReml and Genstat, deal with LMMs under Normal data in a very efficient and flexible way, providing strong statistical inference. ASReml and Genstat, under an LMM, can deal with heterogeneous errors, correlated structures (in time and space) and even correlated random effects (as when we use a relationship matrix). However, under non-Normal responses this becomes increasingly difficult. An alternative is to fit a Generalised Linear Mixed Model (GLMM) or Hierarchical Generalised Linear Models (HGLM). These model types consider the underlying probability distribution and allow us to estimate variance components for relevant random effects. GLMMs are implemented in ASReml-SA and ASReml-R with an array of available distributions and link functions. GLMMs can incorporate pedigree, some level of imbalance, and complex variance structures. However, they are not free of complications; for example, they might not converge when complex correlated structures are needed, such as for multi-environment trial and multi-trait analyses, as they do not rely on a multi-variate Normal distribution.
Given the complexity of this data, what are our analytical options? In this blog we will talk about binary data and provide a few options for analysing it within the context of an operational breeding program. Some of the options are more practical (and often approximate), but they can provide the information you need, with reasonable accuracy, to select outstanding genotypes.
For the options below, we will consider that we have a dataset in which each of the n individuals was recorded for a binary response, that is, a 0/1 outcome for a given trait of interest. For example, whether they show resistance (1, success) or not (0, failure) to a given viral infection. Note that the dataset can have additional factors and covariates, but it will have a total of n observations.
There are several options to deal with this type of data to get exact or approximated results. Below we are going to list (and explain) some of the options we have found useful, with pros and cons. As you might expect, this is not exhaustive, but these options will give you a starting point at the very least.
1. Use Generalized Linear Mixed Models (GLMMs)
GLMMs are an appropriate statistical tool to use with this type of data. As indicated above, they will consider the underlying distribution and they will allow you to estimate the variance components of interest. However, one aspect to consider is that these models are more prone to convergence issues, and there is a high chance of getting stuck at a local optimum, as the algorithms optimise a more complex likelihood function than under REML for LMMs. This problem is reduced if you use simpler models with little or no imbalance, but they still require careful checking (including whether they make sense biologically).
Also, with GLMMs there are additional elements to deal with, such as which link function to use and how to manage overdispersion. You also have to consider that most of their statistical inference is based on asymptotic assumptions, so hypothesis tests and confidence intervals assume infinite degrees of freedom. GLMM parameter estimates can have some bias, particularly if your counts are small. In addition, because you are dealing with a Binomial distribution, you will not be able to use the same definition of heritability (h² = VA/VP) as we use under Normally distributed data, and there is no explicit error variance!
Nevertheless, GLMMs, if they fit properly, will weight the observations appropriately, provide predictions within the natural range of the distribution, and, depending on the link used, have some reasonable underlying biological assumptions (as with a Probit link function). Hence, they represent the natural choice for ‘good’ data and ‘simple’ models, and they are widely accepted in the community.
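To illustrate why predictions from a GLMM stay within the natural range of the data, here is a minimal sketch (in Python with SciPy, purely for illustration; the predictor values are hypothetical) of how the inverse logit and probit links map an unbounded linear predictor back into (0, 1):

```python
import numpy as np
from scipy.special import expit  # inverse logit link
from scipy.stats import norm

# Hypothetical values of the linear predictor on the link scale
# (i.e., fixed + random effects for some genotypes)
eta = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

p_logit = expit(eta)      # inverse logit: 1 / (1 + exp(-eta))
p_probit = norm.cdf(eta)  # inverse probit: standard Normal CDF

# Both links map any real-valued predictor back into (0, 1),
# so predicted probabilities always stay in the natural range
assert np.all((p_logit > 0) & (p_logit < 1))
assert np.all((p_probit > 0) & (p_probit < 1))
```

Whatever value the fixed and random effects take, the predicted probability can never fall outside 0 to 1, in contrast to the LMM approach discussed next.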
A close alternative to GLMMs is the use of HGLMs, which also allow for the incorporation of random effects, with the increased flexibility that the random terms can have several distributions, not only the Normal distribution as with GLMMs. In addition, this class of models tends to present less bias in parameter estimates. We do not discuss HGLMs further as they have other limitations and are currently not implemented in ASReml.
2. Completely ignore the issue of non-Normality
A different option is to ignore the fact that you have binary data and fit an LMM using REML, assuming a Normal distribution for your residuals. Of course, this will create issues. You will see that your residual plots look very strange (two groups of points) and inferences will be unreliable. You will also likely have increased bias in the variance component estimates. However, if you use large quantities of data, these problems are less likely to arise. This is because the distribution of your random effects is still a Normal distribution, and therefore parental effects, for example, are estimated based on some form of ‘average’ over all offspring and relatives. If you have a moderate to large sample size, this average, under the central limit theorem (CLT), will be well approximated by a Normal distribution.
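The ‘two groups of points’ residual pattern is easy to reproduce; here is a minimal hypothetical simulation (in Python with NumPy; all names and numbers are illustrative only) fitting a least-squares line to a 0/1 response:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: a continuous covariate and a binary response
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-x))   # true success probability
y = rng.binomial(1, p)     # observed 0/1 outcomes

# Fit a straight line by least squares, as if y were Normal
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# The residuals split into two bands, one per outcome:
# resid = 1 - fitted when y = 1, and resid = -fitted when y = 0
resid_zeros, resid_ones = resid[y == 0], resid[y == 1]
```

Plotting `resid` against the fitted values shows the two parallel bands, which is why standard residual diagnostics are of limited use under this approach.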
There are several disadvantages to this option. The main one is that there is no boundary associated with the natural range of the response due to the assumption of Normality. For example, you might obtain a prediction for a given family outside the range of 0 to 1 (for example, 1.03). This does not make sense and makes interpretation difficult (like saying that 103% of the individuals in this family will be susceptible!). Other disadvantages are associated with unreliable statistical inference in terms of p-values and confidence intervals.
But there are advantages to using this approach. The main one is that you will be able to estimate a heritability using your traditional equation (h² = VA/VP). In addition, this option allows you to fit some more complex models, such as spatial models with correlated errors, and multi-trait models with the use of a multivariate Normal distribution and correlated errors.
Our experience is that this approach often works well, and it is good to use in combination with the previous GLMM option. Comparing the predictions, or BLUP estimates (EBVs or genetic effects), from both the GLMM and the LMM is good practice, and for us has resulted in high correlations (> 0.95) for the top (and bottom) individuals. Hence, the ‘ignore non-Normality’ approach seems reasonable, particularly for some operational breeding decisions, with little loss of information.
3. Implement a transformation of your response variable
Historically, the main recommendation for dealing with non-Normal data has been to use a transformation and then treat the new variable as approximately Normally distributed. This has been the case for Binomial, binary and Poisson data, for which there are widely accepted transformations. For binary data we can use a logit transformation, or an arcsine-square-root transformation (also known as the angular transformation). Both transformations will provide reasonable approximations to a Normal distribution. But the most important aspect is that they will ensure your predictions (means, etc.) are contained within the natural range of your original data (i.e., 0 to 1). In addition, the calculation of heritability is possible in this case using our traditional expression.
This approach is easy to implement and widely used, but its main disadvantage is that BLUPs and/or predictions are on the transformed scale, and they will need to be back-transformed. However, for standard errors (SEs) there is no back-transformation (one alternative is to back-transform the confidence limits). In addition, you might find that, even after trying several transformations, your residuals still do not look as Normally distributed as you would like.
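Both transformations and their back-transformations can be sketched in a few lines (shown here in Python with NumPy for illustration; the proportions are hypothetical):

```python
import numpy as np

# Hypothetical observed proportions (e.g., plot-level infection rates)
p = np.array([0.02, 0.25, 0.50, 0.75, 0.98])

# Logit transformation (undefined at exactly 0 or 1, so values at the
# boundary need a small adjustment before transforming)
z_logit = np.log(p / (1 - p))
back_logit = 1 / (1 + np.exp(-z_logit))  # back-transform to proportions

# Arcsine-square-root (angular) transformation, in radians
z_angular = np.arcsin(np.sqrt(p))
back_angular = np.sin(z_angular) ** 2    # back-transform to proportions
```

The analysis is then run on `z_logit` or `z_angular`, with the predicted means back-transformed for reporting; as noted above, the SEs themselves cannot be back-transformed.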
In our experience, this is a very good practical alternative for binary data. And, as indicated before, comparing it with the previous two model options (GLMM and LMM) will provide some confidence in its application and the resulting genotype selection. However, be aware that this approach is frowned upon in some academic circles!
4. Collapse data into a higher stratum
One good alternative is to simplify the data. This can be done by collapsing levels (or experimental strata) in your dataset. For example, if you have 30 plants per plot for which a binary response has been recorded, it is possible to move from the level of the individual plant to the level of the plot by considering the average of these 30 plants. This average effectively becomes a proportion bounded between 0 and 1.
The above approach has several advantages. Firstly, the complexity of the model is reduced, and hence it is more likely to converge properly. Secondly, given that the response is now an average, it will have an approximately Normal distribution based, again, on the CLT, and this can be a good approximation for the moderate sample sizes used to calculate these averages. In our example, a mean of 30 plants provides a reasonable approximation. Thirdly, and most importantly, with the use of these averages we can more ‘safely’ rely on the assumption of Normally distributed residuals, and this opens the door to fitting complex LMMs (e.g., with temporal correlation or multi-trait models).
There are disadvantages though, the main one being that our response variable now represents a different stratum; hence, heritability must be interpreted at that stratum (e.g., the heritability of the plot average, not of the individual plant). Another important statistical disadvantage is that the weight given to each observation is different than under a GLMM, and hence the inference will change, with its quality depending on the quality of the approximation. Also, as the LMM analysis has no boundary associated with the response, predictions outside the natural range of a proportion are still possible, although less likely to occur. Lastly, some conditions for a good approximation under the CLT need to be observed; for example, when observations are at the extremes (all 0s or all 1s) the distribution of residuals will not be reasonable.
In our experience, this is a very good option that does not rely on ignoring the problem (option 2) or manipulating the response with a transformation (option 3). Instead, it relies purely on the beauty of the central limit theorem. If we have plot averages based on 50 plants, then we expect almost no differences from a GLMM with its appropriate distribution (you can check this, and you will see that p-values, predictions and SEs are very similar!).
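The collapsing step itself is straightforward; here is a minimal hypothetical simulation (in Python with NumPy; plot counts and probabilities are illustrative only) of 30 plants per plot reduced to plot-level proportions:

```python
import numpy as np

rng = np.random.default_rng(42)
n_plots, plants_per_plot = 100, 30

# Hypothetical plot-level true infection probabilities
true_p = rng.uniform(0.3, 0.7, size=n_plots)

# Individual plant outcomes (0/1), 30 plants per plot
plants = rng.binomial(1, true_p[:, None], size=(n_plots, plants_per_plot))

# Collapse to the plot stratum: the response becomes a proportion in [0, 1],
# approximately Normal by the CLT for moderate plot sizes
plot_prop = plants.mean(axis=1)
```

The LMM is then fitted to `plot_prop` rather than to the individual 0/1 records, with all inference (including heritability) interpreted at the plot stratum.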
5. Consider Bayesian approaches to fit GLMM
There are a few sophisticated Bayesian packages that can fit an array of our common genetic models; Stan, WinBUGS and the R libraries MCMCglmm and brms are good examples. Under flexible assumptions, Bayesian software will allow you to specify almost any distribution for the response variable and for the ‘random’ and ‘fixed’ effects included in the model. One important advantage of these models is that, in comparison with the other options, they tend to present less bias in the parameter estimates. Hence, Bayesian models constitute a good option for fitting simple and complex models to our binary response (and almost any other response).
The main disadvantage is the additional level of complexity in fitting these models. This includes, among other things, selecting and clearly defining a prior distribution for each model parameter, and determining the number of iterations, the length of the burn-in and the degree of thinning. There is also increased work in using different diagnostic tools to assess the quality of the fitted Bayesian model.
Our experience is that a Bayesian model constitutes a great alternative to the methods above if the time and effort are invested in understanding the different software and the difference in logic between a frequentist and a Bayesian approach. These models are likely to be more stable, particularly with unbalanced data, and may allow you to fit complex models that are not possible under either LMM or GLMM. Comparison of these models with the more traditional approaches has resulted in high correlations in the breeding values (> 0.95). And something not to forget is that a good Bayesian model allows for more in-depth statistical inference; for example, we will have posterior distributions of our heritability estimates!
In this blog we have presented five options to consider when dealing with binary data, and these can be empirically evaluated to tune up and decide which is best for your breeding program. However, there are some general aspects of a binary response associated with the amount of information this type of data contains. Binary responses contain only 0s and 1s; thus, a binary variable is a less informative measure than a continuous variable (think about standard deviations, modes, etc.). Ideally, a binary trait should have a proportion of 1s (or 0s) close to 50%, as at this level the information is maximized. For example, if the proportion of successes is 2%, then there are very few data points that actually contribute to differentiating effects (say, parents) and to calculating the variance components associated with a given factor. At this low level of success you can almost think of the measurements on your response variable as random!
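The claim that information is maximized near 50% follows from the Bernoulli variance p(1 − p), which peaks at p = 0.5; a quick numerical check (in Python with NumPy, for illustration):

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)   # possible proportions of success
info = p * (1 - p)                # Bernoulli variance per observation

# The variance (and hence the information carried per binary
# observation) peaks at p = 0.5 and collapses towards the extremes
best_p = p[np.argmax(info)]
```

At p = 0.02 the per-observation variance is roughly 0.02, against 0.25 at p = 0.5, which is why traits with very rare successes (or failures) are so hard to analyse.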
Another aspect to consider is associated with the strata of the experiment, as mentioned above. It is recommended to carefully plan the design of the experiment, for example to ensure a reasonable number of individuals per plot. This is important if we intend to collapse the data, but it also helps to ensure we have sufficient information for the estimation of variance components at each stratum. If you plan for sufficient degrees of freedom per stratum, it is more likely that the model will fit successfully under either an LMM or a GLMM, and, in addition, provide reasonable accuracy for your effects and good statistical power for inferences.
A more practical aspect is associated with disease or challenge experiments, for example, when a spore is used to infect plants. Recall that almost no information is obtained when ~0% of the genotypes are healthy and ~100% are infected. This implies that the load of spores (or any other infection approach) has to be well calibrated so that it neither a) kills all the plants, nor b) kills none of them. Hence, we recommend that an experiment should be planned aiming at around 50% success. In addition, as with many other similar studies, it is critical to have positive and negative controls to assess the effectiveness of the application.
Finally, it is important that if binary (or Binomial) data is going to be the focus of the analysis, then the study should be planned using the simplest experimental design possible (for example, favour randomized complete blocks over incomplete block designs) and minimize the level of complexity in the design (i.e., avoid split-plot designs). Always be aware that for fitting GLMMs it is better to have simple and balanced experiments. And remember, binary responses are information poor, so keep in mind that you might not get results as exciting as you might expect!
Dr. Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world.
Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical and biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer-reviewed publications, and he is one of the co-authors of the textbook Statistical Methods in Biology: Design and Analysis of Experiments and Regression.
The VSNi Team
29 June 2022
ASRgenomics: filling the gap on processing molecular data for quantitative genetics
Most breeding programs are supported by an array of genomic information that will provide new options to increase the rates of genetic gain. However, performing statistical analyses with molecular data can be a difficult task. This type of data has to communicate properly with available phenotypic and pedigree data. The overall success of this integration depends on a set of checks, verifications, filters, and careful preparation of all these datasets in order to be able to fit genetic models successfully and to obtain the required output to make correct decisions.
The workflow of molecular data-driven analysis varies based on the source of the datasets and of course, on personal preferences. Nevertheless, regardless of these aspects, an efficient genomics pipeline should rely on answering some of the following questions:
We developed ASRgenomics to help deal with the above questions. This is a free-to-use R library which can be downloaded from the ASReml knowledgebase. It is a compilation of proven routines developed over several years of study and hands-on experience in the field. ASRgenomics was built with advanced statistical modelling in mind, and it fills a gap by helping you make sure your analyses are as efficient and accurate as they can be, with several explicit diagnostic tools.
The package is aimed at geneticists and breeders with the purpose of improving their experience with genomic analyses, such as Genomic Selection (GS) and Genome Wide Association Studies (GWAS), in a straightforward and efficient manner. The main capabilities of the package include:
The functions included within ASRgenomics are very flexible and can be used for a tailored workflow from raw molecular data to well-behaved model-ready matrices. Additionally, ASRgenomics is capable of seamlessly preparing genomic datasets for integration with ASReml-R to fit linear mixed models (LMMs; e.g., GBLUP or ssGBLUP).
Please try this free library, and check out the user guide included within the doc folder inside the download package for a walk-through of the features along with details of the methods.
The VSNi Team
13 June 2022
What t-test to use: Student’s or Welch’s?
Imagine we are interested in whether female undergraduate students with blonde or brown hair have different heights, on average. To investigate this, we randomly sample blonde and brown haired female undergraduate students and measure their height.
We construct an appropriate statistical test based on the following null (H0) and alternative (H1) hypotheses, where μ1 and μ2 are the population mean heights of blonde and brown haired female undergraduate students, respectively:

H0: μ1 = μ2
H1: μ1 ≠ μ2
| Assumptions | Student’s t-test | Welch’s t-test |
| --- | --- | --- |
| Two independent, random samples | ✓ | ✓ |
| The two populations have Normal distributions | ✓ | ✓ |
| The variances of the two populations are equal | ✓ | |
If the sample sizes are unequal between the two groups, Welch's t-test performs better than Student's t-test.
| If you have | Use |
| --- | --- |
| Unequal sample variances and/or unequal sample sizes | Welch’s t-test (also known as the unequal variances t-test); it performs better (i.e., is more reliable) than Student’s t-test |
| Equal population variances and equal sample sizes | Student’s t-test; it has more power than Welch’s t-test |
Performing a Student’s t-test or Welch’s t-test in Genstat is straightforward.
From the menu bar, select Stats | Statistical tests | One- and two-sample t-tests. Then, in the T-Tests menu, set the type of test to Two-sample.
We can run a Student’s t-test by clicking the Options button, then selecting Pooled as the method used to estimate the variances for the test.
To run a Welch’s t-test, in the T-Test Options menu select Separate as the method used to estimate the variances for the test.
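Outside Genstat, the same two tests can also be run programmatically; here is a minimal sketch using SciPy (the height samples below are simulated and purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical height samples (cm) with unequal spread and sample size
blonde = rng.normal(165, 6, size=25)
brown = rng.normal(167, 9, size=40)

# Student's t-test: assumes equal population variances (pooled estimate)
t_student, p_student = stats.ttest_ind(blonde, brown, equal_var=True)

# Welch's t-test: separate variance estimates, no equal-variance assumption
t_welch, p_welch = stats.ttest_ind(blonde, brown, equal_var=False)
```

The only difference between the two calls is the `equal_var` argument, which mirrors the Pooled versus Separate variance choice in the Genstat T-Test Options menu.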