A binary phenotypic response in breeding analysis: What do I do?

Dr. Salvador A. Gezan

05 July 2022

The challenge of non-Normal data

It is common for breeding analyses to have a phenotypic response that is non-Normally distributed. For example, for each genotype we may record survival, or a count of flowers or eggs produced, and these will typically follow a Binomial or Poisson distribution, respectively. The issue then is: how do we deal with the quantitative genetics analysis of binary or count data, when our linear mixed model (LMM) framework is based on Normality assumptions and these responses are clearly non-Normal?

Exploring statistical options for binary data analysis

Most statistical software, including ASReml and Genstat, deals with LMMs under Normal data in a very efficient and flexible way, providing strong statistical inference. ASReml and Genstat, under an LMM, can deal with heterogeneous errors, correlated structures (in time and space) and even correlated random effects (as when we use a relationship matrix). However, under non-Normal responses this becomes increasingly difficult. An alternative is to fit a Generalised Linear Mixed Model (GLMM) or a Hierarchical Generalised Linear Model (HGLM). These model types consider the underlying probability distribution and allow us to estimate variance components for relevant random effects. GLMMs are implemented in ASReml-SA and ASReml-R with an array of available distributions and link functions. GLMMs can incorporate pedigree, some level of imbalance, and complex variance structures. However, they are not free of complications; for example, they might not converge when complex correlated structures are needed, such as for multi-environment trial and multi-trait analyses, as they do not rely on a multivariate Normal distribution.

Given the complexity of this data, what are our analytical options? In this blog, we will talk about binary data and provide you with a few options for analysing this data within the context of an operational breeding program. Some of the options are more practical (and often approximate). However, they can provide you with the required information, at reasonable accuracy, to perform your selection of outstanding genotypes.

For the options below, we will consider that we have a dataset in which each of the n individuals was recorded for a binary response. That is, a 0/1 outcome for a given trait of interest: for example, whether they show resistance (1 for success) or not (0 for failure) to a given viral infection. Note that the dataset can have additional factors and covariates, but it will have a total of n observations.
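
To make the options below concrete, here is a minimal sketch in R simulating a dataset of this form. All names (family, block, resist) and the simulation settings are hypothetical and purely illustrative:

```r
# A toy binary dataset: n individuals, each scored 0/1 for resistance,
# with hypothetical 'family' and 'block' factors.
set.seed(101)
n <- 600
dat <- data.frame(
  family = factor(rep(1:20, each = 30)),   # 20 families, 30 offspring each
  block  = factor(rep(1:6, times = 100))   # 6 blocks (illustrative only)
)
fam_eff <- rnorm(20, 0, 0.5)                 # family effects on the liability scale
eta <- -0.2 + fam_eff[dat$family]            # linear predictor (logit scale)
dat$resist <- rbinom(n, size = 1, prob = plogis(eta))  # 0 = failure, 1 = success
head(dat)
```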

Options to analyse binary data

There are several options to deal with this type of data and obtain exact or approximate results. Below we list (and explain) some of the options we have found useful, with their pros and cons. As you might expect, this is not exhaustive, but these options will give you a starting point at the very least.

1. Use Generalized Linear Mixed Models (GLMMs)

GLMMs are an appropriate statistical tool to use with this type of data. As indicated above, they will consider the underlying distribution and allow you to estimate the variance components of interest. However, one aspect to consider is that these models are more prone to convergence issues, and there is a higher chance of getting stuck at a local optimum, as the algorithms optimise a more complex likelihood function than under REML for LMMs. This problem is reduced if you use simpler models with little or no imbalance, but they still require careful checking (including whether they make sense biologically).

Also, with GLMMs there are additional elements to deal with, such as which link function to use and how to manage overdispersion. You also have to consider that most of their statistical inference is based on asymptotic assumptions, so hypothesis tests and confidence intervals assume infinite degrees of freedom. GLMM parameter estimates can have some bias, particularly if your counts are small. In addition, because you are dealing with a Binomial distribution, you will not be able to use the same definition of heritability (h² = VA/VP) as we use under Normally distributed data, and there is no explicit error variance!

Nevertheless, GLMMs, if they fit properly, will weight the observations appropriately, provide predictions within the natural range of the distribution, and, depending on the link used, have some reasonable underlying biological assumptions (as with a Probit link function). Hence, they represent the natural choice for ‘good’ data and ‘simple’ models, and they are widely accepted in the community.
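
As a hedged illustration (not an ASReml workflow), the following sketch fits a binomial GLMM to the toy dataset above using the open-source lme4 package; an asreml() call with a binomial family would play the same role. The liability-scale heritability shown uses the common convention of taking π²/3 as the implicit residual variance under a logit link, and assumes half-sib families:

```r
# A minimal GLMM sketch with lme4; 'dat' is the hypothetical toy dataset above.
library(lme4)

glmm <- glmer(resist ~ 1 + (1 | family) + (1 | block),
              family = binomial(link = "logit"), data = dat)
summary(glmm)

# There is no explicit error variance; under a logit link the residual
# variance on the underlying (liability) scale is taken as pi^2/3.
vc <- as.data.frame(VarCorr(glmm))
s2_fam <- vc$vcov[vc$grp == "family"]
h2_liab <- (4 * s2_fam) / (s2_fam + pi^2 / 3)  # assumes half-sib families
h2_liab
```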

A close alternative to GLMMs is the use of HGLMs, which also allow for the incorporation of random effects, with the increased flexibility that they can have several distributions for the random terms, not only the Normal distribution as with GLMMs. In addition, this class of models tends to present less bias in parameter estimates. We do not discuss HGLMs further as they have other limitations and are currently not implemented in ASReml.

2. Completely ignore the issue of non-Normality

A different option is to ignore the fact that you have binary data and fit an LMM using REML based on a Normal distribution for your residuals. Of course, this will create issues. You will see that your residual plots are very strange (two groups of points) and inferences will be unreliable. You will also likely have increased bias in the variance component estimates. However, if you use large quantities of data, these problems are less likely to arise. This is because the distribution of your random effects is still a Normal distribution, and therefore effects such as parental effects are estimated based on some form of ‘average’ over all offspring and relatives. If you have a moderate to large sample size, this average, under the central limit theorem (CLT), will be a good approximation to a Normal distribution.

There are several disadvantages to this option. The main one is that there is no boundary associated with the natural range of the response, due to the assumption of Normality. For example, you might obtain a prediction for a given family outside the range of 0 to 1 (for example, 1.03). This does not make sense and makes interpretation difficult (like saying that 103% of the individuals in this family will be resistant!). Other disadvantages are associated with unreliable statistical inference in terms of p-values and confidence intervals.

But there are advantages to using this approach. The main one is that you will be able to estimate a heritability using the traditional equation (h² = VA/VP). In addition, this option allows you to fit some more complex models, such as spatial models with correlated errors, and multi-trait models with the use of a multivariate Normal distribution and correlated errors.
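
As a sketch (again using lme4 on the hypothetical toy data, under the same half-sib assumption), this is what the ‘ignore non-Normality’ route looks like, including the traditional heritability:

```r
# Option 2 sketch: fit a plain LMM to the raw 0/1 response, then apply
# the traditional h2 = VA/VP on the observed scale.
library(lme4)

lmm <- lmer(resist ~ 1 + (1 | family) + (1 | block), data = dat)

vc <- as.data.frame(VarCorr(lmm))
s2_fam <- vc$vcov[vc$grp == "family"]
s2_res <- vc$vcov[vc$grp == "Residual"]
h2 <- (4 * s2_fam) / (s2_fam + s2_res)   # assumes half-sib families
h2

# Family BLUPs; note that predictions are not forced into the 0-1 range.
ranef(lmm)$family[1:5, , drop = FALSE]
```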

Our experience is that this approach often works well, and it is good to use in combination with the previous GLMM option. Comparing the predictions, or BLUPs (EBVs or genetic effects), from the GLMM and LMM models is good practice, and for us this has resulted in high correlations (> 0.95) for the top (and bottom) individuals. Hence, the ‘ignore non-Normality’ approach seems reasonable, particularly for some operational breeding decisions, with little loss of information.

3. Implement a transformation of your response variable

Historically, the main recommendation for dealing with non-Normal data has been to use a transformation and then treat the new variable as approximately Normally distributed. This has been the case for Binomial, binary and Poisson data, with widely accepted transformations. For binary data, we can use a logit transformation or an arcsine-square-root transformation (also known as the angular transformation); in practice these are applied to proportions (e.g., plot or family means), since a raw 0/1 score transformed directly still takes only two values. Both transformations will provide reasonable approximations to a Normal distribution. But the most important aspect is that they will ensure your predictions (means, etc.) are contained within the natural range of your original data (i.e., 0 to 1). In addition, the calculation of heritability in this case is possible using our traditional expression.
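
As a small hedged sketch, here are both transformations applied to proportions, together with the back-transformation of predictions (the 0.5 offsets in the empirical logit are one common device to avoid log(0); the exact constant is a judgment call):

```r
# Transformations for proportions y/n (e.g., resistant plants out of n per plot).
y <- c(1, 3, 15, 27, 29)   # hypothetical counts of successes
n <- 30                    # hypothetical plants per plot
p <- y / n

ang  <- asin(sqrt(p))                   # angular (arcsine-square-root) scale
elgt <- log((y + 0.5) / (n - y + 0.5))  # empirical logit, avoids log(0)

# Back-transform predictions (but not their standard errors!):
sin(ang)^2      # inverse of the angular transformation
plogis(elgt)    # inverse logit; approximate for the empirical version
```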

This approach is easy to implement, and widely used. But its main disadvantage is that BLUPs and/or predictions are on the transformed scale, and they will need to be back-transformed. However, for standard errors (SEs) there is no back-transformation (note, one alternative is to back-transform confidence limits). In addition, you might find that, even after trying several transformations, your residuals might still not look as Normally distributed as you would like. 

In our experience, this is a very good practical alternative for binary data. And, as indicated before, comparing it with the previous two model options (GLMM and LMM) will provide some confidence in its application and the resulting genotype selection. However, be aware that this approach is frowned upon in some academic circles!

4. Collapse data into a higher stratum

One good alternative is to simplify the data. This can be done by collapsing levels (or experimental strata) in your dataset. For example, if you have 30 plants per plot for which a binary response has been recorded, it is possible to move from the level of the individual plant to the level of the plot by taking the average of these 30 plants. This average is effectively a proportion, bounded between 0 and 1.
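
A minimal sketch of this collapsing step in R, assuming the plant-level data frame carries a hypothetical plot factor:

```r
# Collapse plant-level 0/1 records into plot-level proportions; 'plot' is a
# hypothetical factor identifying the 30 plants that share a plot.
dat$plot <- factor(rep(1:20, each = 30))   # illustrative plot labels only

plot_dat <- aggregate(resist ~ plot, data = dat, FUN = mean)
names(plot_dat)[2] <- "prop_resist"        # proportion of resistant plants
head(plot_dat)
# 'prop_resist' lies in [0, 1] and, being a mean of 30 plants, is roughly
# Normal by the CLT, so it can go into a standard LMM at the plot stratum.
```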

The above approach has several advantages. Firstly, the complexity of the model is reduced, and hence it is more likely to converge properly. Secondly, given that the response is now an average, it will have an approximate Normal distribution, based again on the CLT, which is a good approximation for moderate sample sizes used to calculate these averages. In our example, a mean of 30 plants provides a reasonable approximation. Thirdly, and most importantly, with the use of these averages we can more ‘safely’ rely on the assumption of Normally distributed residuals, and this opens the door to fitting complex LMMs (e.g., with temporal correlation or multi-trait models).

There are disadvantages though, the main one being that our response variable represents a different stratum; hence, heritability must be interpreted at that stratum (e.g., heritability of the plot average, not of the individual plant). Another important statistical disadvantage is that the weight given to each observation is different than under a GLMM, and hence the inference will change, with its quality depending on the quality of the approximation. Also, as the LMM analysis places no boundary on the response, predictions outside the natural range of a proportion are still possible, although less likely to occur. Lastly, some conditions for a good approximation under the CLT need to be observed; for example, when observations are at the extremes (all 0s or all 1s) the distribution of residuals will not be reasonable.

In our experience, this is a very good option that does not rely on ignoring the problem (option 2) or manipulating the response with a transformation (option 3). Instead, it relies purely on the beauty of the central limit theorem. If we have plot averages based on 50 plants, then we expect almost no differences from a GLMM with its appropriate distribution (you can check this and you will see that p-values, predictions and SEs are very similar!).

5. Consider Bayesian approaches to fit GLMMs

There are a few sophisticated Bayesian packages that can fit an array of our common genetic models; Stan, WinBUGS and the R libraries MCMCglmm and brms are good examples. Under flexible assumptions, Bayesian software will allow you to specify almost any distribution for the response variable and for the ‘random’ and ‘fixed’ effects included in the model. One important advantage of these models is that, in comparison with the other options, they tend to present less bias in the parameter estimates. Hence, Bayesian models constitute a good option for fitting simple and complex models to our binary response (and indeed almost any response).

The main disadvantage is the additional level of complexity involved in fitting these models. This includes, among other things, selecting and clearly defining a prior distribution for each model parameter, and determining the number of iterations, the length of the burn-in and the degree of thinning. There is also additional work in using different diagnostic tools to assess the quality of the fitted Bayesian model.
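
As a hedged sketch of these choices with MCMCglmm (the prior, iterations, burn-in and thinning below are illustrative settings, not recommendations), fitted to the hypothetical toy data; brms would offer a similar route:

```r
# For a binary trait MCMCglmm uses the 'categorical' family, and the residual
# variance must be fixed, as it is not identifiable from 0/1 data.
library(MCMCglmm)

prior <- list(
  R = list(V = 1, fix = 1),                # fix residual variance at 1
  G = list(G1 = list(V = 1, nu = 0.002),   # weak prior, family variance
           G2 = list(V = 1, nu = 0.002))   # weak prior, block variance
)

bayes_glmm <- MCMCglmm(resist ~ 1,
                       random = ~ family + block,
                       family = "categorical",
                       prior  = prior,
                       nitt = 60000, burnin = 10000, thin = 25,
                       data = dat, verbose = FALSE)

# A posterior distribution of heritability on the liability scale (one common
# convention for half-sib families: 'units' is the fixed residual, pi^2/3 is
# the implicit logit-link variance).
h2_post <- 4 * bayes_glmm$VCV[, "family"] /
  (bayes_glmm$VCV[, "family"] + bayes_glmm$VCV[, "units"] + pi^2 / 3)
summary(h2_post)
```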

Our experience is that a Bayesian model constitutes a great alternative to the methods above if the time and effort are invested in understanding the different software and the difference in logic between a frequentist and a Bayesian approach. These models are likely to be more stable, particularly under unbalanced data, and may allow you to fit complex models that are not possible under either LMM or GLMM. Comparison of these models with the more traditional approaches has resulted in high correlations in the breeding values (> 0.95). And something not to forget is that a good Bayesian model allows for more in-depth statistical inference; for example, we will have full posterior distributions of our heritability estimates!

Improving the data for my binary analysis

In this blog we have presented five options to consider when dealing with binary data, and these can be empirically evaluated to fine-tune and decide which is best for your breeding program. However, there are some general aspects of a binary response associated with the amount of information this type of data contains. Binary responses contain only 0s or 1s; thus, a binary variable is a less informative measure than a continuous variable (think about standard deviations, modes, etc.). Ideally, a binary trait should have a proportion of 1s (or 0s) close to 50%, as this is the level at which the information is maximized. For example, if the proportion of success is 2%, then there are very few data points that actually contribute to differentiating effects (say, parents) and to calculating the variance components associated with a given factor. At such a low level of success, you can almost think of the measurements on your response variable as random!
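
The 50% figure comes from the variance of a single binary observation, p(1 − p), which peaks at p = 0.5; a quick sketch in R:

```r
# The variance of a 0/1 observation is p * (1 - p): zero at p = 0 or 1,
# and largest at p = 0.5, where a binary trait is most informative.
p <- seq(0.01, 0.99, by = 0.01)
plot(p, p * (1 - p), type = "l",
     xlab = "proportion of successes (p)",
     ylab = "variance p(1 - p)")
abline(v = 0.5, lty = 2)   # maximum at p = 0.5
```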

Another aspect to consider is associated with the strata of the experiment, as mentioned above. It is recommended to carefully plan the design of the experiment, for example to ensure a reasonable number of individuals per plot. This is important if we intend to collapse the data, but it also helps to ensure we have sufficient information for the estimation of variance components at each stratum. If you plan for sufficient degrees of freedom per stratum, it is more likely that the model will fit successfully under either LMM or GLMM, and, in addition, provide reasonable accuracy for your effects and good statistical power for inference.

A more practical aspect is associated with disease or challenge experiments: for example, when a spore is used to infect plants. Recall that no information is obtained when ~0% or ~100% of the genotypes are infected. This implies that the load of spores (or any other infection approach) has to be well calibrated so that it neither: a) kills all plants, nor b) kills none at all. Hence, we recommend that an experiment should be planned aiming at around 50% success. In addition, as with many other similar studies, it is critical to have positive and negative controls to assess the effectiveness of the application.

Finally, it is important that if binary (or Binomial) data are going to be the focus of the analysis, then the study should be planned using the simplest experimental design possible (for example, favour randomized complete blocks over incomplete block designs), and the level of complexity in the design should be minimized (e.g., avoid split-plot designs). Always be aware that, for fitting GLMMs, it is better to have simple and balanced experiments. And remember, binary responses are information-poor, so keep in mind that you might not get results as exciting as you might expect!

About the author

Dr. Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world. 

Dr. Gezan has worked in agronomy, aquaculture, forestry, entomology, medical research and biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer-reviewed publications, and he is one of the co-authors of the textbook Statistical Methods in Biology: Design and Analysis of Experiments and Regression.