Why Machine Learning is not (yet) working for Genomic Prediction

Dr. Salvador A. Gezan

a month ago

In plant and animal breeding, the use of genomic prediction has become widespread; it is currently being implemented in many species and is resulting in increased genetic gains. In genomic prediction (GP), thousands of SNP markers are used as input to predict the performance of genotypes. A good model allows the performance of a genotype to be estimated before it is phenotypically measured, which makes selection cheaper and earlier and accelerates breeding programs.

At present, most of these predictive models use SNP marker information to fit linear models in which each marker is associated with an estimated effect. These models incorporate our current understanding of the accumulation of allele effects and of the infinitesimal model, where the phenotypic response of an individual is the result of hundreds or thousands of QTLs with small effects.
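
As a minimal sketch of what "each marker is associated with an estimated effect" means in practice, the Python snippet below predicts the performance of a genotype as an overall mean plus the sum of allele dosage times marker effect. The marker effects, the trait mean, the genotype names and the helper `predict_genotype` are all hypothetical illustrations; in a real analysis the effects would be estimated from a training population with a method such as GBLUP or BayesB.

```python
# Additive marker model: prediction = mean + sum(allele dosage * marker effect).
# All numbers below are made-up illustrations, not estimates from real data.

def predict_genotype(dosages, effects, mean):
    """Return the predicted performance of one genotype.

    dosages: copies of the counted allele at each SNP (0, 1 or 2)
    effects: estimated additive effect per allele copy at each SNP
    mean:    overall mean of the trait
    """
    if len(dosages) != len(effects):
        raise ValueError("one effect is needed per marker")
    return mean + sum(x * b for x, b in zip(dosages, effects))


marker_effects = [0.5, -0.25, 0.125, 0.0]  # hypothetical effects for 4 SNPs
overall_mean = 10.0                        # hypothetical trait mean

# Hypothetical selection candidates that have been genotyped but not phenotyped.
candidates = {
    "G1": [2, 0, 1, 1],
    "G2": [0, 2, 0, 2],
    "G3": [1, 1, 2, 0],
}

predictions = {name: predict_genotype(d, marker_effects, overall_mean)
               for name, d in candidates.items()}

# Rank candidates from best to worst predicted performance.
ranking = sorted(predictions, key=predictions.get, reverse=True)
print(predictions)
print(ranking)
```

With thousands of markers the arithmetic is identical; only the length of the two lists changes, which is why the candidates can be ranked as soon as they are genotyped.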

Machine Learning - the holy grail?

Machine learning (ML) has become widely used in many areas over the last few years. ML is a methodology in which computers are trained with large amounts of data to make predictions. There are many methods, but some of the most common are neural networks, random forests, and decision trees. In ML you do not need to understand the biological system; briefly, you provide the computer algorithm with huge amounts of data as training and you obtain a predictive system that can be used to estimate responses. Of course, its implementation is more complex than this description, and a critical part is evaluating the quality of the predictive system obtained.
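
One common way to evaluate the quality of a predictive system is k-fold cross-validation: every record is predicted by a model that was trained without it, and the held-out predictions are then compared with the observed values. The Python sketch below illustrates the idea with a deliberately simple stand-in model (a straight line on a single predictor) and made-up data; the function names and the data are assumptions for illustration only, and any GP or ML method could take the place of `fit_line`.

```python
# K-fold cross-validation sketch: every record is predicted by a model
# that never saw it, and predictive ability is the correlation between
# observed values and those held-out predictions.
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den


def cross_validate(xs, ys, k):
    """Return one held-out prediction per record from k-fold cross-validation."""
    n = len(xs)
    held_out = [None] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        intercept, slope = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        for i in test_idx:
            held_out[i] = intercept + slope * xs[i]
    return held_out


# Toy data (hypothetical): one predictor score and one observed response.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

predicted = cross_validate(x, y, k=5)
predictive_ability = pearson(y, predicted)
print(predictive_ability)
```

Because each prediction comes from a model that did not see that record, the resulting correlation is a fairer guide to performance on new genotypes than a correlation computed on the training data itself.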

ML has proven very useful, for example, to compare images to differentiate pictures of cats from dogs, and many other practical uses. Therefore, ML methods seem the logical tool for GP, particularly as we can have a set of genomic data for our crop of interest with up to 200,000 SNPs that were obtained with hundreds or even thousands of individuals. 

There have been several studies on the use of ML in GP, but the results have often been disappointing. In these studies, our traditional genomic prediction methods (BayesB and GBLUP) have consistently been superior to most ML algorithms. Based on this evidence, we are tempted to say that ML is not working for breeding and genomics. Yet this is a surprising result for a tool that is constantly praised in the media as very powerful and that is often associated with solving many everyday predictive problems.

Where Machine Learning is at a disadvantage…for now

So, ML is currently not a good option for use in GP. However, it is my belief that ML is simply still at a disadvantage against other GP methods, and that with time it might become as good as other approaches, or even the gold standard. Some of the reasons for this belief are detailed below.

  • ML requires large, often very large, amounts of data. This is usually not available in most of our current breeding programs. It is true that we have thousands, or even millions, of SNPs, but these are poor in information and highly correlated. In addition, the phenotypic records used to train these ML tools probably number only in the thousands, not the hundreds of thousands or millions that are reported in other fields where ML has been used successfully.
  • We have a pretty good understanding of gene action. Note that ML is often a black box, where our understanding of the biological system is ignored. However, for our GP models, we have good clarity on how the accumulation of alleles gives rise to additive effects, and this can be extended to dominance effects. This, together with the dynamics of Mendelian and Fisherian genetics, where we have a few QTLs with strong influences or a large number of QTLs with small influences, has led us to use marker-assisted selection and pedigree-based analyses successfully over the last 50 years.
  • We have an important gap between the computer scientists developing the ML tools we can use, and breeders or quantitative geneticists. In most successful breeding programs, there is a strong statistical component for the design and analysis of experiments, and now, with the use of genomic data, we have extended our models from pedigree-based analyses to molecular-based analyses or a combination of the two. However, the use of computationally intensive and rapidly evolving ML methods has been elusive to most breeding programs, and in some cases this is accompanied by a lack of understanding of the software that trains the ML models.

The routine implementation of ML in breeding programs will take some time. But as we accumulate information, and as we learn and interact with ML software and its routines, we will slowly see it being used in our crops. This will not mean the end of our more traditional tools or their replacement by ML applications. Our current understanding of the biology and the specific nature of our crops will still make our current toolbox valuable. It is our understanding that, at present, machine learning is not ready for breeding, but in due time it will creep up next to us!

Salvador A. Gezan

March, 2021


Related Reads


Kanchana Punyawaew

2 months ago
Linear mixed models: a balanced lattice square

This blog illustrates how to analyze data from a field experiment with a balanced lattice square design using linear mixed models. We’ll consider two models: the balanced lattice square model and a spatial model.

The example data are from a field experiment conducted at Slate Hall Farm, UK, in 1976 (Gilmour _et al_., 1995). The experiment was set up to compare the performance of 25 varieties of barley and was designed as a balanced lattice square with six replicates laid out in a 10 x 15 rectangular grid. Each replicate contained exactly one plot for every variety. The variety grown in each plot, and the coding of the replicates and lattice blocks, is shown in the field layout below:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_layout_7f57633d37_892b6cf234.png)

There are seven columns in the data frame: five blocking factors (_Rep, RowRep, ColRep, Row, Column_), one treatment factor, _Variety_, and the response variate, _yield_.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_data_bd9f4ee008_06c8a6e6fc.png)

The six replicates are numbered from 1 to 6 (_Rep_). The lattice block numbering is coded within replicates. That is, within each replicate the lattice rows (_RowRep_) and lattice columns (_ColRep_) are both numbered from 1 to 5. The _Row_ and _Column_ factors define the row and column positions within the field (rather than within each replicate).

### Analysis of a balanced lattice square design

To analyze the response variable, _yield_, we need to identify the two basic components of the experiment: the treatment structure and the blocking (or design) structure. The treatment structure consists of the set of treatments, or treatment combinations, selected to study or to compare. In our example, there is one treatment factor with 25 levels, _Variety_ (i.e. the 25 different varieties of barley).
The blocking structure of replicates (_Rep_), lattice rows within replicates (_Rep:RowRep_), and lattice columns within replicates (_Rep:ColRep_) reflects the balanced lattice square design. In a mixed model analysis, the treatment factors are (usually) fitted as fixed effects and the blocking factors as random.

The balanced lattice square model is fitted in [ASReml-R4](https://www.vsni.co.uk/software/asreml-r) using the following code:

```r
# Balanced lattice square model: Variety fixed, blocking terms random
lattice.asr <- asreml(fixed = yield ~ Variety,
                      random = ~ Rep + Rep:RowRep + Rep:ColRep,
                      data = data1)
```

The REML log-likelihood is -707.786.

The model’s BIC is:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_2_ac553eac69_6d6d40e073.jpg)

The estimated variance components are:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_3_69e11e2dff_c34641a3a9.jpg)

The table above contains the estimated variance components for all terms in the random model. The variance component measures the inherent variability of the term, over and above the variability of the sub-units of which it is composed. The variance components for _Rep_, _Rep:RowRep_ and _Rep:ColRep_ are estimated as 4263, 15596, and 14813, respectively. As is typical, the largest unit (replicate) is more variable than its sub-units (lattice rows and columns within replicates). The _"units!R"_ component is the residual variance.

By default, fixed effects in ASReml-R4 are tested using sequential Wald tests:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_4_e237aed045_274881533e.jpg)

In this example, there are two terms in the summary table: the overall mean, (_Intercept_), and _Variety_. As the tests are sequential, the effect of _Variety_ is assessed by calculating the change in sums of squares between the two models (_Intercept_)+_Variety_ and (_Intercept_).
The p-value (Pr(Chisq)) of < 2.2 x 10⁻¹⁶ indicates that _Variety_ is highly significant.

The predicted means for _Variety_ can be obtained using the predict() function. The standard error of the difference between any pair of variety means is 62. Note: all variety means have the same standard error as the design is balanced.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_5_575ede3e94_5b9209f7c3.jpg)

Note: the same analysis is obtained when the random model is redefined as replicates (_Rep_), rows within replicates (_Rep:Row_) and columns within replicates (_Rep:Column_).

### Spatial analysis of a field experiment

As the plots are laid out in a grid, the data can also be analyzed using a spatial model. We’ll illustrate spatial analysis by fitting a model with a separable first order autoregressive process in the field row (_Row_) and field column (_Column_) directions. This is often a useful model to start the spatial modeling process.

The separable first order autoregressive spatial model is fitted in ASReml-R4 using the following code:

```r
# Spatial model: separable AR1 x AR1 residual structure over the field grid
spatial.asr <- asreml(fixed = yield ~ Variety,
                      residual = ~ ar1(Row):ar1(Column),
                      data = data1)
```

The BIC for this spatial model is:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_6_3b978358f9_e792bcc2bd.jpg)

The estimated variance components and sequential Wald tests are:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_7_82255b3b94_b5bc40e6ab.jpg)

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_8_544d852c25_53b792377f.jpg)

The residual variance is 38713, the estimated row correlation is 0.458, and the estimated column correlation is 0.684. As for the balanced lattice square model, there is strong evidence of a _Variety_ effect (p-value < 2.2 x 10⁻¹⁶).
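
The BIC values reported in the output images can be sanity-checked by hand. The sketch below assumes that BIC is computed as -2 times the REML log-likelihood plus the number of variance parameters times the log of the residual degrees of freedom (observations minus fixed-effect parameters). That form is an assumption here rather than something stated in this post, although it does reproduce the BIC of 1435 quoted for the balanced lattice square model. The snippet is written in Python purely for the arithmetic, and `bic_reml` is a hypothetical helper name.

```python
# BIC from a REML log-likelihood, assuming the form
#   BIC = -2 * logL + t * log(n - p)
# with t variance parameters, n observations and p fixed-effect parameters.
from math import log


def bic_reml(loglik, n_var_params, n_obs, n_fixed):
    """Return the BIC under the assumed REML-based definition above."""
    residual_df = n_obs - n_fixed
    return -2.0 * loglik + n_var_params * log(residual_df)


# Balanced lattice square model from the text:
#   logL = -707.786
#   4 variance parameters: Rep, Rep:RowRep, Rep:ColRep and the residual
#   150 plots (10 x 15 grid) and 25 fixed-effect parameters (25 varieties)
bic_lattice = bic_reml(-707.786, 4, 150, 25)
print(round(bic_lattice))  # 1435
```

Under the same assumption, the spatial model carries a smaller penalty because it has only three variance parameters (the residual variance and the two autoregressive correlations).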
A [log-likelihood ratio test](https://www.statisticshowto.com/likelihood-ratio-tests/) cannot be used to compare the balanced lattice square model with the spatial model, as the variance models are not nested. However, the two models can be compared using BIC. As the spatial model has a smaller BIC (1415) than the balanced lattice square model (1435), it is the preferred model of the two explored in this blog. However, selecting the optimal spatial model can be difficult. The current spatial model can be extended by including measurement error (or nugget effect) or revised by selecting a different variance model for the spatial effects.

#### References

Butler, D.G., Cullis, B.R., Gilmour, A. R., Gogel, B.G. and Thompson, R. (2017). _ASReml-R Reference Manual Version 4._ VSN International Ltd, Hemel Hempstead, HP2 4TP UK.

Gilmour, A. R., Anderson, R. D. and Rae, A. L. (1995). _The analysis of binomial data by a generalised linear mixed model_, Biometrika 72: 593-599.


Kanchana Punyawaew and Dr. Vanessa Cave

2 months ago
Mixed Models for Repeated Measures and Longitudinal Data

The term “**repeated measures**” refers to experimental designs or observational studies in which each experimental unit (or subject) is measured repeatedly over time or space. "**Longitudinal data**" is a special case of repeated measures in which variables are measured over time (often for a comparatively long period of time) and duration itself is typically a variable of interest.

In terms of data analysis, it doesn’t really matter what type of data you have, as you can analyze both using mixed models. Remember, the key feature of both types of data is that the response variable is measured more than once on each experimental unit, and these repeated measurements are likely to be correlated.

### Mixed Model Approaches

To illustrate the use of mixed model approaches for analyzing repeated measures, we’ll examine a data set from Landau and Everitt’s 2004 book, “_A Handbook of Statistical Analyses using SPSS_”. Here, a double-blind, placebo-controlled clinical trial was conducted to determine whether an estrogen treatment reduces post-natal depression. Sixty-three subjects were randomly assigned to one of two treatment groups: placebo (27 subjects) and estrogen treatment (36 subjects). Depression scores were measured on each subject at baseline, i.e. before randomization (_predep_), and at six two-monthly visits after randomization (_postdep_ at visits 1-6). However, not all the women in the trial had their depression score recorded on all scheduled visits.

In this example, the data were measured at fixed, equally spaced, time points. (_Visit_ is time as a factor and _nVisit_ is time as a continuous variable.) There is one between-subject factor (_Group_, i.e. the treatment group, either placebo or estrogen treatment), one within-subject factor (_Visit_ or _nVisit_) and a covariate (_predep_).
![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/repeated_measures_data_4f63d505a9_20e39072bf.png)

Using the following plots, we can explore the data. In the first plot below, the depression scores for each subject are plotted against time, including the baseline, separately for each treatment group.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/repeated_measures_1_4149bce2a1_20e3c0f240.png)

In the second plot, the mean depression score for each treatment group is plotted over time. From these plots, we can see that there is variation among subjects within each treatment group, that depression scores generally decrease with time, and that on average the depression score at each visit is lower with the estrogen treatment than with the placebo.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/repeated_measures_2_92810e7fc9_da9b1e85ff.png)

### Random effects model

The simplest approach for [analyzing repeated measures data](https://www.theanalysisfactor.com/repeated-measures-approaches/) is to use a random effects model with _**subject**_ fitted as random. It assumes a constant correlation between all observations on the same subject. The analysis objectives can either be to measure the average treatment effect over time or to assess treatment effects at each time point and to test whether treatment interacts with time.

In this example, the treatment (_Group_), time (_Visit_), treatment by time interaction (_Group:Visit_) and baseline (_predep_) effects can all be fitted as fixed. The subject effects are fitted as random, allowing for constant correlation between depression scores taken on the same subject over time.
The code and output from fitting this model in [ASReml-R 4](https://www.vsni.co.uk/software/asreml-r) follow:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/4_020d75dee9.png)

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/5_ef250deb61.png)

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/6_15e353865d.png)

The output from summary() shows that the estimates of the subject and residual variances from the model are 15.10 and 11.53, respectively, giving a total variance of 15.10 + 11.53 = 26.63. The Wald tests (from the wald.asreml() table) for _predep_, _Group_ and _Visit_ are significant (probability level (Pr) ≤ 0.01). There appears to be no interaction between treatment group and time (_Group:Visit_), i.e. the probability level is greater than 0.05 (Pr = 0.8636).

### Covariance model

In practice, the correlation between observations on the same subject is often not constant. It is common to expect that measurements made closer together in time are more strongly correlated than those made further apart. Mixed models can accommodate many different covariance patterns. Ideally, we select the pattern that best reflects the true covariance structure of the data. A typical strategy is to start with a simple pattern, such as compound symmetry or first-order autoregressive, and test if a more complex pattern leads to a significant improvement in the likelihood.

Note: using a covariance model with a simple correlation structure (i.e. uniform) will provide the same results as fitting a random effects model with random subject.

In ASReml-R 4 we use the corv() function on time (i.e. _Visit_) to specify uniform correlation between depression scores taken on the same subject over time.
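
The equivalence stated in the note above can also be checked by hand. Under a random subject effect, any two observations on the same subject share the subject variance, so the implied uniform correlation is the subject variance divided by the total variance. The sketch below (in Python, purely for the arithmetic; the variable names are illustrative) uses the two variance estimates quoted for the random effects model.

```python
# Compound symmetry implied by a random subject effect:
# two observations on the same subject share the subject variance,
# so their correlation is subject / (subject + residual).

subject_var = 15.10   # subject variance quoted from the summary() output
residual_var = 11.53  # residual variance quoted from the summary() output

total_var = subject_var + residual_var
within_subject_corr = subject_var / total_var

print(round(total_var, 2))            # 26.63
print(round(within_subject_corr, 2))  # 0.57
```

A uniform-correlation fit of the same data would therefore be expected to report a correlation of about 0.57 and a variance of about 26.63.
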
![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/7_3f3a2b825a.png)

Here, the estimate of the correlation among times (_Visit_) is 0.57 and the estimate of the residual variance is 26.63 (identical to the total variance of the random effects model, asr1).

Specifying a heterogeneous first-order autoregressive covariance structure is easily done in ASReml-R 4 by changing the variance-covariance function in the residual term from corv() to ar1h().

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/8_27fce61956.png)

### Random coefficients model

When the relationship of a measurement with time is of interest, a [random coefficients model](https://encyclopediaofmath.org/wiki/Random_coefficient_models) is often appropriate. In a random coefficients model, time is considered a continuous variable, and the subject and subject by time interaction (_Subject:nVisit_) are fitted as random effects. This allows the slopes and intercepts to vary randomly between subjects, so that a separate regression line is fitted for each subject. However, importantly, the slopes and intercepts are correlated.

The str() function in the asreml() call is used for fitting a random coefficients model:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/9_ec27199248.png)

The summary table contains the variance parameters for _Subject_ (the set of intercepts, 23.24) and _Subject:nVisit_ (the set of slopes, 0.89), the estimate of the correlation between the slopes and intercepts (-0.57) and the estimate of the residual variance (8.38).

### References

Brady T. West, Kathleen B. Welch and Andrzej T. Galecki (2007). _Linear Mixed Models: A Practical Guide Using Statistical Software_. Chapman & Hall/CRC, Taylor & Francis Group, LLC.

Brown, H. and R. Prescott (2015). _Applied Mixed Models in Medicine_. Third Edition. John Wiley & Sons Ltd, England.

Sabine Landau and Brian S. Everitt (2004). _A Handbook of Statistical Analyses using SPSS_. Chapman & Hall/CRC Press LLC.


Dr. John Rogers

a month ago
50 years of bioscience statistics

Earlier this year I had an enquiry from Carey Langley of VSNi as to why I had not renewed my Genstat licence. The truth was simple – I have decided to fully retire after 50 years as an agricultural entomologist / applied biologist / consultant. This prompted some reflections about the evolution of bioscience data analysis that I have experienced over that half century, a period during which most of my focus was the interaction between insects and their plant hosts: both how insect feeding impacts on plant growth and crop yield, and how plants impact on the development of the insects that feed on them and on their natural enemies.

### Where it began – paper and post

My journey into bioscience data analysis started with undergraduate courses in biometry – yes, it was an agriculture faculty, so it was biometry not statistics. We started doing statistical analyses using full keyboard Monroe calculators (for those of you who don’t know what I am talking about, you can find them [here](http://www.johnwolff.id.au/calculators/Monroe/Monroe.htm)). It was a simpler time and as undergraduates we thought it was hugely funny to divide 1 by 0 until the blue smoke came out…

After leaving university in the early 1970s, I started working for the Agriculture Department of an Australian state government, at a small country research station. Statistical analysis was rudimentary to say the least. If you were motivated, there was always the option of running analyses yourself by hand, given the appearance of the first scientific calculators in the early 1970s. If you wanted a formal statistical analysis of your data, you would mail off a paper copy of the raw data to Biometry Branch… and wait. Some months later, you would get back your ANOVA, regression, or whatever the biometrician thought appropriate to do, on paper with some indication of what treatments were different from what other treatments. Dose-mortality data was dealt with by manually plotting data onto probit paper.
### Enter the mainframe

In-house ANOVA programs running on central mainframes were a step forward some years later, as they at least enabled us to run our own analyses, as long as you wanted to do an ANOVA… However, it also required a two-hour drive to the nearest card reader, with the actual computer a further 1000 kilometres away…

The first desktop computer I used for statistical analysis was in the early 1980s and was a CP/M machine with two 8-inch floppy discs with, I think, 256k of memory, and booting it required turning a key and pressing the blue button - yes, really! And about the same time, the local agricultural economist drove us crazy extolling the virtues of a program called Lotus 1-2-3!

Having been brought up on a solid diet of the classic texts such as Steel and Torrie, Cochran and Cox and Sokal and Rohlf, the primary frustration during this period was not having ready access to the statistical analyses you knew were appropriate for your data. Typical modes of operating for agricultural scientists in that era were randomised blocks of various degrees of complexity, thus the emphasis on ANOVA in the software that was available in-house. Those of us who also had less-structured ecological data were less well catered for.

My first access to a comprehensive statistics package was during the early to mid-1980s at one of the American Land Grant universities. It was a revelation to be able to run virtually whatever statistical test was deemed necessary. Access to non-linear regression was a definite plus, given the non-linear nature of many biological responses. As well, being able to run a series of models to test specific hypotheses opened up new options for more elegant and insightful analyses. Looking back from 2021, such things look very trivial, but compared to where we came from in the 1970s, they were significant steps forward.
### Enter Genstat

My first exposure to Genstat, VSNi’s stalwart statistical software package, was Genstat for Windows, Third Edition (1997). Simple things like the availability of residual plots made a difference for us entomologists, given that much of our data had non-normal errors; it took the guesswork out of whether and what transformations to use. The availability of regressions with grouped data also opened some previously closed doors.

After a deviation away from hands-on research, I came back to biological-data analysis in the mid-2000s and found myself working with repeated-measures and survival / mortality data, so ventured into repeated-measures restricted maximum likelihood analyses and generalised linear mixed models for the first time (with assistance from a couple of Roger Payne’s training courses in Hobart and Queenstown). Looking back, it is interesting how quickly I became blasé about such computationally intensive analyses that would run in seconds on my laptop or desktop, forgetting that I was doing ANOVAs by hand 40 years earlier when John Nelder was developing generalised linear models. How the world has changed!

### Partnership and support

Of importance to my Genstat experience was the level of support that was available to me as a Genstat licensee. Over the last 15 years or so, as I attempted some of these more complex analyses, my aspirations were somewhat ahead of my abilities, and it was always reassuring to know that Genstat Support was only ever an email away. A couple of examples will flesh this out.

Back in 2008, I was working on the relationship between insect-pest density and crop yield using R2LINES, but had extra linear X’s related to plant vigour in addition to the measure of pest infestation. A support-enquiry email produced an overnight response from Roger Payne that basically said, “Try this”. While I slept, Roger had written an extension to R2LINES to incorporate extra linear X’s.
This was later incorporated into the regular releases of Genstat. This work led to the clearer specification of the pest densities that warranted chemical control in soybeans and dry beans ([https://doi.org/10.1016/j.cropro.2009.08.016](https://doi.org/10.1016/j.cropro.2009.08.016) and [https://doi.org/10.1016/j.cropro.2009.08.015](https://doi.org/10.1016/j.cropro.2009.08.015)).

More recently, I was attempting to disentangle the effects on caterpillar mortality of the two Cry insecticidal proteins in transgenic cotton and, while I got close, I would not have got the analysis to run properly without Roger’s support. The data was scant in the bottom half of the overall dose-response curves for both Cry proteins, but it was possible to fit asymptotic exponentials that modelled the upper half of each curve. The final double-exponential response surface I fitted with Roger’s assistance showed clearly that the dose-mortality response was stronger for one of the Cry proteins than the other, and that there was no synergistic action between the two proteins ([https://doi.org/10.1016/j.cropro.2015.10.013](https://doi.org/10.1016/j.cropro.2015.10.013)).

### The value of a comprehensive statistics package

One thing that I especially appreciate about having access to a comprehensive statistics package such as Genstat is having the capacity to tease apart biological data to get at the underlying relationships. About 10 years ago, I was asked to look at some data on the impact of cold stress on the expression of the Cry2Ab insecticidal protein in transgenic cotton. The data set was seemingly simple - two years of pot-trial data where groups of pots were either left out overnight or protected from low overnight temperatures by being moved into a glasshouse, plus temperature data and Cry2Ab protein levels.
A REML analysis, and some correlations and regressions, enabled me to show that cold overnight temperatures did reduce Cry2Ab protein levels, that the effects occurred for up to 6 days after the cold period and that the threshold for these effects was approximately 14 °C ([https://doi.org/10.1603/EC09369](https://doi.org/10.1603/EC09369)). What I took from this piece of work is how powerful a comprehensive statistics package can be in teasing apart important biological insights from what was seemingly very simple data. Note that I did not use any statistics that were cutting edge, just a combination of REML, correlation and regression analyses, but used these techniques to guide the dissection of the relationships in the data to end up with an elegant and insightful outcome.

### Final reflections

Looking back over 50 years of work, one thing stands out for me: the huge advances that have occurred in the statistical analysis of biological data have allowed much more insightful statistical analyses that have, in turn, allowed biological scientists to more elegantly pull apart the interactions between insects and their plant hosts.

For me, Genstat has played a pivotal role in that process. I shall miss it.

**Dr John Rogers**
Research Connections and Consulting
St Lucia, Queensland 4067, Australia
Phone/Fax: +61 (0)7 3720 9065
Mobile: 0409 200 701
Email: [john.rogers@rcac.net.au](mailto:john.rogers@rcac.net.au)
Alternate email: [D.John.Rogers@gmail.com](mailto:D.John.Rogers@gmail.com)


Dr. Salvador A. Gezan

4 days ago
Why do I hate zeros in my dataset?

It is always good practice to explore the data before you fit a model. A clear understanding of the dataset helps you to select the appropriate statistical approach and, in the case of linear models, to identify the corresponding design and treatment structure by defining relevant variates and factors.

So, I have in my hands a dataset from a given study, and I proceed to explore it, maybe to do some data cleaning, but mainly to get familiar with it. Assessing predictors is important, but it is more critical to evaluate the single or multiple response variables that need to be analysed. And it is in these columns where I often find surprises. Sometimes they contain not only numbers, as they should for linear model responses, but also non-numeric data. I have found comments (‘missing’ or ‘not found’), symbols (‘?’), and one or more definitions of missing values (‘NA’, ‘NaN’, ‘*’, ‘.’ or even ‘-9’). But what is most disturbing to me is the ZEROS, and I especially hate them when they come in masses!

But why do zeros make me angry?! Because their definition is not clear, and they can be the cause of large errors, and ultimately incorrect models. Here are some of my reasons…

### **Missing Values**

First, it is common to use zero as the definition for missing values. For example, a plant that did not have any fruit has a zero value. But what if the plant died before fruiting? Yes, it will have zero fruits, but here the experimental unit (the plant) no longer exists. In this case, there is a big difference between a true zero that was observed and a zero because of missing data.

### **Default Values**

Second, zeros are sometimes used as default values in the columns of spreadsheets. That is, you start with a column of zeros that is replaced by true records.
However, for many reasons data points may not be collected (for example, you could skip measuring your last replication), and hence some cells of the spreadsheet are not visited, and their values are unchanged from the zero default. Again, these are true missing values, and therefore they need to be recorded in a way that indicates that they were not observed!

### **Misleading Values**

Third, zeros are often values reflecting measurements that are below the detection limit. For example, if the detection limit of the weighing balance is 0.5 grams, then any weight of seed below 0.5 grams will be recorded as a zero. Yes, we do have a range of seed weights reaching 23 grams, and a small portion might be below 1 gram, but in this case the zeros are not really zeros; they approximate a true unknown record between 0 and 0.5 grams.

When, during an initial exploration of the dataset, we discover that there are lots of zeros, we need to question why they are occurring. Of course, conversations with the researcher and the staff doing the data recording will give critical insight. This should help us distinguish the true zeros from the false ones. If there are no missing values recorded in the data, then we might suspect that some of these zeros are missing values. Here is where I like to explore additional columns (e.g., survival notes) to help ‘recover’ the missing values. However, it might be impossible to discriminate between the true zeros and the missing values if this extra information was not recorded in the dataset. This unfortunate situation, to the misfortune of my collaborators, might mean that the dataset must be completely discarded.

In the case of missing values due to detection limits, the best approach is to ask the researcher. Here, I like to first make sure that this is really the case, and from there make an educated decision on how to analyse the data. Replacing undetected observations by a zero creates two undesired issues:

1. A bias, as these values are not zero; for example, in our previous case they have an average value of 0.25 grams (i.e., half the detection limit), and
2. Reduced background variability, as all undetected observations are recorded with exactly the same value when in fact they are not identical, but we can’t see this variability!

Finally, there is another reason for me to hate zeros. Suppose that they are all verified valid numbers, but that we still have a high proportion of zeros in our dataset. For example, in a study on fruit yield, I might have 20% of live plants producing no fruit, resulting in 20% true zeros in my dataset. This large proportion of zeros creates difficulties for traditional statistical analyses. For example, when fitting a linear model, the assumption of an approximate Normal distribution might no longer hold, and this will be reflected in residual plots with a strange appearance!

So, what is the solution for this ‘excess’ of zeros? In some cases, a simple transformation could reduce the influence of these zeros in my analyses. Often, the most logical alternative is to rethink the biological process to model, and this might require something different from our typical statistical tools. For example, we could separate the process into two parts. The first part separates the zeros from the non-zeros using a Binomial model that includes several explanatory variables (e.g., age, size, sex). The second part deals only with the non-zero values and fits another model based on, say, a Normal distribution, which will include the same or other explanatory variables, but in this case we model the magnitude of the response. This is the basis of some of the Hurdle models, but other statistical approaches, particularly Bayesian, are also available.

In summary, I have many reasons to hate zeros, and you might have a few additional ones.
However, I believe they are a critical part of data exploration: not only can they be the tip of an iceberg leading to a better understanding and modelling of the process under which the data was obtained, but they also help to identify potentially more adequate models to describe the system. Hence, perhaps I should embrace the zeros in my dataset and not be so angry about them!

Salvador A. Gezan

April, 2021

[LinkedIn](https://www.linkedin.com/in/salvador-gezan-54768a1a/)

Copyright © 2000-2021 VSN International Ltd.