Why machine learning is not (yet) working for genomic prediction

Dr. Salvador A. Gezan

10 March 2021

In plant and animal breeding, the use of genomic prediction has become widespread, and it is currently being implemented in many species, resulting in increased genetic gains. In genomic prediction, thousands of SNP markers are used as inputs to predict the performance of genotypes. A good model allows the performance of a genotype to be estimated before it is phenotypically measured, which permits cheaper and earlier selection and accelerates breeding programs.

At present, most of these predictive models use SNP marker information to fit linear models, where each marker is associated with an estimated effect. These models incorporate our current understanding of the accumulation of allele effects and of the infinitesimal model, where the phenotypic response of an individual is the result of hundreds or thousands of QTLs with small effects.
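As a minimal sketch of the kind of linear marker model described above, the following Python code fits one effect per marker by ridge regression (an RR-BLUP-style estimator) on simulated data. Everything here is an assumption made for illustration: the 0/1/2 genotype coding, the simulated effects, the sample sizes, and the function names. The penalty is set to the ratio of residual variance to marker-effect variance, which is known here only because the data are simulated; in practice it would be estimated from the data.

```python
import numpy as np

def fit_marker_effects(X, y, lam):
    """Ridge (RR-BLUP-style) estimates of marker effects.

    X: (n, p) genotypes coded 0/1/2, y: (n,) phenotypes, lam: ridge penalty.
    """
    mu = y.mean()
    col_means = X.mean(axis=0)
    Xc = X - col_means                      # centre each marker
    n = Xc.shape[0]
    # Dual form of ridge regression: solve an n x n system (cheap when p >> n)
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(n), y - mu)
    beta = Xc.T @ alpha                     # one estimated effect per marker
    return mu, col_means, beta

def predict(X_new, mu, col_means, beta):
    """Predicted performance: mean plus the sum of marker effects."""
    return mu + (X_new - col_means) @ beta

# Simulated data (hypothetical sizes and effect variances)
rng = np.random.default_rng(1)
n, p = 300, 2000
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
true_beta = rng.normal(0.0, 0.05, size=p)   # many small effects
g = X @ true_beta                           # true genetic values
y = g + rng.normal(0.0, g.std(), size=n)    # heritability of about 0.5

lam = g.var() / 0.05**2                     # residual / marker-effect variance
train, test = np.arange(200), np.arange(200, 300)
mu, cm, beta = fit_marker_effects(X[train], y[train], lam)
y_hat = predict(X[test], mu, cm, beta)      # genotypes not used in training
accuracy = np.corrcoef(y_hat, y[test])[0, 1]
print(f"predictive correlation on held-out genotypes: {accuracy:.2f}")
```

The dual form is used only because there are far more markers than individuals; it gives the same marker effects as solving the p x p ridge system directly.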

Machine learning - the holy grail?

Machine learning has become widely used in many areas over the last few years. Machine learning is a methodology in which computers are trained with large amounts of data to make predictions. There are many methods, but some of the most common are neural networks, random forests, and decision trees. In machine learning you do not need to understand the biological system; briefly, you provide the computer algorithm with huge amounts of training data and you obtain a predictive system that can be used to estimate responses. Of course, its implementation is more complex than this description, and a critical part is evaluating the quality of the predictive system obtained.
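One common way to evaluate the quality of a predictive system is k-fold cross-validation: the data are split into k parts, and each part in turn is predicted by a model trained on the others. The sketch below is illustrative only; the function name `k_fold_accuracy`, the toy least-squares model, and the simulated data are assumptions, and any fitting method could be plugged in through the `fit` and `predict` arguments.

```python
import numpy as np

def k_fold_accuracy(X, y, fit, predict, k=5, seed=0):
    """Correlation between predicted and observed values in each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])          # train without the test fold
        y_hat = predict(model, X[test])          # predict the held-out fold
        scores.append(np.corrcoef(y_hat, y[test])[0, 1])
    return np.array(scores)

# Toy model: ordinary least squares with an intercept
def fit_ols(X, y):
    design = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Simulated data with a strong signal (hypothetical values)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0.0, 0.5, size=100)

scores = k_fold_accuracy(X, y, fit_ols, predict_ols, k=5)
print("mean predictive correlation:", scores.mean().round(3))
```

Reporting the score from held-out folds, rather than from the training data, is what guards against an over-optimistic view of a flexible model.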

Machine learning has proven very useful in many practical applications, for example in classifying images to differentiate pictures of cats from dogs. Machine learning methods therefore seem the logical tool for genomic prediction, particularly as we can have a set of genomic data for our crop of interest with up to 200,000 SNPs obtained from hundreds or even thousands of individuals.

There have been several studies on the use of machine learning in genomic prediction, but the results have often been disappointing. In these studies, our traditional genomic prediction methods (such as BayesB and GBLUP) have consistently been superior to most machine learning algorithms. Based on these studies, we are tempted to say that machine learning is not working for breeding and genomics. Yet this is a surprising result for a tool that is constantly praised in the media as very powerful and that is often associated with solving many everyday predictive problems.

Where machine learning is at a disadvantage…for now

So, currently machine learning is not a good option for use in genomic prediction. However, it is my belief that machine learning is still at a disadvantage against other genomic prediction methods, and that with time it might become as good as other approaches, or even the gold standard. Some of the reasons for this statement are detailed below.

  • Machine learning requires large, often very large, amounts of data. These are usually not available in most of our current breeding programs. It is true that we have thousands, or even millions, of SNPs, but these are individually poor in information and highly correlated. In addition, the phenotypic records used to train these machine learning tools are probably only in the thousands, not in the hundreds of thousands or millions reported in other fields where machine learning has been used successfully.
  • We have a pretty good understanding of gene action. Note that machine learning is often a black box, where our understanding of the biological system is ignored. However, for our genomic prediction models, we have good clarity on how alleles accumulate to produce additive effects, and this can be extended to dominance effects. This, together with the dynamics of Mendelian and Fisherian genetics, where we have either a few QTLs with strong influences or a large number of QTLs with small influences, has led us to use marker-assisted selection and pedigree-based analyses successfully over the last 50 years.
  • We have an important gap between the computer scientists developing the machine learning tools we can use and the breeders or quantitative geneticists who would apply them. In most successful breeding programs there is a strong statistical component for the design and analysis of experiments, and now, with the use of genomic data, we have extended our models from pedigree-based analyses to molecular-based analyses or a combination of both. However, the use of computationally intensive and rapidly evolving machine learning methods has been elusive for most breeding programs, and in some cases this is accompanied by a lack of understanding of the software that trains the machine learning models.
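The first point above, that many SNPs do not mean much independent information, can be illustrated with a short sketch. With n individuals, the centred genotype matrix has rank of at most n - 1 no matter how many markers are genotyped. The sizes and allele frequency below are hypothetical, and the simulated markers are independent, so real data with linkage disequilibrium would be even more redundant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5000                       # few individuals, many markers
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)

Xc = X - X.mean(axis=0)                # centre each marker
rank = np.linalg.matrix_rank(Xc)       # number of independent dimensions

print(f"markers: {p}, individuals: {n}, rank of genotype matrix: {rank}")
# The rank is bounded by n - 1, so the number of phenotyped individuals,
# not the number of markers, limits the independent information available.
```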

The routine implementation of machine learning in breeding programs will take some time. But as we accumulate information, and as we learn and interact with machine learning software and its routines, we will slowly see it being used in our crops. This will not mean the end of our more traditional tools or their replacement by machine learning applications. Our current understanding of the biology and the specific nature of our crops will still make our current toolbox valuable. It is our understanding that, at present, machine learning is not ready for breeding, but in due time it will creep up next to us!

About the author

Dr. Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world. 

Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook Statistical Methods in Biology: Design and Analysis of Experiments and Regression.