Dr. Salvador A. Gezan
21 June 2022

At present, with the rapid development of genotyping, we have access to genomic data that we can use with different statistical and computational approaches, for example, to accelerate genetic gains and to select outstanding genotypes as parents for commercial release. Genomic data is also useful to assess genetic variability and diversity in the context of breeding programs or other research studies.
Most of the genomic data used for breeding comes from single nucleotide polymorphisms (SNPs). Typically, we see this data as a matrix containing several individuals, often thousands, genotyped for a number of SNPs (i.e., nucleotide readings AA, AC, TT, etc.). Depending on the quality and characteristics of this data, we can use it for different analytical purposes.
In this blog, we will describe some of the uses of this SNP data. We will focus mainly on the available number of SNPs (i.e., markers) and what analyses these enable. Note, the classification of low-, medium- and high-density panels is somewhat arbitrary but it’s what’s often used in plant breeding.
Let’s start with what is known as low-density (LD) panels. These typically contain fewer than 200 SNPs, and are the cheapest option you can have (often just a couple of US dollars per sample). With this small number of SNP markers, their main use is for quality assurance (QA) and quality control (QC). That is, they can be used for:
Other uses are possible but they can be limited. For example, we could identify sibships among some individuals (e.g., full-sibs), and it may be possible to calculate some population statistics of genetic diversity. However, these uses can yield high levels of uncertainty.
An important aspect of LD panels is that they often include markers for verification, and in addition some markers will be missing due to genotyping issues. Hence, their effective size will be lower than their nominal size. Another important consideration is that these commercial panels are often constructed for general use; hence, they may not be based exactly on the population of interest. This will lead to some fixed (MAF = 0%) or almost fixed (MAF < 2%) markers that contribute little or nothing to the above uses.
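To make the idea of effective versus nominal size concrete, here is a minimal Python sketch that counts how many markers survive basic filtering. It assumes genotypes coded as 0/1/2 copies of one allele with `None` for a missing call. The function names, the toy panel and the 10% missing-call limit are assumptions for illustration only; the 2% MAF threshold follows the 'almost fixed' cut-off mentioned above.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP coded as 0/1/2 allele counts; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None  # nothing was genotyped for this marker
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def effective_markers(panel, maf_min=0.02, max_missing=0.10):
    """Return the names of markers that pass the missing-rate and MAF filters.

    max_missing is a hypothetical limit chosen for this example.
    """
    kept = []
    for name, genotypes in panel.items():
        missing_rate = sum(g is None for g in genotypes) / len(genotypes)
        maf = minor_allele_frequency(genotypes)
        if maf is None or missing_rate > max_missing:
            continue  # too many missing calls
        if maf < maf_min:
            continue  # fixed or almost fixed marker
        kept.append(name)
    return kept


# Toy panel: four SNPs genotyped on ten individuals (made-up data).
panel = {
    "snp1": [0, 1, 2, 1, 0, 2, 1, 1, 0, 2],           # polymorphic, complete
    "snp2": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],           # fixed (MAF = 0)
    "snp3": [1, None, None, 2, 0, None, 1, 1, 0, 2],  # 30% missing calls
    "snp4": [2, 2, 2, 2, 2, 2, 2, 2, 2, 1],           # MAF = 0.05
}

kept = effective_markers(panel)
print(f"nominal size: {len(panel)}, effective size: {len(kept)}")
```

In this toy case only `snp1` and `snp4` are kept, so a nominal panel of four markers has an effective size of two.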
Despite these shortcomings, LD panels constitute a very good and cheap alternative for small programs to start considering genomic tools in their breeding. For example, verifying crosses is a critical step!
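As an illustration of how a handful of SNPs can verify a cross, the sketch below counts Mendelian inconsistencies between an offspring and its two putative parents. This is one common way to approach the problem, not necessarily the procedure used in any particular program. Genotypes are assumed to be coded as 0/1/2 allele counts, and the function names and toy data are hypothetical. In practice a small tolerance for genotyping error would be needed before rejecting a cross.

```python
def is_mendelian_consistent(child, parent1, parent2):
    """Check one SNP: can the two parents produce the child's genotype?

    Genotypes are counts (0/1/2) of the same allele.
    """
    # Alleles (0 or 1 copy) that each parental genotype can transmit.
    gametes = {0: {0}, 1: {0, 1}, 2: {1}}
    return any(a + b == child
               for a in gametes[parent1]
               for b in gametes[parent2])


def mismatch_rate(child, parent1, parent2):
    """Proportion of SNPs, with all three calls present, that are inconsistent."""
    trios = [(c, p, q) for c, p, q in zip(child, parent1, parent2)
             if None not in (c, p, q)]
    if not trios:
        return None
    bad = sum(not is_mendelian_consistent(c, p, q) for c, p, q in trios)
    return bad / len(trios)


# Toy data for six SNPs (made up for this example).
mother = [0, 2, 1, 2, 0, 1]
father = [2, 2, 0, 1, 0, 1]
true_child = [1, 2, 0, 2, 0, 1]
suspect_child = [0, 2, 2, 0, 1, 1]

print(mismatch_rate(true_child, mother, father))
print(mismatch_rate(suspect_child, mother, father))
```

A rate near zero supports the recorded cross, while a high rate (four of six SNPs for the suspect individual here) suggests a pedigree or labelling error.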
The medium-density (MD) panels are defined very differently depending on the field. Here, we will consider the plant breeding definition of between 2,000 and 10,000 SNPs (i.e., 2K to 10K). In animal breeding, MD panels will be at least 5 times larger. This larger number of SNPs offers more opportunities for genomic analysis, but the cost of an MD panel is several times greater than that of an LD panel. For most breeding programs, this increased cost often limits the number of individuals that can be genotyped. In addition to the uses for LD panels mentioned above, for MD panels we also have:
In addition, MD panels can also include several markers previously detected for use in Marker Assisted Selection (MAS) or for sex determination. Given the larger number of SNPs, the presence of verification, missing or fixed markers is of less concern, but should be kept under control in order to maximize the usefulness of the panel. Good or poor selection of SNP markers can make a large difference to the lifespan of the panel and the accuracy of the genomic prediction (GP) models. Hence, it is recommended to ensure markers are well selected for the construction of these panels.
At present, MD panels should be the panel of choice for most breeding operations. These panels will ensure that the collected data is reasonably good for use in the future, especially as technology and prices will change, justifying the genotyping investment in the long term. In addition, these represent the best panels for breeding programs to start applying (and playing with) more sophisticated genomic tools, such as genomic prediction with GBLUP and/or Bayes B.
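To give a flavour of what GBLUP needs from the marker data, here is a minimal sketch of a genomic relationship matrix built with a widely used construction (VanRaden's first method): center the 0/1/2 marker matrix by twice the allele frequency and scale by 2Σp(1 − p). The function name and the toy data are assumptions for illustration. Real analyses use thousands of filtered markers, handle missing calls, and rely on dedicated software.

```python
def genomic_relationship_matrix(markers):
    """Build G = ZZ' / (2 * sum(p * (1 - p))) from a complete 0/1/2 matrix.

    markers: list of individuals, each a list of allele counts (no missing calls).
    """
    n = len(markers)
    m = len(markers[0])
    # Allele frequency of each SNP, estimated from the genotyped individuals.
    p = [sum(ind[j] for ind in markers) / (2 * n) for j in range(m)]
    scale = 2 * sum(pj * (1 - pj) for pj in p)
    if scale == 0:
        raise ValueError("all markers are fixed; G cannot be scaled")
    # Center each marker column by twice its allele frequency.
    z = [[ind[j] - 2 * p[j] for j in range(m)] for ind in markers]
    return [[sum(z[i][k] * z[j][k] for k in range(m)) / scale
             for j in range(n)]
            for i in range(n)]


# Toy data: four individuals, four SNPs; the first two share identical genotypes.
markers = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

G = genomic_relationship_matrix(markers)
for row in G:
    print([round(value, 3) for value in row])
```

The resulting matrix takes the place of the pedigree-based relationship matrix in the mixed model. Note that the two individuals with identical genotypes obtain an off-diagonal value equal to their diagonal values.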
Finally, we have the high-density (HD) panels with more than 20,000 SNPs (20K). But again, this is relative. Most HD panels for plants and aquatic species are in the range of 30K to 70K, but in some species the size considered is in excess of 700K (as in dairy) or even in the millions (as in humans). The cost of an HD panel varies greatly with the number of SNPs considered, the technology used, and number of individuals to genotype. The uses that these panels provide, in addition to all the previous uses for LD and MD panels, are:
These HD panels are often very redundant in their information. Hence, they tend to present a good portion of fixed alleles, and missing data that rules many markers out. However, they are still large enough to support all of the uses mentioned above. It is also possible that these panels can be used by a combination of programs or research groups, allowing: 1) pooling of resources, 2) price negotiation, and 3) sharing of genomic information. This is commonly seen in some genomic consortiums. In addition, these HD panels, if originally constructed using a diverse base population, might not require many future revisions.
Another important aspect of HD panels is that they constitute the raw information for imputation in MD panels (and, although only remotely possible, in LD panels) and they will be the ones that in the future will link the array of available panels used in a breeding program (including LD, MD or panels from different groups). Hence, HD panels allow us to connect different sources of genomic data.
One important note is in relation to the number of markers required for genomic selection (GS). We recommend that at least an MD panel is used for this, with no fewer than 2K useful (post-filtering) SNPs. It is interesting to note that some studies have reported that dropping the number of SNPs to 1K or fewer results in a considerable loss of accuracy of the GP models. Also, using more than 10K SNPs often does not yield considerably better accuracies than 5K SNPs. In addition, some studies have successfully focused on 2-3K SNP panels supported with imputation from an HD panel, with an interesting increase in accuracy. Hence, there are plenty of options to exploit MD panels.
Another aspect is in relation to maximizing the accuracy of GP models. Of course, the more informative SNPs available the better! But there are many other aspects that affect, in good or bad ways, the success of a genomic model. Examples include the level of relatedness between the training and evaluation populations, linkage disequilibrium, the genetic architecture of the traits, and of course the heritability of the traits of interest. All of these elements may eventually tilt the decision from one type of panel to another.
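As a small illustration of one of these factors, linkage disequilibrium between two SNPs is often summarized with r². The sketch below uses the squared correlation of 0/1/2 allele counts, which is a genotype-based approximation to the haplotype-based statistic. The function name and toy data are assumptions for this example.

```python
def genotype_r2(snp_a, snp_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts.

    Individuals with a missing call (None) at either SNP are skipped.
    Returns None if fewer than two individuals remain or if a SNP is fixed.
    """
    pairs = [(a, b) for a, b in zip(snp_a, snp_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None  # a fixed marker carries no LD information
    return cov * cov / (var_a * var_b)


# Toy data for six individuals (made up for this example).
snp_a = [0, 1, 2, 1, 0, 2]
snp_b = [2, 1, 0, 1, 2, 0]   # mirrors snp_a exactly (alleles coded in reverse)
snp_c = [1, 1, 1, 1, 1, 1]   # fixed marker

print(genotype_r2(snp_a, snp_b))
print(genotype_r2(snp_a, snp_c))
```

Pairs of markers with r² close to 1 carry largely the same information, which is one reason why adding more SNPs does not always improve accuracy.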
Another difficulty that can arise is the detection of markers associated with some traits that are present in the population at very low rates, such as the case of ‘standing genetic variation’. This implies that it is difficult to find these markers on most panels as they will tend to be dropped early. Therefore, specific or very large HD panels might be needed in these cases.
Finally, as mentioned before, the use of LD or MD panels requires a careful pre-selection of markers to make the most of these panels. If this is done poorly, or for another population (e.g., using an available panel developed for another breeding group), then the benefits of the corresponding panels will possibly be greatly reduced. This also implies that, particularly for the LD panel, the set of markers in use has to be constantly reviewed as the population changes over time or new markers from MAS or sex determination are discovered.
As the cost and offering of these panels change constantly, we suspect that at some point we will be able to afford HD panels for a few cents (or pennies)! But before we get there, we need to make the most of our current resources, and gathering the right data for the right analysis is critical.
Dr. Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world.
Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook Statistical Methods in Biology: Design and Analysis of Experiments and Regression.