How many SNP markers do I need for my genomic analyses?

Dr. Salvador A. Gezan

21 June 2022

At present, with the rapid development of genotyping technologies, we have access to genomic data that we can use with different statistical and computational approaches, for example, to accelerate genetic gains and to select outstanding genotypes as parents for commercial release. Genomic data is also useful to assess genetic variability and diversity in the context of breeding programs or other research studies.

The power of genomic data

Most of the genomic data used for breeding comes from single nucleotide polymorphisms (SNPs). Typically, we see this data as a matrix containing several individuals, often thousands, genotyped for a number of SNPs (i.e., nucleotide readings AA, AC, TT, etc.). Depending on the quality and characteristics of this data, we can use it for different analytical purposes.
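To make the idea of a marker matrix concrete, here is a minimal Python sketch of how nucleotide readings are often recoded as allele counts (0, 1 or 2) before analysis. The function name, the toy data and the choice of reference allele per SNP are all assumptions made for illustration; real pipelines usually take the reference allele from the panel's own documentation.

```python
import numpy as np

def encode_genotypes(calls, ref_alleles):
    """Recode two-letter nucleotide calls as counts of the non-reference allele.

    calls: list of rows (one per individual), each a list of calls such as "AC";
           an empty string or None is treated as a missing reading.
    ref_alleles: one (assumed) reference allele per SNP.
    Returns a float matrix with values 0, 1, 2, or NaN for missing readings.
    """
    coded = np.full((len(calls), len(ref_alleles)), np.nan)
    for i, row in enumerate(calls):
        for j, call in enumerate(row):
            if call and len(call) == 2:
                # Count how many of the two alleles differ from the reference
                coded[i, j] = sum(allele != ref_alleles[j] for allele in call)
    return coded

# Toy example: 3 individuals genotyped for 3 SNPs (hypothetical data)
calls = [
    ["AA", "AC", "TT"],
    ["AG", "CC", ""],    # third SNP is missing for this individual
    ["GG", "AA", "TC"],
]
ref_alleles = ["A", "A", "T"]

M = encode_genotypes(calls, ref_alleles)
print(M)
```

Rows are individuals and columns are SNPs, which is the layout assumed by most of the analyses discussed in this post.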

In this blog post, we will describe some of the uses of this SNP data. We will focus mainly on the number of SNPs (i.e., markers) available and on the analyses that each panel size enables. Note that the classification into low-, medium- and high-density panels is somewhat arbitrary, but it is the one often used in plant breeding.

Low-density (LD) panels

Let’s start with what is known as low-density (LD) panels. These typically contain fewer than 200 SNPs, and are the cheapest option you can have (often just a couple of US dollars per sample). With this small number of SNP markers, their main use is for quality assurance (QA) and quality control (QC). That is, they can be used for:

  • Verification of Crosses. If parents are genotyped, then it is possible to check that the correct crosses were performed.
  • Parentage Reconstruction. If parents are genotyped, it is possible to reconstruct the full pedigree of a group of offspring. 
  • Marker Assisted Selection. If a preliminary group of markers were identified to be associated with one or more QTLs of interest, these can easily be incorporated into the panel and used to discriminate genotypes.
  • Population Assignment. A group of markers can be used to assign individuals to different populations (e.g., origins).
  • Assign Sex. Depending on the dynamics of the organism, sometimes it is possible to identify one or more markers that can be used to assign sex to the individuals before they mature.
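As an illustration of the first use listed above, verification of crosses, the following Python sketch counts Mendelian inconsistencies between an offspring and its recorded parents. It assumes biallelic SNPs coded as allele counts (0, 1, 2, with NaN for missing); the function name and the toy genotypes are hypothetical. In practice some tolerance for genotyping error is needed, so the decision threshold on the error rate is left to the user.

```python
import numpy as np

# Alleles (0 or 1) that a parent with a given genotype can transmit
GAMETES = {0: {0}, 1: {0, 1}, 2: {1}}

def mendelian_error_rate(offspring, mother, father):
    """Fraction of jointly genotyped SNPs where the offspring genotype
    cannot be produced by the recorded mother and father."""
    errors = 0
    compared = 0
    for o, m, f in zip(offspring, mother, father):
        if np.isnan(o) or np.isnan(m) or np.isnan(f):
            continue  # skip SNPs with any missing reading
        allowed = {a + b for a in GAMETES[int(m)] for b in GAMETES[int(f)]}
        compared += 1
        errors += (int(o) not in allowed)
    return errors / compared if compared else float("nan")

# Toy example with 6 SNPs (hypothetical data)
mother = [0, 2, 1, 0, 2, 1]
father = [2, 2, 0, 1, 0, np.nan]
true_offspring = [1, 2, 0, 1, 1, 0]
wrong_offspring = [0, 0, 2, 2, 1, 1]

rate_true = mendelian_error_rate(true_offspring, mother, father)
rate_wrong = mendelian_error_rate(wrong_offspring, mother, father)
print(rate_true, rate_wrong)
```

A rate near zero supports the recorded cross, while a high rate suggests a mislabelled parent or sample.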

Other uses are possible but they can be limited. For example, we could identify sibships among some individuals (e.g., full-sibs), and it may be possible to calculate some population statistics of genetic diversity. However, these uses can yield high levels of uncertainty.

An important aspect of LD panels is that they often include markers used only for verification, and in addition some markers will be missing due to genotyping issues. Hence, their effective size will be lower than their nominal size. Another important consideration is that these commercial panels are often constructed for general use; hence, they may not be based exactly on the population of interest. This will lead to some fixed (MAF = 0%) or almost fixed (MAF < 2%) markers that contribute little or nothing to the above uses.
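The gap between nominal and effective panel size can be checked with a simple filter, sketched below in Python. It assumes SNPs coded as allele counts (0, 1, 2, NaN for missing). The 2% MAF cut-off follows the figure mentioned above, while the 90% call-rate cut-off and the function name are assumptions chosen only for illustration.

```python
import numpy as np

def filter_markers(M, min_maf=0.02, min_call_rate=0.90):
    """Flag SNPs (columns) that pass minor allele frequency and call-rate filters."""
    M = np.asarray(M, dtype=float)
    n_called = (~np.isnan(M)).sum(axis=0)       # non-missing readings per SNP
    call_rate = n_called / M.shape[0]
    allele_sum = np.nansum(M, axis=0)
    # Allele frequency per SNP; stays NaN when a SNP has no readings at all
    freq = np.divide(allele_sum, 2 * n_called,
                     out=np.full(M.shape[1], np.nan), where=n_called > 0)
    maf = np.minimum(freq, 1 - freq)
    keep = (call_rate >= min_call_rate) & (maf >= min_maf)
    return keep, maf, call_rate

# Toy example: 5 individuals, 4 SNPs (hypothetical data)
M = np.array([
    [0, 0, 2, 1],
    [1, 0, 2, np.nan],
    [2, 0, 2, np.nan],
    [1, 0, 2, 0],
    [0, 0, 2, 2],
])

keep, maf, call_rate = filter_markers(M)
print("nominal size:", M.shape[1], "effective size:", int(keep.sum()))
```

Here the second and third SNPs are fixed and the fourth has too many missing readings, so only one of the four nominal markers remains useful.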

Despite these shortcomings, LD panels constitute a very good and cheap way for small programs to start incorporating genomic tools into their breeding. Verifying crosses, for example, is a critical step!

Medium-density (MD) panels

Medium-density (MD) panels are defined very differently depending on the field. Here, we will consider the plant breeding definition of between 2,000 and 10,000 SNPs (i.e., 2K to 10K). In animal breeding, MD panels will be at least 5 times larger. The cost of an MD panel is several times greater than that of an LD panel, and for most breeding programs this increased cost limits the number of individuals that can be genotyped. However, the larger number of SNPs offers more opportunities for genomic analysis. In addition to the uses for LD panels mentioned above, for MD panels we also have:

  • Genomic Relationship Matrix Estimation. It is now possible to estimate with reasonable accuracy the relatedness between any pair of individuals, and these matrices can be used for many other objectives.
  • Genomic Prediction Models. MD panels allow us to fit genomic prediction (GP) models. Their accuracy tends to be on the low side, but it is still sufficient for operational use.
  • Marker Imputation. Missing marker data, if complemented with individuals genotyped using a high-density (HD) panel, can be imputed successfully, and that data can be used for other purposes, like genomic prediction.  
  • Genetic Linkage Maps. This large number of markers allows for the construction of reliable linkage (or genetic) maps with plenty of other uses, such as imputation.
  • QTL Analysis. An MD panel has a reasonable number of markers to perform QTL analysis, for example on recombinant inbred line (RIL) populations.
  • Diversity Studies. It is easier to perform genomic studies that deal with diversity, as it is now possible to track quantities such as inbreeding and effective population size over generations.
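As a sketch of the genomic relationship matrix mentioned in the list above, the Python code below uses one widely used estimator (VanRaden's first method), in which the centered marker matrix Z = M - 2p gives G = ZZ' / (2 Σ p(1 - p)). The post does not prescribe a particular estimator, so this choice is an assumption, as are the function name and the simulated genotypes. The code also assumes a complete marker matrix, so missing readings would need to be imputed first.

```python
import numpy as np

def vanraden_grm(M):
    """Genomic relationship matrix from allele counts (0/1/2), with
    individuals in rows, SNPs in columns, and no missing values."""
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0              # observed allele frequency per SNP
    Z = M - 2.0 * p                       # center each SNP column
    denom = 2.0 * np.sum(p * (1.0 - p))   # scales G to the pedigree-like scale
    return Z @ Z.T / denom

# Simulated panel: 6 individuals, 2,000 SNPs (hypothetical data)
rng = np.random.default_rng(1)
n_ind, n_snp = 6, 2000
p_true = rng.uniform(0.1, 0.9, size=n_snp)
M = rng.binomial(2, p_true, size=(n_ind, n_snp)).astype(float)

G = vanraden_grm(M)
print(np.round(G, 2))
```

The off-diagonal elements estimate relatedness between pairs of individuals; with more markers these estimates become more stable, which is one reason this use is listed for MD rather than LD panels.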

In addition, MD panels can also include several markers previously detected for use in Marker Assisted Selection (MAS) or for sex determination. Given the larger number of SNPs, the presence of verification, missing or fixed markers is of less concern, but it should be kept under control in order to maximize the usefulness of the panel. Good or poor selection of SNP markers can make a large difference to the useful life of the panel and to the accuracy of the GP models. Hence, it is recommended to ensure that markers are well selected when constructing these panels.

At present, MD panels should be the panel of choice for most breeding operations. These panels will ensure that the collected data is reasonably good for use in the future, especially as technology and prices change, justifying the genotyping investment in the long term. In addition, these represent the best panels for breeding programs to start applying (and playing with) more sophisticated genomic tools, such as genomic prediction with GBLUP and/or Bayes B.
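For readers who have not yet seen GBLUP in code, the following Python sketch shows the core prediction step: ĝ(new) = G(new,train) (G(train,train) + λI)⁻¹ (y - μ), with λ = σ²e / σ²g. It is a simplified illustration and not a replacement for dedicated mixed-model software: the variance ratio λ is assumed known rather than estimated, the only fixed effect is an overall mean, and all names and simulated data are hypothetical.

```python
import numpy as np

def gblup_predict(G, y_train, train_idx, test_idx, lam):
    """Predict genomic values of test individuals from phenotyped training
    individuals, given a genomic relationship matrix G and lam = var_e / var_g."""
    G = np.asarray(G, dtype=float)
    y = np.asarray(y_train, dtype=float)
    G_tt = G[np.ix_(train_idx, train_idx)]   # training x training block
    G_nt = G[np.ix_(test_idx, train_idx)]    # test x training block
    V = G_tt + lam * np.eye(len(train_idx))
    ones = np.ones(len(train_idx))
    # Generalized least squares estimate of the overall mean
    mu = (ones @ np.linalg.solve(V, y)) / (ones @ np.linalg.solve(V, ones))
    return G_nt @ np.linalg.solve(V, y - mu)

# Simulated data: 120 individuals, 500 SNPs (hypothetical)
rng = np.random.default_rng(7)
n_ind, n_snp = 120, 500
p = rng.uniform(0.1, 0.9, size=n_snp)
M = rng.binomial(2, p, size=(n_ind, n_snp)).astype(float)

# Relationship matrix from centered markers (VanRaden-type, an assumption)
freq = M.mean(axis=0) / 2.0
Z = M - 2.0 * freq
G = Z @ Z.T / (2.0 * np.sum(freq * (1.0 - freq)))

# Phenotype = genomic value + noise, scaled so heritability is about 0.5
g = Z @ rng.normal(0.0, 1.0, size=n_snp)
g = g / g.std()
y = 10.0 + g + rng.normal(0.0, 1.0, size=n_ind)

train_idx = np.arange(100)
test_idx = np.arange(100, 120)
g_hat = gblup_predict(G, y[train_idx], train_idx, test_idx, lam=1.0)
print(np.round(g_hat[:5], 3))
```

In real analyses the variance components are estimated from the data (for example by REML), and the accuracy obtained depends heavily on how related the training and evaluation individuals are.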

High-density (HD) panels

Finally, we have the high-density (HD) panels with more than 20,000 SNPs (20K). But again, this is relative. Most HD panels for plants and aquatic species are in the range of 30K to 70K, but in some species the size considered is in excess of 700K (as in dairy) or even in the millions (as in humans). The cost of an HD panel varies greatly with the number of SNPs considered, the technology used, and the number of individuals to genotype. The uses that these panels provide, in addition to all the previous uses for LD and MD panels, are:

  • High Accuracy Genomic Prediction Models. We expect greater accuracy from these panels, possibly with a long useful life that does not require tuning over time, and with little or no imputation required.
  • Genome-wide Association Studies (GWAS). This is the key data for marker discovery with GWAS, where it is more likely that a marker will be positioned directly on a coding region within a functional gene.
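To give a flavour of what a GWAS scan does, the Python sketch below regresses a trait on each SNP in turn and reports the slope and its t-statistic. This is a deliberately naive single-marker scan on simulated data: it assumes complete, polymorphic markers and ignores population structure and relatedness, which real GWAS analyses normally account for (for example with mixed models). The function name and the simulated effect are hypothetical.

```python
import numpy as np

def single_marker_scan(M, y):
    """Simple linear regression of y on each SNP column of M.
    Returns the estimated slope and t-statistic for every SNP."""
    M = np.asarray(M, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Mc = M - M.mean(axis=0)
    yc = y - y.mean()
    sxx = (Mc ** 2).sum(axis=0)              # sum of squares per SNP
    sxy = Mc.T @ yc                          # cross-products with the trait
    beta = sxy / sxx
    resid_ss = (yc ** 2).sum() - beta * sxy  # residual sum of squares per SNP
    se = np.sqrt(resid_ss / (n - 2) / sxx)
    return beta, beta / se

# Simulated data: 200 individuals, 50 SNPs, one causal SNP (hypothetical)
rng = np.random.default_rng(3)
n_ind, n_snp, causal = 200, 50, 10
p = rng.uniform(0.2, 0.8, size=n_snp)
M = rng.binomial(2, p, size=(n_ind, n_snp)).astype(float)
y = 2.0 * M[:, causal] + rng.normal(0.0, 1.0, size=n_ind)

beta, t = single_marker_scan(M, y)
top = int(np.nanargmax(np.abs(t)))
print("strongest signal at SNP index:", top)
```

With a denser panel there are more columns to scan, which raises the chance that one of them sits on or very near the causal variant.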

These HD panels are often very redundant in their information. They also tend to present a good portion of fixed alleles, and missing data that rules many markers out. However, they are still sufficiently large to support all of the uses mentioned above. It is also possible for these panels to be shared by a combination of programs or research groups, allowing: 1) pooling of resources, 2) price negotiation, and 3) sharing of genomic information. This is commonly seen in some genomic consortia. In addition, these HD panels, if originally constructed using a diverse base population, might not require many future revisions.

Another important aspect of HD panels is that they constitute the raw information for imputation in MD panels (and, although only remotely possible, in LD panels), and in the future they will be the ones that link the array of available panels used in a breeding program (including LD, MD, or panels from different groups). Hence, HD panels allow us to connect different sources of genomic data.

Considerations for genomic selection

One important note relates to the number of markers required for genomic selection (GS). We recommend that at least an MD panel is used for this, with no fewer than 2K useful (post-filtering) SNPs. It is interesting to note that some studies have reported that dropping the number of SNPs to 1K or fewer results in a considerable loss of accuracy of the GP models. Also, using more than 10K SNPs often does not yield considerably better accuracies than using 5K SNPs. In addition, some studies have successfully focused on 2-3K SNP panels supported with imputation from an HD panel, with an interesting increase in accuracy. Hence, there are plenty of options to exploit MD panels.

Maximizing GP model accuracy

Another aspect relates to maximizing the accuracy of GP models. Of course, the more informative SNPs available the better! But there are many other aspects that affect, in good or bad ways, the success of a genomic model. Examples include the level of relatedness between the training and evaluation populations, linkage disequilibrium, the genetic architecture of the traits, and of course the heritability of the traits of interest. All of these elements may eventually tilt the decision from one type of panel to another.

Overcoming challenges with specific traits

Another difficulty that can arise is the detection of markers associated with some traits that are present in the population at very low rates, such as the case of ‘standing genetic variation’. This implies that it is difficult to find these markers on most panels as they will tend to be dropped early. Therefore, specific or very large HD panels might be needed in these cases.

Importance of marker pre-selection

Finally, as mentioned before, the use of LD or MD panels requires a careful pre-selection of markers to make the most of these panels. If this is done poorly, or for another population (e.g., using an available panel developed for another breeding group), then the benefits of the corresponding panels will possibly be greatly reduced. This also implies that, particularly for the LD panel, the set of markers in use has to be constantly reviewed as the population changes over time or as new markers for MAS or sex determination are discovered.

Future prospects for SNP panels

As the cost and offering of these panels change constantly, we suspect that at some point we will be able to afford HD panels for a few cents (or pennies)! But before we get there, we need to make the most of our current resources, and gathering the right data for the right analysis is critical.

About the author

Dr. Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world. 

Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook Statistical Methods in Biology: Design and Analysis of Experiments and Regression.