What are A, G and H matrices and when do we use them?

27 July 2021

My colleague, Amanda Avelar de Oliveira, gave a presentation on ASRgenomics in which she talked a lot about A, G and H matrices. I was unfamiliar with these terms so afterwards I tried Googling and got absolutely nowhere! Lots of lengthy papers and highly detailed explanations, but no short and simple definitions. So, I had a chat with Amanda to try and get a straight answer. Here’s what I learned.

Jane
During your ASRgenomics [1] talk I was Googling "What is the difference between genomic and pedigree matrices". I gave up.

I imagine the explanation is something like this:

  • A matrix = we have been recording, on paper or computer, which individuals were bred together and which progeny resulted.
  • G matrix = we can look at the genomes of the individuals and see which ones are related.

But this is just my guess!

Amanda
Your definition is right, but in a very informal way. Let’s consider this passage from the book Genetic Data Analysis for Plant and Animal Breeding by Fikret Isik, James Holland and Christian Maltecca.

Traditional genetic evaluations combined the phenotypic data and resemblance coefficients between relatives to predict the genetic merit of individuals. The resemblance coefficients, derived from pedigrees, are based on probabilities that alleles are identical by descent (IBD). The resulting matrix of pairwise pedigree relationships is referred to as the A matrix because the elements in the matrix are pedigree-based estimates of additive genetic relationships.
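To make the idea of pedigree-based relationships concrete, here is a minimal Python sketch of the classic tabular method for building an A matrix. The function name, the tiny four-animal pedigree and the record layout (id, sire, dam) are all assumptions made for illustration; they do not come from the book or from ASRgenomics.

```python
def pedigree_a_matrix(pedigree):
    """Build the additive relationship (A) matrix with the tabular method.

    pedigree: list of (id, sire, dam) tuples, parents listed before their
    offspring, with None for an unknown parent (hypothetical layout).
    """
    ids = [ind for ind, _, _ in pedigree]
    index = {ind: k for k, ind in enumerate(ids)}
    n = len(ids)
    a = [[0.0] * n for _ in range(n)]
    for i, (_, sire, dam) in enumerate(pedigree):
        s = index.get(sire)  # None when the sire is unknown
        d = index.get(dam)   # None when the dam is unknown
        # Diagonal: 1 + inbreeding coefficient (half the parents' relationship).
        inbreeding = 0.5 * a[s][d] if s is not None and d is not None else 0.0
        a[i][i] = 1.0 + inbreeding
        # Off-diagonal: average of the relationships with the two parents.
        for j in range(i):
            rel = 0.0
            if s is not None:
                rel += 0.5 * a[j][s]
            if d is not None:
                rel += 0.5 * a[j][d]
            a[i][j] = a[j][i] = rel
    return ids, a


# Hypothetical pedigree: two unrelated parents and two full-sib progeny.
example_pedigree = [
    ("sire", None, None),
    ("dam", None, None),
    ("kid1", "sire", "dam"),
    ("kid2", "sire", "dam"),
]
ids, a_matrix = pedigree_a_matrix(example_pedigree)
for name, row in zip(ids, a_matrix):
    print(name, row)
```

In this toy pedigree each parent-offspring pair and the full-sib pair have an expected relationship of 0.5, which is exactly what the pedigree can tell us; it cannot say whether two particular sibs happen to share more or less than half of their genome.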

More recently, genetic markers distributed throughout the entire genome have been used to measure genetic similarities more precisely than by using pedigree information (VanRaden, 2008). Genetic markers estimate the proportion of chromosome segments shared by individuals based on the identical by state (IBS) matching of marker alleles.  The matrix of pairwise realized genomic relationships estimated from marker information is referred to as the G matrix.
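As a rough illustration of a marker-based relationship matrix, the sketch below implements one of the estimators described by VanRaden (2008): centre the marker matrix by twice the allele frequencies and scale by 2Σp(1−p). The function name and the small marker matrix (coded 0, 1 or 2 copies of an allele) are made up for this example, and the allele frequencies are simply estimated from the same sample; real pipelines also filter and check markers first.

```python
def vanraden_g_matrix(markers):
    """Genomic relationship matrix from 0/1/2 marker codes.

    markers: one row per individual, one column per marker (hypothetical data).
    """
    n = len(markers)
    m = len(markers[0])
    # Allele frequency of each marker, estimated from the sample itself.
    p = [sum(row[k] for row in markers) / (2.0 * n) for k in range(m)]
    # Centre each marker column by twice its allele frequency.
    z = [[row[k] - 2.0 * p[k] for k in range(m)] for row in markers]
    scale = 2.0 * sum(pk * (1.0 - pk) for pk in p)
    return [
        [sum(z[i][k] * z[j][k] for k in range(m)) / scale for j in range(n)]
        for i in range(n)
    ]


# Hypothetical genotypes: individuals 0 and 1 are identical at every marker.
example_markers = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [1, 0, 1, 2],
]
g_matrix = vanraden_g_matrix(example_markers)
for row in g_matrix:
    print([round(v, 3) for v in row])
```

Individuals 0 and 1 share every marker allele, so their genomic relationship equals their own diagonal value, while individual 2, who carries the opposite alleles at two markers, gets a negative relationship with them.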

Jane
Ok, so if we can get genomic relationship information, why would we even bother with the A matrix any more (unless we do not have access to DNA data)? Why does it make sense to combine the A and G matrices to create an H matrix? Why not just chuck out the A matrix?

Amanda 
Basically, to get genotypic information we need to genotype the individuals in our population, which is really expensive! Pedigree information can be easier and cheaper to obtain. In animal breeding, for example, breeders have very good pedigree information - they basically know the whole ancestry of an individual for generations! In plants, it is less common to have a good pedigree, especially for annual crops.

Jane
So, the pedigree information is still valuable, particularly if you only have partial genomic information.

Amanda
Absolutely. And when you trust your pedigree, you can also use it to identify possible mistakes in your genomic data.

Jane
Ooer, how can genomic data be mistaken? Perhaps the person collecting the data mixed up the samples, for example?

Amanda
Yes. Or even errors in the lab, contamination by pollen or semen, and so on. There are many possible sources of error.

Jane
So, the final H matrix, combining the A and G matrices is useful...how?

Amanda
To ensure good-quality genomic data, individuals need to be genotyped with high coverage. This is a very expensive process, and generally people cannot afford to do much of it. The H matrix is useful because it combines the information from the pedigree with genomic marker information, enabling you to use all the information on genetic similarity you have. For example, if you only have money to genotype 100 individuals, but you have pedigree (and phenotype) information on 1000 individuals, you can run your analysis on all 1000 individuals by using an H matrix to combine the genomic and pedigree information. We call it the H matrix because it’s a hybrid of the pedigree-based A matrix and the genomic-based G matrix.
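For readers who like to see the arithmetic, here is a small Python sketch of one commonly cited way to build a hybrid matrix (the single-step form of Legarra et al., 2009), in which the genotyped block of A is replaced by G and the difference is propagated to the non-genotyped individuals through the pedigree. This is an assumption about the construction, not a description of what ASRgenomics does internally; the function names and the four-animal example are hypothetical.

```python
def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def inverse(x):
    """Gauss-Jordan inverse with partial pivoting (fine for small examples)."""
    n = len(x)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(x)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def hybrid_h_matrix(a, g, genotyped):
    """Combine A (all individuals) with G (genotyped individuals only).

    genotyped: row/column indices of A for the genotyped individuals,
    in the same order as the rows of G.
    """
    n = len(a)
    geno = list(genotyped)
    non = [i for i in range(n) if i not in geno]
    a11 = [[a[i][j] for j in non] for i in non]
    a12 = [[a[i][j] for j in geno] for i in non]
    a22 = [[a[i][j] for j in geno] for i in geno]
    proj = matmul(a12, inverse(a22))            # A12 * inverse(A22)
    proj_t = [list(col) for col in zip(*proj)]  # inverse(A22) * A21
    diff = [[g[i][j] - a22[i][j] for j in range(len(geno))]
            for i in range(len(geno))]
    adjust = matmul(matmul(proj, diff), proj_t)
    h12 = matmul(proj, g)
    h = [[0.0] * n for _ in range(n)]
    for bi, i in enumerate(non):
        for bj, j in enumerate(non):
            h[i][j] = a11[bi][bj] + adjust[bi][bj]
        for bj, j in enumerate(geno):
            h[i][j] = h[j][i] = h12[bi][bj]
    for bi, i in enumerate(geno):
        for bj, j in enumerate(geno):
            h[i][j] = g[bi][bj]
    return h


# Hypothetical example: two parents (not genotyped) and two genotyped full sibs.
a = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, 0.5],
     [0.5, 0.5, 1.0, 0.5],
     [0.5, 0.5, 0.5, 1.0]]
h_same = hybrid_h_matrix(a, [[1.0, 0.5], [0.5, 1.0]], genotyped=[2, 3])
h_new = hybrid_h_matrix(a, [[1.0, 0.6], [0.6, 1.0]], genotyped=[2, 3])
for row in h_new:
    print([round(v, 3) for v in row])
```

When the markers agree exactly with the pedigree, H is simply A. When the markers say the two sibs are a little more alike than expected (0.6 rather than 0.5), that extra information also nudges the relationships involving their non-genotyped parents.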

Jane
Great! That makes sense. Thank you so much.

Amanda
No problem.

A matrix: contains pedigree-based estimates of additive genetic relationships.

G matrix: contains realized genomic relationships estimated from marker information.

H matrix: contains combined estimates of the genetic relationships from the pedigree and genomic (i.e., marker) information.

[1]  ASRgenomics is an R package that provides a series of molecular and genetic routines to assist in analytical pipelines for Genomic Selection and/or Genome-Wide Association Studies (GWAS). It can be used both before and after genomic analysis by ASReml-R (or another R library). The main routines included are used for:

  • Preparing and exploring the phenotypic and genetic data.
  • Generating the genomic-based matrix (G) (and its inverse).
  • Tuning up the genomic matrix and preparing it for downstream analyses.
  • Generating and exploring the hybrid genomic matrix (H).

About the authors

Amanda Avelar de Oliveira is an Agronomist with an M.Sc. and Ph.D. in Genetics and Plant Breeding from the University of São Paulo (ESALQ/USP). She has experience in quantitative genetics, genomic prediction, field trial analysis and genotyping pipelines. Currently, she works as a consultant at VSN International, UK.

“I believe in the power of knowledge sharing and multidisciplinary efforts to increase genetic gains in plant breeding while ensuring sustainability in agriculture”.

Jane Cohen is a technical author with a bachelor's degree in information technology and graduate diploma in technical communication. She is a self-confessed science nerd with a keen interest in biology and statistics. When she's not honing her DIY skills she can generally be found either with a book in hand or perambulating in the countryside.

Related Reads

The VSNi Team

04 May 2021

What is a p-value?

A way to decide whether to reject the null hypothesis (H0) against our alternative hypothesis (H1) is to determine the probability of obtaining a test statistic at least as extreme as the one observed under the assumption that H0 is true. This probability is referred to as the “p-value”. It plays an important role in statistics and is critical in most biological research.

What is the true meaning of a p-value and how should it be used?

P-values are a continuum (between 0 and 1) that provide a measure of the strength of evidence against H0. For example, a p-value of 0.066 indicates that, if H0 is true, there is a 6.6% probability of observing a test statistic as large as or larger than the one we calculated. Note that this p-value is NOT the probability that our alternative hypothesis is correct; it is only a measure of how likely or unlikely we are to observe such extreme events, under repeated sampling, in reference to our calculated value. Also note that this p-value is obtained based on an assumed distribution (e.g., t-distribution for a t-test); hence, the p-value will depend strongly on your (correct or incorrect) assumptions.
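The calculation behind a p-value can be sketched in a few lines of Python. The example below assumes a two-sided test whose test statistic follows a standard normal distribution under H0; the function name and the test statistic of 1.84 are hypothetical, chosen only because they give a p-value close to the 0.066 used above.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability, under H0, of a statistic at least as extreme as z.

    Assumes the test statistic is standard normal when H0 is true.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


observed_z = 1.84  # hypothetical test statistic
p = two_sided_p_value(observed_z)
print(f"z = {observed_z}, p-value = {p:.3f}")
```

If a different reference distribution were assumed (for example a t-distribution with few degrees of freedom), the same test statistic would give a different p-value, which is why the distributional assumption matters.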

The smaller the p-value, the stronger the evidence for rejecting H0. However, it is difficult to determine what a small value really is. This has led to typical guidelines (p < 0.001 indicating very strong evidence against H0, p < 0.01 strong evidence, p < 0.05 moderate evidence, p < 0.1 weak evidence or a trend, and p ≥ 0.1 insufficient evidence [1]) and to a strong debate on what the threshold should be. But declaring p-values as being either significant or non-significant based on an arbitrary cut-off (e.g. 0.05 or 5%) should be avoided. As Ronald Fisher said:

“No scientific worker has a fixed level of significance at which, from year to year, and in all circumstances he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” [2].

A very important aspect of the p-value is that it does not provide any evidence in support of H0 – it only quantifies evidence against H0. That is, a large p-value does not mean we can accept H0. Take care not to fall into the trap of accepting H0! Similarly, a small p-value tells you that rejecting H0 is plausible, and not that H1 is correct!

For useful conclusions to be drawn from a statistical analysis, p-values should be considered alongside the size of the effect. Confidence intervals are commonly used to describe the size of the effect and the precision of its estimate. Crucially, statistical significance does not necessarily imply practical (or biological) significance. Small p-values can come from a large sample and a small effect, or a small sample and a large effect.

It is also important to understand that the size of a p-value depends critically on the sample size (as this affects the shape of our distribution). With a very large sample size, H0 may always be rejected, even for extremely small differences and even if H0 is nearly (i.e., approximately) true. Conversely, with a very small sample size, it may be nearly impossible to reject H0 even if we observe extremely large differences. Hence, p-values also need to be interpreted in relation to the size of the study.
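The effect of sample size is easy to see numerically. The sketch below keeps a small, fixed difference between two group means (0.1 standard deviations, a made-up value) and only changes the number of observations per group; it assumes a known standard deviation and a normal reference distribution, so it is a simplified z-test rather than a full t-test.

```python
from math import sqrt
from statistics import NormalDist


def p_value_for_mean_difference(diff, sd, n_per_group):
    """Two-sided p-value for a difference between two group means.

    Simplified z-test: assumes a known, common standard deviation.
    """
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = diff / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


sample_sizes = (10, 100, 1000, 10000)
p_values = [p_value_for_mean_difference(0.1, 1.0, n) for n in sample_sizes]
for n, p in zip(sample_sizes, p_values):
    print(f"n per group = {n:>5}: p-value = {p:.4g}")
```

The observed difference never changes, yet the p-value moves from clearly non-significant to vanishingly small as the groups grow, which is why the effect size should always be reported alongside it.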

References

[1] Ganesh H. and V. Cave. 2018. P-values, P-values everywhere! New Zealand Veterinary Journal. 66(2): 55-56.

[2] Fisher RA. 1956. Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh, UK.

Dr. Vanessa Cave

10 May 2022

The essential role of statistical thinking in animal ethics: dealing with reduction

Having spent over 15 years working as an applied statistician in the biosciences, I’ve come across my fair share of animal studies. And one of my greatest bugbears is that the full value is rarely extracted from the experimental data collected. This could be because the best statistical approaches haven’t been employed to analyse the data, the findings are selectively or incorrectly reported, other research programmes that could benefit from the data don’t have access to it, or the data aren’t re-analysed following the advent of new statistical methods or tools that have the potential to draw greater insights from it.


An enormous number of scientific research studies involve animals, and with this come many ethical issues and concerns. To help ensure high standards of animal welfare in scientific research, many governments, universities, R&D companies, and individual scientists have adopted the principles of the 3Rs: Replacement, Reduction and Refinement. Indeed, in many countries the tenets of the 3Rs are enshrined in legislation and regulations around the use of animals in scientific research.

Replacement

Use methods or technologies that replace or avoid the use of animals.

Reduction

Limit the number of animals used.

Refinement

Refine methods in order to minimise or eliminate negative animal welfare impacts.

In this blog, I’ll focus on the second principle, Reduction, and argue that statistical expertise is absolutely crucial for achieving reduction.

The aim of reduction is to minimise the number of animals used in scientific research whilst balancing against any additional adverse animal welfare impacts and without compromising the scientific value of the research. This principle demands that before carrying out an experiment (or survey) involving animals, the researchers must consider and implement approaches that both:

  1. Minimise their current animal use – the researchers must consider how to minimise the number of animals in their experiment whilst ensuring sufficient data are obtained to answer their research questions, and
  2. Minimise future animal use – the researchers need to consider how to maximise the information obtained from their experiment in order to potentially limit, or avoid, the subsequent use of additional animals in future research.

Both these considerations involve statistical thinking. Let’s begin by exploring the important role statistics plays in minimising current animal use.

Statistical aspects to minimise current animal use

Reduction requires that any experiment (or survey) carried out must use as few animals as possible. However, with too few animals the study will lack the statistical power to draw meaningful conclusions, ultimately wasting animals. But how do we determine how many animals are needed for a sufficiently powered experiment? The necessary starting point is to establish clearly defined, specific research questions. These can then be formulated into appropriate statistical hypotheses, for which an experiment (or survey) can be designed. 

Statistical expertise in experimental design plays a pivotal role in ensuring enough of the right type of data are collected to answer the research questions as objectively and as efficiently as possible. For example, sophisticated experimental designs involving blocking can be used to reduce random variation, making the experiment more efficient (i.e., increasing the statistical power with fewer animals) as well as guarding against bias. Once a suitable experimental design has been decided upon, a power analysis can be used to calculate the required number of animals (i.e., determine the sample size). Indeed, a power analysis is typically needed to obtain animal ethics approval - a formal process in which the benefits of the proposed research are weighed up against the likely harm to the animals.
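To give a feel for what a power analysis involves, here is a minimal Python sketch for the simplest possible case: comparing the means of two equally sized groups. It uses the standard normal-approximation formula n = 2(z₁₋α/₂ + z_power)²σ²/δ² per group; the function name and the input values are hypothetical. An exact calculation based on the t-distribution gives slightly larger numbers, and designs with blocking or repeated measures need a more tailored calculation.

```python
from math import ceil
from statistics import NormalDist


def animals_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-group comparison.

    delta: smallest difference between group means worth detecting
    sd:    expected standard deviation of the response within a group
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    n = 2.0 * (z_alpha + z_power) ** 2 * sd ** 2 / delta ** 2
    return ceil(n)


# Hypothetical scenarios: detect a difference of 1 SD, then of 0.5 SD.
n_large_effect = animals_per_group(delta=1.0, sd=1.0)
n_small_effect = animals_per_group(delta=0.5, sd=1.0)
print("Animals per group to detect 1.0 SD:", n_large_effect)
print("Animals per group to detect 0.5 SD:", n_small_effect)
```

Halving the difference to be detected roughly quadruples the number of animals required, which is why a clearly defined, biologically meaningful effect size is the essential first input.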

Researchers also need to investigate whether pre-existing sources of information or data could be integrated into their study (for example, by means of a meta-analysis), enabling them to reduce the number of animals required. At the extreme end, data relevant to the research questions may already be available, eradicating the need for an experiment altogether!

Statistical aspects to minimise future animal use: doing it right the first time

An obvious mechanism for minimising future animal use is to ensure we do it right the first time, avoiding the need for additional experiments. This is easier said than done; there are many statistical and practical considerations at work here. The following paragraphs cover four important steps in experimental research in which statistical expertise plays a major role: data acquisition, data management, data analysis and inference.

Above, I alluded to the validity of the experimental design. If the design is flawed, the data collected will be compromised, if not essentially worthless. Two common mistakes to avoid are pseudo-replication and the lack of (or poor) randomisation. Replication and randomisation are two of the basic principles of good experimental design. Confusing pseudo-replication (either at the design or analysis stage) for genuine replication will lead to invalid statistical inferences. Randomisation is necessary to ensure the statistical inference is valid and for guarding against bias. 
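As a small illustration of randomisation within blocks, the Python sketch below allocates treatments at random within each block of a randomised complete block design, so that every treatment appears exactly once per block. The function name, the treatment labels and the idea of using pens as blocks are made up for this example.

```python
import random


def randomised_complete_block(treatments, n_blocks, seed=None):
    """Randomly order the treatments independently within each block.

    Returns a list of (block number, treatment order) pairs.
    """
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    layout = []
    for block in range(1, n_blocks + 1):
        order = list(treatments)
        rng.shuffle(order)
        layout.append((block, order))
    return layout


# Hypothetical trial: three diets allocated to animals within four pens.
diets = ["control", "diet_A", "diet_B"]
design = randomised_complete_block(diets, n_blocks=4, seed=2022)
for block, order in design:
    print(f"Block {block}: {order}")
```

Recording the seed alongside the design means the allocation can be reproduced and audited later, which also helps with the data management issues discussed below.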

Another extremely important consideration when designing an experiment, and setting the sample size, is the risk and impact of missing data due, for example, to animal drop-out or equipment failure. Missing data result in a loss of statistical power, complicate the statistical analysis, and have the potential to cause substantial bias (and potentially invalidate any conclusions). Careful planning and management of an experiment will help minimise the amount of missing data. In addition, safeguards, controls or contingencies could be built into the experimental design to help mitigate the impact of missing data. If missing data do result, appropriate statistical methods to account for them must be applied. Failure to do so could invalidate the entire study.

It is also important that the right data are collected to answer the research questions of interest. That is, the right response and explanatory variables measured at the appropriate scale and frequency. There are many statistics-related questions the researchers must answer, including: What population do they want to make inference about? How generalisable do they need their findings to be? What controllable and uncontrollable variables are there? Answers to these questions affect not only the enrolment of animals into the study, but also the conditions they are subjected to and the data that should be collected.

It is essential that the data from the experiment (including meta-data) are appropriately managed and stored to protect their integrity and ensure their usability. If the data get messed up (e.g., if different variables measured on the same animal cannot be linked), are undecipherable (e.g., if the attributes of the variables are unknown) or are incomplete (e.g., if the observations aren’t linked to the structural variables associated with the experimental design), they are likely worthless. Statisticians can offer invaluable expertise in good data management practices, helping to ensure the data are accurately recorded, the downstream results from analysing the data are reproducible and the data themselves are reusable at a later date, possibly by a different group of researchers.

Unsurprisingly, it is also vitally important that the data are analysed correctly, using the methods that draw the most value from it. As expected, statistical expertise plays a huge role here! The results and inference are meaningful only if appropriate statistical methods are used. Moreover, often there is a choice of valid statistical approaches; however, some approaches will be more powerful or more precise than others. 

Having analysed the data, it is important that the inference (or conclusions) drawn are sound. Again, statistical thinking is crucial here. For example, in my experience, one all too common mistake in animal studies is to accept the null hypothesis and erroneously claim that a non-significant result means there is no difference (say, between treatment means). 

Statistical aspects to minimise future animal use: sharing the value from the experiment

The other important mechanism for minimising future animal use is to share the knowledge and information gleaned. The most basic step here is to ensure that all the results are correctly and non-selectively reported. Reporting all aspects of the trial, including the experimental design and statistical analysis, accurately and completely is crucial for the wider interpretation of the findings, reproducibility and repeatability of the research, and for scientific scrutiny. In addition, all results, including null results, are valuable and should be shared. 

Sharing the data (or resources, e.g., animal tissues) also contributes to reduction. The data may be able to be re-used for a different purpose, integrated with other sources of data to provide new insights, or re-analysed in the future using a more advanced statistical technique, or for a different hypothesis. 

Statistical aspects to minimise future animal use: maximising the information obtained from the experiment

Another avenue that should also be explored is whether additional data or information can be obtained from the experiment, without incurring any further adverse animal welfare impacts, that could benefit other researchers and/or future studies, for example by helping to address a different research question now or in the future. At the outset of the study, researchers must consider whether their proposed study could be combined with another one, whether the research animals could be shared with another experiment (e.g., animals euthanized for one experiment may provide suitable tissue for use in another), what additional data could be collected that may be (or are!) of future use, etc.

Statistical thinking clearly plays a fundamental role in reducing the number of animals used in scientific research, and in ensuring the most value is drawn from the resulting data. I strongly believe that, to achieve reduction, statistical expertise must be fully utilised throughout the duration of the project, from design through to analysis and dissemination of results, in all research projects involving animals. In my experience, most researchers strive for very high standards of animal ethics, and absolutely do not want to cause unnecessary harm to animals. Unfortunately, the role statistical expertise plays here is not always appreciated or taken advantage of. So next time you’re thinking of undertaking research involving animals, ensure you have expert statistical input!

About the author

Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems.  Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.

Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.

Kanchana Punyawaew and Dr. Vanessa Cave

01 March 2021

Mixed models for repeated measures and longitudinal data

The term "repeated measures" refers to experimental designs or observational studies in which each experimental unit (or subject) is measured repeatedly over time or space. "Longitudinal data" is a special case of repeated measures in which variables are measured over time (often for a comparatively long period of time) and duration itself is typically a variable of interest.

In terms of data analysis, it doesn’t really matter what type of data you have, as you can analyze both using mixed models. Remember, the key feature of both types of data is that the response variable is measured more than once on each experimental unit, and these repeated measurements are likely to be correlated.

Mixed Model Approaches

To illustrate the use of mixed model approaches for analyzing repeated measures, we’ll examine a data set from Landau and Everitt’s 2004 book, “A Handbook of Statistical Analyses using SPSS”. Here, a double-blind, placebo-controlled clinical trial was conducted to determine whether an estrogen treatment reduces post-natal depression. Sixty-three subjects were randomly assigned to one of two treatment groups: placebo (27 subjects) and estrogen treatment (36 subjects). Depression scores were measured on each subject at baseline, i.e. before randomization (predep), and at six two-monthly visits after randomization (postdep at visits 1-6). However, not all the women in the trial had their depression score recorded on all scheduled visits.

In this example, the data were measured at fixed, equally spaced, time points. (Visit is time as a factor and nVisit is time as a continuous variable.) There is one between-subject factor (Group, i.e. the treatment group, either placebo or estrogen treatment), one within-subject factor (Visit or nVisit) and a covariate (predep).

Using the following plots, we can explore the data. In the first plot below, the depression scores for each subject are plotted against time, including the baseline, separately for each treatment group.

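[Figure: depression scores for each subject plotted against time, including baseline, separately for each treatment group]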

In the second plot, the mean depression score for each treatment group is plotted over time. From these plots, we can see that there is variation among subjects within each treatment group, that depression scores generally decrease with time, and that on average the depression score at each visit is lower with the estrogen treatment than with the placebo.

[Figure: mean depression score for each treatment group plotted over time]

Random effects model

The simplest approach for analyzing repeated measures data is to use a random effects model with subject fitted as random. It assumes a constant correlation between all observations on the same subject. The analysis objectives can either be to measure the average treatment effect over time or to assess treatment effects at each time point and to test whether treatment interacts with time.

In this example, the treatment (Group), time (Visit), treatment by time interaction (Group:Visit) and baseline (predep) effects can all be fitted as fixed. The subject effects are fitted as random, allowing for constant correlation between depression scores taken on the same subject over time.

The code and output from fitting this model in ASReml-R 4 follow:

[Figures: ASReml-R code, summary() output and wald.asreml() table for the random effects model]

The output from summary() shows that the estimates of the subject and residual variances from the model are 15.10 and 11.53, respectively, giving a total variance of 15.10 + 11.53 = 26.63. The Wald tests (from the wald.asreml() table) for predep, Group and Visit are significant (probability level (Pr) ≤ 0.01). There is no evidence of an interaction between treatment group and time (Group:Visit), i.e. the probability level is greater than 0.05 (Pr = 0.8636).

Covariance model

In practice, the correlation between observations on the same subject is often not constant. It is common to expect that measurements made closer together in time are more highly correlated than those made further apart. Mixed models can accommodate many different covariance patterns. Ideally, we select the pattern that best reflects the true covariance structure of the data. A typical strategy is to start with a simple pattern, such as compound symmetry or first-order autoregressive, and test if a more complex pattern leads to a significant improvement in the likelihood.

Note: using a covariance model with a simple correlation structure (i.e. uniform) will provide the same results as fitting a random effects model with random subject.

In ASReml-R 4 we use the corv() function on time (i.e. Visit) to specify uniform correlation between depression scores taken on the same subject over time.

[Figure: ASReml-R code and output for the uniform correlation (corv) model]

Here, the estimate of the correlation among times (Visit) is 0.57 and the estimate of the residual variance is 26.63 (identical to the total variance of the random effects model, asr1).
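The link between the two models can be checked with a little arithmetic. The Python sketch below takes the variance estimates reported for the random effects model (15.10 and 11.53) and shows that they imply the same total variance and the same constant correlation as the uniform correlation model; the variable names are made up for illustration.

```python
# Variance estimates reported for the random effects model (asr1).
subject_variance = 15.10
residual_variance = 11.53

# Total variance of a single observation.
total_variance = subject_variance + residual_variance

# Correlation between any two observations on the same subject.
within_subject_correlation = subject_variance / total_variance

# Implied (compound symmetry) covariance matrix for the six visits:
# the total variance on the diagonal, the subject variance elsewhere.
n_visits = 6
covariance = [
    [total_variance if i == j else subject_variance for j in range(n_visits)]
    for i in range(n_visits)
]

print(f"Total variance: {total_variance:.2f}")
print(f"Within-subject correlation: {within_subject_correlation:.2f}")
```

So the correlation of 0.57 is simply the share of the total variance that is due to differences between subjects.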

Specifying a heterogeneous first-order autoregressive covariance structure is easily done in ASReml-R 4 by changing the variance-covariance function in the residual term from corv() to ar1h().

[Figure: ASReml-R code and output for the heterogeneous first-order autoregressive (ar1h) model]
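To show what a heterogeneous first-order autoregressive structure looks like, the Python sketch below builds the covariance matrix by hand: each visit has its own variance, and the correlation between two visits is ρ raised to the power of the number of visits separating them. The function name, the six variances and the value of ρ are purely illustrative; they are not estimates from the depression data.

```python
def ar1h_covariance(variances, rho):
    """Heterogeneous AR(1) covariance matrix.

    variances: one variance per time point
    rho:       correlation between adjacent time points
    """
    sds = [v ** 0.5 for v in variances]
    n = len(sds)
    return [
        [sds[i] * sds[j] * rho ** abs(i - j) for j in range(n)]
        for i in range(n)
    ]


# Illustrative values only: variances shrinking over six visits, rho = 0.7.
visit_variances = [30.0, 28.0, 25.0, 22.0, 20.0, 18.0]
cov = ar1h_covariance(visit_variances, rho=0.7)
for row in cov:
    print([round(v, 2) for v in row])
```

Unlike the uniform structure, the covariances here fade as visits get further apart, and the diagonal is free to differ from visit to visit.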

Random coefficients model

When the relationship of a measurement with time is of interest, a random coefficients model is often appropriate. In a random coefficients model, time is considered a continuous variable, and the subject and subject by time interaction (Subject:nVisit) are fitted as random effects. This allows the slopes and intercepts to vary randomly between subjects, resulting in a separate regression line being fitted for each subject. Importantly, however, the slopes and intercepts are correlated.

The str() function in the asreml() call is used to fit a random coefficients model:

[Figure: ASReml-R code and output for the random coefficients model]

The summary table contains the variance parameter for Subject (the set of intercepts, 23.24) and Subject:nVisit (the set of slopes, 0.89), the estimate of correlation between the slopes and intercepts (-0.57) and the estimate of residual variance (8.38).
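These estimates can be turned into an implied variance for an observation at any given time. The Python sketch below uses the standard random coefficients result Var(y at time t) = σ²(intercept) + 2t·cov(intercept, slope) + t²·σ²(slope) + σ²(residual), with the covariance recovered from the reported correlation. The variable and function names are hypothetical, and the sketch assumes time is measured on the same scale as nVisit in the fitted model.

```python
from math import sqrt

# Estimates reported in the summary table.
intercept_variance = 23.24
slope_variance = 0.89
intercept_slope_correlation = -0.57
residual_variance = 8.38

# Covariance between a subject's intercept and slope.
intercept_slope_covariance = intercept_slope_correlation * sqrt(
    intercept_variance * slope_variance
)


def implied_variance(t):
    """Variance of an observation at time t implied by the fitted model."""
    return (
        intercept_variance
        + 2.0 * t * intercept_slope_covariance
        + t ** 2 * slope_variance
        + residual_variance
    )


print(f"Intercept-slope covariance: {intercept_slope_covariance:.3f}")
for t in range(0, 7):
    print(f"time {t}: implied variance = {implied_variance(t):.2f}")
```

The negative correlation means that subjects who start with higher depression scores tend to have steeper declines, so the implied variance initially falls with time before the slope variance begins to dominate.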

References

Brady T. West, Kathleen B. Welch and Andrzej T. Galecki (2007). Linear Mixed Models: A Practical Guide Using Statistical Software. Chapman & Hall/CRC, Taylor & Francis Group, LLC.

Brown, H. and R. Prescott (2015). Applied Mixed Models in Medicine. Third Edition. John Wiley & Sons Ltd, England.

Sabine Landau and Brian S. Everitt (2004). A Handbook of Statistical Analyses using SPSS. Chapman & Hall/CRC Press LLC.