
Welcome to the VSNi Blog

We've collated a number of papers to illustrate how VSNi software is supporting research across the globe.

21 hours ago
What are A, G and H matrices and when do we use them?

> _My colleague, Amanda Avelar de Oliveira, gave a presentation on [ASRgenomics](https://asreml.kb.vsni.co.uk/asrgenomics-download-success/) in which she talked a lot about A, G and H matrices. I was unfamiliar with these terms so afterwards I tried Googling and got absolutely nowhere! Lots of lengthy papers and highly detailed explanations, but no short and simple definitions. So, I had a chat with Amanda to try and get a straight answer. Here's what I learned._

**Jane** During your [ASRgenomics](https://asreml.kb.vsni.co.uk/asrgenomics-download-success/)[\[1\]](#_ftn1) talk I was Googling "What is the difference between genomic and pedigree matrices?". I gave up. I imagine the explanation is something like this:

* G matrix = we can look at the genomes of the individuals and see which ones are related
* A matrix = we have been recording, on paper or computer, which individuals were bred together and which progeny resulted.

But this is just my guess!

**Amanda** Your definition is right, but in a very informal way. Let's consider this paragraph from the book _Genetic Data Analysis for Plant and Animal Breeding_ by Fikret Isik, James Holland and Christian Maltecca:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_matrices_Fiskret_1b338035ec.jpg)

> Traditional genetic evaluations combined the phenotypic data and resemblance coefficients between relatives to predict the genetic merit of individuals. The resemblance coefficients, derived from pedigrees, are based on probabilities that alleles are identical by descent (IBD). The resulting matrix of pairwise pedigree relationships is referred to as the A matrix because the elements in the matrix are pedigree-based estimates of additive genetic relationships. More recently, genetic markers distributed throughout the entire genome have been used to measure genetic similarities more precisely than by using pedigree information ([VanRaden, 2008](https://www.sciencedirect.com/science/article/pii/S0022030208709901)). Genetic markers estimate the proportion of chromosome segments shared by individuals based on the identical by state (IBS) matching of marker alleles. The matrix of pairwise realized genomic relationships estimated from marker information is referred to as the G matrix.

> A [DNA](https://en.wikipedia.org/wiki/DNA) segment in two or more individuals is
> * [**identical by state**](https://en.wikipedia.org/wiki/Identity_by_type) **(IBS)** if the [nucleotide sequences](https://en.wikipedia.org/wiki/Nucleotide_sequences) are identical in this segment
> * **identical by descent (IBD)** if it has been inherited from a common ancestor without [recombination](https://en.wikipedia.org/wiki/Genetic_recombination)

**Jane** OK, so why would we even bother with the A matrix any more (unless we do not have access to DNA data)? Why does it make sense to combine the A and G matrices to create an H matrix? Why not just chuck out the A matrix?

**Amanda** Basically, to get genotypic information we need to genotype the individuals in our population, which is really expensive! Pedigree information can be easier and cheaper to obtain. In animal breeding, for example, breeders have very good pedigree information - they basically know the whole ancestry of an individual. In plants, it is less common to have a good pedigree, especially for annual crops.

**Jane** So, the pedigree information is still valuable, particularly if you only have partial genomic information.

**Amanda** Absolutely. And when you trust your pedigree, you can also use it to identify possible mistakes in your genomic data.

**Jane** Ooer, how can genomic data be mistaken? Perhaps the person collecting the data mixed up the samples, for example?

**Amanda** Yes. Or even errors in the lab.

**Jane** So, the final H matrix, combining the A and G matrices, is useful... how?

**Amanda** To ensure good-quality genomic data, individuals need to be genotyped at high coverage. This is a very expensive process, and generally people cannot afford to do much of it. The H matrix is useful because it combines the information from the pedigree with genomic marker information, enabling you to use all the information on genetic similarity you have. For example, if you only have money to genotype 100 individuals, but you have the pedigree of 1000 individuals, you can run your analysis on all 1000 individuals by using an H matrix to combine the genomic and pedigree information. We call it the H matrix because it's a _hybrid_ of the pedigree-based A matrix and the genomic-based G matrix.

**Jane** Great! That makes sense. Thank you so much.

**Amanda** No problem.

> **A matrix:** contains pedigree-based estimates of additive genetic relationships
>
> **G matrix:** contains realized genomic relationships estimated from marker information
>
> **H matrix:** combined estimates of the genetic relationships from the pedigree and genomic (i.e., marker) information.

(A small numerical sketch of how A and G matrices can be built is given after the author note below.)

[\[1\]](#_ftnref1) [ASRgenomics](https://asreml.kb.vsni.co.uk/asrgenomics-download-success/) is an R package that provides a series of molecular and genetic routines to assist in analytical pipelines for Genomic Selection and/or Genome-Wide Association Studies (GWAS). It can be used both before and after genomic analysis by [ASReml-R](https://asreml.kb.vsni.co.uk) (or another R library). The main routines included are used for:

* Preparing and exploring the pedigree, phenotypic and genetic data.
* Generating the pedigree-based kinship matrix (A).
* Generating the (inverse) genomic-based matrix (G).
* Generating and exploring the hybrid genomic matrix (H).
* Extracting genomic breeding values.

#### **About the author**

Amanda Avelar de Oliveira is an Agronomist with an M.Sc. and Ph.D. in Genetics and Plant Breeding from the University of São Paulo (ESALQ/USP). She has experience in quantitative genetics, genomic prediction, field trial analysis and genotyping pipelines. Currently, she works as a consultant at VSN International, UK.

> "I believe in the power of knowledge sharing and multidisciplinary efforts to increase genetic gains in plant breeding while ensuring sustainability in agriculture."
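To make the definitions above concrete, here is a minimal Python sketch (not ASRgenomics code; ASRgenomics provides these routines in R) that builds an A matrix for a toy pedigree with the tabular method and a VanRaden-style G matrix from a small, made-up marker matrix. The pedigree, marker genotypes and individual numbering are all hypothetical.

```python
import numpy as np

# --- A matrix: pedigree-based additive relationships (tabular method) ---
# Hypothetical pedigree: individual -> (sire, dam); 0 means unknown parent.
# Parents must be numbered before their offspring.
pedigree = {1: (0, 0), 2: (0, 0), 3: (1, 2), 4: (1, 2), 5: (3, 4)}
n = len(pedigree)
A = np.zeros((n, n))
for i in range(1, n + 1):
    s, d = pedigree[i]
    # Diagonal: 1 + half the relationship between the two parents (inbreeding).
    A[i-1, i-1] = 1.0 + (0.5 * A[s-1, d-1] if s and d else 0.0)
    for j in range(1, i):
        # Off-diagonal: average of the relationships with the two parents.
        a_js = A[j-1, s-1] if s else 0.0
        a_jd = A[j-1, d-1] if d else 0.0
        A[i-1, j-1] = A[j-1, i-1] = 0.5 * (a_js + a_jd)

# --- G matrix: VanRaden (2008) genomic relationships from marker data ---
# Hypothetical SNP matrix M: rows = individuals, columns = markers,
# genotypes coded 0/1/2 (count of the alternative allele).
M = np.array([[0, 1, 2, 1],
              [2, 1, 0, 1],
              [1, 1, 1, 2],
              [0, 2, 1, 1],
              [1, 0, 2, 2]], dtype=float)
p = M.mean(axis=0) / 2.0          # allele frequency per marker
Z = M - 2.0 * p                   # centre genotypes by twice the frequency
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

print("A (pedigree-based):\n", np.round(A, 3))
print("G (marker-based):\n", np.round(G, 3))
```

The H matrix then blends the two: individuals with marker data contribute through G, while their non-genotyped relatives are tied in through the pedigree relationships in A.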


Amanda Avelar de Oliveira PhD

7 days ago
With the release of ASReml 4.2, has the best just got better?

Linear mixed models are broadly used to analyse different types of data produced by a wide variety of sectors, e.g., animal, plant and aqua breeding, agriculture, environmental and medical sciences. These models provide a rich and flexible tool for modelling the complex variance-covariance structure in the data, allowing for different sources of variation. [ASReml](https://www.vsni.co.uk/software/asreml) is a comprehensive and powerful statistical software package specially designed for linear mixed model analysis that is used across all fields of scientific research around the world. It has a strong theoretical and statistical background, providing reliable estimation and inference. ASReml fits linear mixed models using Residual Maximum Likelihood (REML), and its REML routine produces parameter estimates that are consistent and efficient. ASReml uses the Average Information (AI) algorithm and sparse matrix methods, which enable it to rapidly solve large numbers of mixed model equations. (A small illustration of a variance-component model fitted by REML, outside ASReml, is sketched after this post.)

#### **The scientists' choice for analysis of linear mixed models**

ASReml has become the default software for analysis of linear mixed models by scientists like me. It has been cited thousands of times in scientific publications and is widely used by many industries for their commercial operations. Data analysts choose ASReml because it is faster and computationally more efficient than other software packages for solving mixed model equations. Its flexible syntax, and the wide range of variance model structures offered, enable ASReml users to analyse complex data, such as multi-environment and multi-trait data, and to accommodate pedigree and molecular information easily. It also allows for the analysis of large and messy data sets and makes fitting complex linear mixed models possible and easy.

ASReml provides theoretically advanced approaches for the mixed model analysis of Normal and discrete response variables, as well as analysis of correlated data, repeated measures, multivariate analyses, and complex experimental designs with balanced or unbalanced datasets. Additionally, ASReml's user-friendly interface makes it easy to run analyses and obtain useful results quickly.

#### **More data requires more processing capability**

The amount of data generated by scientists nowadays is much larger than decades ago, and statistical software needs to evolve to keep up with this increase. New technologies have substantially changed the way data are generated. Data is the fuel that drives many important decisions in our society, and our lives can be transformed by having the right tools and software to analyse data and extract its full potential.

The new ASReml 4.2 release brings much more power to the analysis of large datasets, and other improved features to help its users. The most significant changes are an increase in available memory, up from 32 to 96 gigabytes of workspace, and a reorganization of some core routines which enables ASReml 4.2 to run substantially faster. The memory increase allows users to analyse larger problems in less time. On multi-user systems, memory efficiency is maximised by allowing each user to specify the amount of memory needed for their current session (the maximum allocated depends on the machine's availability). Regarding the reorganization of core routines, there are 10 areas where ASReml 4.2 is faster than the previous 4.1 version released in 2015. For instance, in jobs with relatively dense G matrices, computation times are often reduced by more than 40%.

#### **Multi-thread parallel processing for faster results**

ASReml 4.2 has been optimized in several areas for multi-threaded processing, making it possible to use the maximum processing power available. Multi-thread processing is a mechanism for achieving higher performance by partitioning and processing the data in multiple threads simultaneously. Users can specify the maximum number of threads to be used; by default ASReml uses all threads available, up to 16. This parallel processing delivers some significant gains in speed and on occasion can be up to 70x faster. These gains depend upon a variety of factors including the microprocessor, machine power, data set size and type of analysis run, among others.

#### **New analytical procedures**

Beyond the optimization in terms of memory and speed, some new procedures have been implemented in ASReml 4.2. Pedigree qualifiers have been extended to allow the removal of unnecessary individuals. Pedigree trimming has been implemented, along with the option of absorbing parents without data. This pedigree pre-processing removes unnecessary individuals from a pedigree, speeding up likelihood evaluation while maintaining proper relationships among the core members, saving computation time.

Another new procedure is the extension of the bivariate analysis of generalized linear models. Previously, in the 4.1 version, ASReml allowed a bivariate analysis of a binomially distributed variate and a normally distributed variate with an identity link. ASReml 4.2 has extended this bivariate analysis so that both variates can now have Normal, Binomial, Poisson, Gamma or Negative Binomial distributions.

#### **Yes – the best is now better**

So, in answer to my question, yes, I believe the best did just get better! You can decide for yourself. For further details of the new features, improvements and updates in this release, users can check the [VSNi ASReml Knowledge Base](https://asreml.kb.vsni.co.uk/knowledge-base/new-features-benefits-asreml-4-2/).

#### **About the author**

Amanda Avelar de Oliveira is an Agronomist with an M.Sc. and Ph.D. in Genetics and Plant Breeding from the University of São Paulo (ESALQ/USP). She has experience in quantitative genetics, genomic prediction, field trial analysis and genotyping pipelines. Currently, she works as a consultant at VSN International, UK.

> "I believe in the power of knowledge sharing and multidisciplinary efforts to increase genetic gains in plant breeding while ensuring sustainability in agriculture."
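As a companion to the REML discussion above, here is a minimal Python sketch, using statsmodels rather than ASReml, of a simple variance-component model (fixed variety effects, random block effects) fitted by REML. The data, column names and effect sizes are all made up for illustration; it is a sketch of the idea, not of ASReml's syntax or algorithms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a small randomized block experiment: 4 varieties x 6 blocks.
rng = np.random.default_rng(42)
varieties = ["A", "B", "C", "D"]
blocks = [f"b{i}" for i in range(1, 7)]
variety_effect = {"A": 0.0, "B": 1.5, "C": -0.5, "D": 2.0}
block_effect = {b: rng.normal(0, 2.0) for b in blocks}   # random block effects

rows = []
for b in blocks:
    for v in varieties:
        y = 50 + variety_effect[v] + block_effect[b] + rng.normal(0, 1.0)
        rows.append({"y": y, "variety": v, "block": b})
df = pd.DataFrame(rows)

# Mixed model: fixed variety effects, random block intercepts, fitted by REML.
model = smf.mixedlm("y ~ variety", data=df, groups=df["block"])
result = model.fit(reml=True)
print(result.summary())   # fixed-effect estimates and variance components
```

The block variance reported in the summary is the kind of variance component that REML estimates; ASReml extends this idea to far larger and more complex variance structures.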


The VSNi Team

15 days ago
Using blocking to improve precision and avoid bias

When conducting an experiment, an important consideration is how to even out the variability among the experimental units to make comparisons between the treatments fair and precise. Ideally, we should try to minimize the variability by carefully controlling the conditions under which we conduct the experiment. However, there are many situations where the experimental units are non-uniform. For example:

* in a field experiment laid out on a slope, the plots at the bottom of the slope may be more fertile than the plots at the top,
* in a medical trial, the weight and age of subjects may vary.

When you know there are differences between the experimental units (and these differences may potentially affect your response), you can improve precision and avoid bias by _**blocking**_. Blocking involves grouping the experimental units into more-or-less homogeneous groups, so that the experimental units within each block are as alike as possible. For example, in the field experiment described above, plots would be blocked (i.e., grouped) according to slope, and in the medical trial, subjects would be blocked into groups of similar weight and age. Once the blocks are formed, the treatments are then randomized to the experimental units within each block.

> _Blocking is used to control nuisance variation by creating homogeneous groups of experimental units, known as blocks._

**Blocking can improve precision**

Let's look at an example[\[1\]](#_ftn1) to see how _**blocking improves the precision**_ of an experiment by reducing the unexplained variation. In this field trial, the yields (pounds per plot) of four strains of Gallipoli wheat were studied. During the design phase, the 20 experimental plots were grouped into five blocks (each containing 4 plots). Within each block, the four wheat strains (A, B, C and D) were randomly assigned to the plots. This is an example of a randomized complete block design (RCBD).

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_blocking1_ab7e105b9c.jpg)

> _In a randomized complete block design (RCBD)…_
> * _the experimental units are grouped into homogeneous blocks,_
> * _each block has the same number of experimental units, usually one for each treatment,_
> * _within each block, the treatments are randomly allocated to the experimental units so that each treatment occurs in each block the same number of times._

To demonstrate the advantage of blocking, we'll analyse the data in [Genstat](https://www.vsni.co.uk/software/genstat)[\[2\]](#_ftn1) both as a completely randomized design (CRD, which ignores the blocking) and as an RCBD (which takes the blocking into account). One of the assumptions behind a CRD is that the set of experimental units to which the treatments are applied is effectively homogeneous.

**CRD and RCBD layouts**

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_blocking2_a7b6bdd131.jpg) ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_blocking3_b49da88906.jpg)

**CRD ANOVA**

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_blocking4_18bc285329.jpg)

**RCBD ANOVA**

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_blocking5_d03206115d.jpg)

The ANOVA tables from the two analyses are given above. Notice that the ANOVA table for the RCBD has an additional line, "Blocks stratum". This records the variation _between_ blocks. Also note that the treatment effects (i.e., strains) are now estimated in the "Blocks.\*Units\* stratum", which represents the variation _within_ blocks. As a result:

**A**: the residual mean square (i.e., the unexplained variation) has decreased from 2.983 to 2.188

**B**: the standard error of the difference (s.e.d.) has decreased from 1.092 to 0.936.

That is, blocking has improved the precision of the experiment! This increase in precision means that we have a better chance of detecting differences between the wheat strains, making the experiment more efficient and giving it greater statistical power. (A small sketch of the same CRD-versus-RCBD comparison outside Genstat is given after this post.)

> _If you suspect that certain groups of experimental units may differ from each other, you can always use those groups as a blocking factor. If the differences do appear, your estimated treatment effects will be more precise than if you had not included blocking in the statistical model._

**Blocking can protect against bias**

Let's look at an example to see how blocking can _**guard against bias**_ by evening out the variability among experimental units.

Imagine you want to test a new manufacturing process at your factory by measuring levels of daily productivity over four weeks. However, experience tells you that production levels tend to be lower on Thursdays and Fridays, compared to earlier in the week, as employees' thoughts turn to going home for the weekend. Let's consider what might happen should you simply randomly select 10 days to use the old manufacturing process and 10 days to use the new.

The following table represents one possible randomization:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_blocking6_087a0ce889.jpg)

Notice that by not controlling for day of the week, the new manufacturing process is (randomly) over-represented on days where production naturally tends to be higher, whereas the old manufacturing process is (randomly) over-represented on Thursdays and Fridays, where production naturally tends to be lower, resulting in an unfair comparison.

Conversely, had you blocked by day of the week, the inherent differences between days would have been evened out and the bias they can potentially cause would no longer be an issue. For example, we could have a randomization like:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_blocking7_ce83391dbc.jpg)

Note that every treatment (manufacturing process) occurs the same number of times on every day of the week. That is, we have a balanced experiment that controls for bias due to day differences. Hence, any resulting production increase or decrease can be more confidently attributed to the manufacturing process used.

As shown above, blocking and randomization are critical aspects of good experimental design, providing us with increased precision and protection against bias.

> _You can learn more about blocking and experimental design in Genstat by watching this short YouTube video: [Experimental design in Genstat](https://www.youtube.com/watch?v=4ggSC0EBZGA)_

[\[1\]](#_ftnref1) Snedecor, G.W. (1946). Statistical Methods. The Iowa State College Press, Ames, Iowa, USA.

[\[2\]](#_ftnref1) This data set can be accessed from within [Genstat](https://www.vsni.co.uk/software/genstat). From the menu select **File | Open Example Data Sets**, then type "Wheatstrains.gsh" and click **Open**.
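To see the precision gain from blocking outside Genstat, here is a minimal Python sketch (statsmodels) that analyses a blocked trial both ways and compares the residual mean squares and s.e.d.s. The yields are simulated, not the Gallipoli wheat data, so the numbers will differ from those quoted above, but the pattern (smaller residual MS and s.e.d. for the RCBD analysis when blocks really differ) is the same.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Simulate an RCBD: 4 strains x 5 blocks, with real block-to-block differences.
rng = np.random.default_rng(1)
strains, blocks = ["A", "B", "C", "D"], ["I", "II", "III", "IV", "V"]
strain_eff = {"A": 0.0, "B": 2.0, "C": 1.0, "D": 3.0}
block_eff = dict(zip(blocks, rng.normal(0, 2.0, size=len(blocks))))
data = pd.DataFrame(
    [{"block": b, "strain": s,
      "y": 30 + strain_eff[s] + block_eff[b] + rng.normal(0, 1.5)}
     for b in blocks for s in strains])

# CRD analysis ignores the blocking; RCBD analysis includes a block term.
crd = ols("y ~ C(strain)", data=data).fit()
rcbd = ols("y ~ C(block) + C(strain)", data=data).fit()

for name, fit in [("CRD", crd), ("RCBD", rcbd)]:
    aov = anova_lm(fit)
    ms_resid = aov.loc["Residual", "mean_sq"]   # residual mean square
    r = len(blocks)                             # replicates per strain
    sed = np.sqrt(2 * ms_resid / r)             # s.e.d. for a strain comparison
    print(f"{name}: residual MS = {ms_resid:.3f}, s.e.d. = {sed:.3f}")
```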


Dr. Vanessa Cave

22 days ago
Pearson correlation vs simple linear regression

Both Pearson correlation and simple linear regression can be used to determine if two numeric variables are linearly related. Nevertheless, there are important differences between these two approaches.

* Pearson correlation is a measure of the strength and direction of the linear association between two numeric variables that makes no assumption of causality.
* Simple linear regression describes the linear relationship between a response variable (denoted by $y$) and an explanatory variable (denoted by $x$) using a statistical model, and this model can be used to make predictions.

The following table summarizes the key similarities and differences between Pearson correlation and simple linear regression.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_pearson_linear_regression1_2b2346af6d.jpg)

#### **Pearson correlation**

The Pearson correlation is a number ranging from -1 to 1 that represents the strength of the linear relationship between two numeric variables. A value of 1 corresponds to a perfect positive linear relationship, a value of 0 to no linear relationship, and a value of -1 to a perfect negative linear relationship.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_pearson_linear_regression2_e0f6b4899a.jpg)

The Pearson correlation $(r)$ between variables $x$ and $y$ is calculated using the formula:

$$ r = \frac{\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n (x_i-\bar x)^2\sum_{i=1}^n(y_i-\bar y)^2}} $$

where $\bar x$ is the mean of the $x$ values, $\bar y$ is the mean of the $y$ values, and $n$ is the sample size. The Pearson correlation is also known as the _product-moment correlation coefficient_ or simply the _correlation_.

#### **Simple linear regression**

If we are interested in the effect of an $x$ variate (i.e. a numeric explanatory or independent variable) on a $y$ variate (i.e. a numeric response or dependent variable), regression analysis is appropriate. Simple linear regression describes the response variable $y$ by the model:

$$ y = a + bx $$

where the coefficients $a$ and $b$ are the intercept and slope of the regression line respectively. The intercept $a$ is the value of $y$ when $x$ is zero. The slope $b$ is the change in $y$ for every one-unit change in $x$. When the correlation is positive, the slope $(b)$ of the regression line will be positive, and vice versa.

The above model is theoretical, and in practice there will be error. The statistical model is given by:

$$ y = \hat a + \hat b x + \epsilon $$

where $\epsilon$, a variate of residuals, represents the difference between the predicted values $(\hat y = \hat a + \hat b x)$ and the observed $y$ values. The "hat" (^) accent is used to denote values estimated from the observed data. The regression coefficients, $a$ and $b$, are estimated by least squares. This results in a regression line that minimizes the sum of squared residuals.

#### Example using Genstat

> _Check out our Genstat [tutorial video](https://youtu.be/wHatBwHLrnA) on Pearson correlation vs simple linear regression._

You can calculate the Pearson correlation, or fit a simple linear regression, using any general statistical software package; a minimal sketch in Python is shown below.
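For example, the following Python sketch (using scipy, with a small made-up dataset, not the concrete data used later in this post) computes $r$ directly from the formula above and fits the least-squares line; `scipy.stats.pearsonr` and `scipy.stats.linregress` give the same answers.

```python
import numpy as np
from scipy import stats

# Small made-up dataset for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3])

# Pearson correlation from the formula: r = S_xy / sqrt(S_xx * S_yy).
xd, yd = x - x.mean(), y - y.mean()
r_manual = np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2))

# The same quantities from scipy.
r, p_value = stats.pearsonr(x, y)      # correlation and its p-value
fit = stats.linregress(x, y)           # least-squares line y = a + b*x

print(f"r (manual) = {r_manual:.3f}, r (scipy) = {r:.3f}, p = {p_value:.4f}")
print(f"intercept a = {fit.intercept:.3f}, slope b = {fit.slope:.3f}")
print(f"predicted y at x = 4: {fit.intercept + fit.slope * 4:.3f}")
```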
Here, we're going to use [Genstat](https://www.vsni.co.uk/software/genstat).

**Example:** A concrete manufacturer wants to know if the hardness of their concrete depends on the amount of cement used to make it. They collected data from thirty batches of concrete:

| Amount of cement (x) | Hardness of concrete (y) | Amount of cement (x) | Hardness of concrete (y) |
|---|---|---|---|
| 16.3 | 58.1 | 14.7 | 46.3 |
| 23.2 | 80.1 | 20.2 | 65.9 |
| 18.2 | 47.4 | 19.5 | 65.5 |
| 20.8 | 55.5 | 20.5 | 53.5 |
| 21.9 | 65.3 | 25.6 | 77.7 |
| 23.2 | 72.7 | 24.4 | 73.7 |
| 17.8 | 65.5 | 21.7 | 64.9 |
| 24.2 | 81.8 | 18.3 | 57.5 |
| 16.2 | 53.7 | 20.0 | 62.8 |
| 23.2 | 61.1 | 20.7 | 58.4 |
| 20.4 | 58.7 | 24.4 | 70.5 |
| 14.6 | 52.0 | 25.8 | 78.6 |
| 17.7 | 58.8 | 20.6 | 62.0 |
| 15.9 | 56.4 | 17.5 | 55.5 |
| 15.6 | 49.4 | 26.3 | 67.7 |

The scatterplot of the data suggests that the two variables are linearly related:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_pearson_linear_regression3_fc14f0e58e.jpg)

Let's first assess whether there is evidence of a significant Pearson correlation between the hardness of the concrete and the amount of cement used to make it. Our null hypothesis is that the true correlation equals zero. Using Genstat, we can see that the correlation estimated from the data is 0.82 with a p-value of <0.001. That is, there is strong statistical evidence of a linear relationship between the two variables.

Here we see the Correlations menu settings and the output this produces in Genstat:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_pearson_linear_regression4_56fe28f9c9.jpg)

Note, the validity of our hypothesis test depends on several assumptions, including that $x$ and $y$ are continuous, jointly normally distributed (i.e. bivariate Normal), random variables. If the scatterplot indicates a non-linear relationship between $x$ and $y$, the bivariate Normal assumption is violated. You should also check whether both the $x$ and $y$ variables appear to be normally distributed. This can be done graphically (e.g. by inspecting a boxplot, histogram or Q-Q plot) or with a hypothesis test (e.g. the Shapiro-Wilk test). For both variables in our dataset, neither their boxplot nor the Shapiro-Wilk test indicates a lack of normality (a quick way to reproduce this check outside Genstat is sketched below).
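The normality checks just mentioned are easy to reproduce in Python with `scipy.stats.shapiro`. The sketch below runs the Shapiro-Wilk test on the cement and hardness values from the table above; consistent with the Genstat results, the p-values should be well above 0.05.

```python
import numpy as np
from scipy import stats

# Cement (x) and hardness (y) values for the thirty concrete batches above.
cement = np.array([16.3, 23.2, 18.2, 20.8, 21.9, 23.2, 17.8, 24.2, 16.2, 23.2,
                   20.4, 14.6, 17.7, 15.9, 15.6, 14.7, 20.2, 19.5, 20.5, 25.6,
                   24.4, 21.7, 18.3, 20.0, 20.7, 24.4, 25.8, 20.6, 17.5, 26.3])
hardness = np.array([58.1, 80.1, 47.4, 55.5, 65.3, 72.7, 65.5, 81.8, 53.7, 61.1,
                     58.7, 52.0, 58.8, 56.4, 49.4, 46.3, 65.9, 65.5, 53.5, 77.7,
                     73.7, 64.9, 57.5, 62.8, 58.4, 70.5, 78.6, 62.0, 55.5, 67.7])

# Shapiro-Wilk test of normality for each variable (H0: data are normal).
for name, values in [("cement", cement), ("hardness", hardness)]:
    w, p = stats.shapiro(values)
    print(f"{name}: W = {w:.3f}, p-value = {p:.3f}")
```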
![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_pearson_linear_regression5_ab8b20c7fe.jpg)

As the hardness of the concrete is assumed to represent a response to changes in the amount of cement, it is more informative to model the data using a simple linear regression. Here we see the _Linear Regression_ and _Linear Regression Options_ menu settings and the output these produce in Genstat:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_pearson_linear_regression6_049e20db7a.jpg)

Using Genstat, we obtain the following regression line:

**Predicted hardness of concrete = 15.91 + 2.297 × amount of cement**

That is, the model predicts that for every one-unit increase in the amount of cement used, the hardness of the concrete produced increases by 2.297 units. The predicted hardness with 20 units of cement is 15.91 + (2.297 × 20) = 61.85 units.

For a simple linear regression, we are also interested in whether there is evidence of a linear relationship with the explanatory variable. This can be assessed using the variance ratio (v.r.) which, under the null hypothesis of no linear relationship, has an F distribution. The p-value from this test is <0.001, providing strong statistical evidence of a relationship. The percentage variance accounted for summarizes how much variability in the data is explained by the regression model - in this example, 65.7%.

The residuals $(\epsilon)$ from the regression model are assumed to be independent and normally distributed with constant variance. The residual diagnostic plot (see above) is useful for helping check these assumptions. The histogram (top left), Normal plot (bottom left) and half-Normal plot (bottom right) are used to assess normality; the histogram should be reasonably symmetric and bell-shaped, and both the Normal plot and half-Normal plot should form roughly a straight line. The scatterplot of residuals against fitted values (top right) is used to assess the constant variance assumption; the spread of the residuals should be equal over the range of fitted values. It can also reveal violations of the independence assumption or a lack of fit; the points should be randomly scattered without any pattern. For our example, the residual diagnostic plot looks adequate.

#### **About the author**

Dr Vanessa Cave is an applied statistician, interested in the application of statistics to the biosciences, in particular agriculture and ecology. She is a team leader of Data Science at AgResearch Ltd, New Zealand's government-funded agricultural institute, and is a developer of the Genstat statistical software package. Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and a member of the Data Science Industry Advisory Group for the University of Auckland. She has a PhD in statistics from the University of St Andrews.

Vanessa has over a decade of experience collaborating with scientists, using statistics to solve real-world problems. She provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.


Dr. Vanessa Cave

a month ago
Type I and II errors

Hypothesis testing is the art of making decisions using data. It involves evaluating two mutually exclusive statements: the null hypothesis $H_0$ and the alternative hypothesis $H_a$. The strength of evidence against the null hypothesis, as provided by the observed data, is often measured using a [p-value](https://vsni.co.uk/blogs/what-is-a-p-value). The smaller the p-value, the stronger the evidence against the null hypothesis. You can learn more about p-values in one of our previous [blogs](https://vsni.co.uk/blogs/what-is-a-p-value).

When we make a decision using a hypothesis test there are four possible outcomes: two representing correct decisions and two representing incorrect decisions. The incorrect decisions are due to either _**Type I**_ or _**Type II**_ errors.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_type_I_and_II_errors_5332ecf77a.jpg)

A **Type I error**, or a _**false positive**_, occurs when we reject the null hypothesis when in fact it is true. In other words, this is the error of accepting the alternative hypothesis when the results can be attributed to chance. In hypothesis testing, the probability of making a Type I error is often referred to as the _**size**_ of the test or the _**level of significance**_ and is denoted by alpha (α). Typically, the Type I error rate is set to 0.05, giving a 1 in 20 chance (i.e. 5%) that a true null hypothesis will be rejected.

A **Type II error** is a _**false negative**_. This occurs when we fail to reject the null hypothesis when in fact it is false. The probability of making a Type II error is usually denoted by β and depends on the _**power**_ of the hypothesis test (β = 1 - power). You can reduce the chance of making a Type II error by increasing the power of your hypothesis test. Power can be increased by increasing the sample size, reducing variability (e.g. by blocking) or increasing the size of the test (i.e. using a larger significance level, at the cost of a higher Type I error rate). A small simulation illustrating both error rates is sketched after the author note below.

#### **About the author**

Dr Vanessa Cave is an applied statistician, interested in the application of statistics to the biosciences, in particular agriculture and ecology. She is a team leader of Data Science at AgResearch Ltd, New Zealand's government-funded agricultural institute, and is a developer of the Genstat statistical software package. Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and a member of the Data Science Industry Advisory Group for the University of Auckland. She has a PhD in statistics from the University of St Andrews.

Vanessa has over a decade of experience collaborating with scientists, using statistics to solve real-world problems. She provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
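To make the two error rates concrete, here is a minimal Python simulation (a sketch added for illustration, not part of the original post) that estimates the Type I error rate and the power of a two-sample t-test at α = 0.05; β is then 1 minus the estimated power. The sample size and effect size are made-up values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_sim = 0.05, 30, 10_000

reject_h0_true = 0    # rejections when H0 is actually true (Type I errors)
reject_h0_false = 0   # rejections when H0 is actually false (correct decisions)

for _ in range(n_sim):
    # Scenario 1: no real difference, so any rejection is a Type I error.
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        reject_h0_true += 1

    # Scenario 2: a true difference of 0.5 SD, so rejections are correct.
    c = rng.normal(0.5, 1.0, n)
    if stats.ttest_ind(a, c).pvalue < alpha:
        reject_h0_false += 1

type_1 = reject_h0_true / n_sim     # should be close to alpha (0.05)
power = reject_h0_false / n_sim     # P(reject H0 | H0 false)
print(f"Type I error rate ≈ {type_1:.3f}")
print(f"Power ≈ {power:.3f}, so Type II error rate (beta) ≈ {1 - power:.3f}")
```

Re-running with a larger `n` shows the power increasing (and β decreasing), exactly as described above.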


Dr. Salvador A. Gezan

a month ago
Genotype without phenotype … what a waste of my time!

I recently read a very interesting [opinion piece](https://www.theguardian.com/commentisfree/2021/jun/09/human-genome-genes-genetic-code) published in The Guardian. The author talks about the impact the [**Human Genome Project**](https://www.genome.gov/human-genome-project) (HGP) has had 20 years after the first draft of the human genome was published. Of course, this has been a great accomplishment, and today it is possible to have whole genome sequencing done in less than one week and for a fraction of the original cost. Now there are many more full genomes available for different animal and plant species. These constitute great scientific and technological advancements, and one cannot stop thinking what will be possible 20 years from now! Going back to that article, the author states a very critical aspect of the HGP that I copy in the following paragraph: <table style="border-bottom:2px outset hsl(240, 75%, 60%);border-left:2px outset hsl(240, 75%, 60%);border-right:2px outset hsl(240, 75%, 60%);border-top:2px outset hsl(240, 75%, 60%);"><tbody><tr><td><p><span style="color:#121212;">“The HGP has huge potential benefits&nbsp;for medicine and our understanding of human diversity and origins. But a blizzard of misleading rhetoric surrounded the project, contributing to the widespread and sometimes dangerous misunderstandings about genes that now bedevils the genomic age.“</span></p><p style="text-align:right;">Phillip Ball</p></td></tr></tbody></table> This project has been surrounded by lots of media attention and, as with many scientific communications, one of the things that concerns me was the expected future benefits from this sequencing project. I am not going into detail about all the promises stated (see original article for more details) but what is alarming is that this genome was sold as a ‘\_book of instructions\_’ and ‘\_nature’s complete genetic blueprint for building a human being\_’. <table style="border-bottom:2px outset hsl(240, 75%, 60%);border-left:2px outset hsl(240, 75%, 60%);border-right:2px outset hsl(240, 75%, 60%);border-top:2px outset hsl(240, 75%, 60%);"><tbody><tr><td><p>“<span style="color:#121212;">Misleading rhetoric has fuelled the belief that our genetic code is an ‘instruction book’ – but it’s far more interesting than that…”</span></p><p style="text-align:right;">Phillip Ball</p></td></tr></tbody></table> It is true that the genomic information is critical for understanding things such as gene diversity, propensity to diseases and deleterious mutations. Moreover, nowadays genomic information is used to make genomic predictions, for example **polygenic risk scores** for humans and **genomic breeding values** for plants and animals. But the big fallacy is that a genotype is a ‘**blueprint**’. Here, I disagree as a genotype without a phenotype does not go very far! All achievements in genomics, current and future, come from a close connection between **phenotypic** data and **genotypic** data. For example, genome-wide association (GWAS) used for finding markers to provide early detection of cancer relies on having thousands of individuals (or samples) identified at the different states of the disease. Likewise, for vegetables, identifying SNP markers for increased supermarket shelf-life requires phenotypic data as they age. In my view, the main fallacy in many over-promising genomic projects, including HGP, is the belief that genomics is all you need, reflecting a lack of understanding on how critical phenotypic data is. 
I have even encountered, among breeding managers, the statement that ‘_phenotypic data and field testing are no longer required if we have genomic data.’!_ Genes alone are thousands of small pieces of information, and there are so many complex aspects to consider, such as genes interacting with environment, and other high-order interactions at the gene level (such as dominance and epistasis), that can only be identified and understood with the use of complex computational tools paired with data on each genotype. <table style="border-bottom:2px outset hsl(240, 75%, 60%);border-left:2px outset hsl(240, 75%, 60%);border-right:2px outset hsl(240, 75%, 60%);border-top:2px outset hsl(240, 75%, 60%);"><tbody><tr><td><p>“<span style="color:#121212;">Life is not a readout of genes – it’s a far more interesting, subtle and contingent process than that.”&nbsp;</span></p><p style="text-align:right;">Phillip Ball</p></td></tr></tbody></table> Genomic data will stay with us for a long time. It has, and will become, cheaper to obtain and at some point it will be treated as a commodity. But good records that evaluate hundreds, thousands or even millions of unique individuals, is expensive and slow to obtain. To highlight a couple of things about this: firstly, increasing precision of phenotyping data will increase heritability; larger heritability values translate into better models, and therefore a higher chance of finding the actual true causal marker of a disease. Secondly, many breeding programs have large quantities of historical data. Often, for this data it is easier today to invest in genotyping than phenotyping, especially if DNA samples have been stored (as with semen or milt in animal breeding), or with DNA directly available from field trials (as with forest breeding programs); therefore, in these cases investment on phenotyping has already been done! Interestingly, the statistical tools that focus on phenotyping data, are not as ‘sexy’ as the genotyping tools. Here we talk about boring aspects such as: replication, blocking, randomization, and then regression analysis, linear (mixed) models, logistic regression, etc. But all these tools are well known and understood, and there is no excuse to ignore our statistical heritage.  Furthermore, statistical tools such as ASReml-R or Genstat are critical for understanding genotype versus phenotype. In a project such as the HGP, 20 years ago only some doors were opened, and as we collect more and more information, there will be many statistical (and computational) challenges, and we will need to develop new techniques that will make all of those promises from the HGP possible; albeit always closely connected to good phenotypic data, otherwise, this will be a waste of our time! #### **About the Author** Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world.  Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. 
His research has led to more than 100 peer-reviewed publications, and he is one of the co-authors of the textbook “_Statistical Methods in Biology: Design and Analysis of Experiments and Regression_”.

READ MORE

Arthur Bernardeli

2 months ago
Accounting for spatial heterogeneity in plant breeding field trials: a path of no return

Some statistical approaches can cope with spatial heterogeneity in different ways, but special attention must be given to the AR1 x AR1 error modeling. This type of spatial analysis can be performed in the ASReml-R package version 4 (Butler et al., 2017), and it is particularly directed at modeling the residual effects of a genetic/statistical model by estimating the autoregressive correlation of residuals $(\\xi)$ between columns and rows in a field. This specific random effect can be defined as $\\mathbf{\\xi} = \\{\\xi\_m\\} \\sim N(\\mathbf{0}, R)$ with $R = \\sigma\_\\xi^2 [AR1(\\rho\_c) \\otimes AR1(\\rho\_r)]$, and another effect, such as an independent or local error $(\\eta)$, can be added as an additional residual term. A recent study by Bernardeli et al. (2021) showed the benefits of performing spatial analysis in plant breeding studies. The authors evaluated seed composition traits (protein, oil, and storage protein) in a set of soybean field trials and compared several statistical models within and across trials. The models used were a baseline randomized complete block design (RCB), which is widely used in this type of study, and four variants considering different spatially structured residual terms. Despite the slightly greater computational effort required to fit them, the spatial approaches resulted in greater genetic gains, heritability and accuracy than the baseline model (RCB), as can be verified in the table below (adapted from Bernardeli et al., 2021). ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_spatial_heterogeneity_table_e131ddc01a.jpg) It is important to highlight that the Bayesian Information Criterion (BIC) was chosen to assist with model selection. In the cases where a spatial model was selected based on BIC, heritability and accuracy were superior; when the baseline model was the one selected, these genetic parameters remained unchanged. In summary, plant breeders should keep in mind that phenotype-based field trial analyses using AR1 x AR1 spatial models are at least equal to, often better than, and never worse than traditional analyses with independent errors (a short ASReml-R sketch of such a model is given at the end of this post). **References** Butler, D.G., Cullis, B.R., Gilmour, A.R., Gogel, B.G. and Thompson, R. 2017. ASReml-R Reference Manual Version 4. VSN International Ltd, Hemel Hempstead, HP1 1ES, UK. Bernardeli, A., Rocha, J.R.A.S.d.C., Borém, A., et al. 2021. Modeling spatial trends and enhancing genetic selection: An approach to soybean seed composition breeding. _Crop Science_, 1–13. https://doi.org/10.1002/csc2.20364.
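
To make the model above more concrete, here is a minimal ASReml-R (version 4) sketch that fits a baseline model with independent errors and an AR1 x AR1 spatial alternative. This is not the code used by Bernardeli et al. (2021): the data frame `trial_df` and the variables `protein`, `Genotype`, `Rep`, `Row` and `Col` are assumed for illustration, and the row and column factors are assumed to form a complete, regular grid.

```r
library(asreml)

# Hypothetical trial data: a trait, genotypes, replicates, and the field
# coordinates coded as factors (one level per row/column position).
trial_df <- within(trial_df, {
  Row      <- factor(Row)
  Col      <- factor(Col)
  Rep      <- factor(Rep)
  Genotype <- factor(Genotype)
})

# Baseline model: random genotype effects and independent (IID) residuals.
m_iid <- asreml(fixed = protein ~ 1, random = ~ Rep + Genotype,
                data = trial_df)

# Spatial model: separable AR1 x AR1 correlation for the residuals,
# i.e. R = sigma2_xi [AR1(rho_c) (x) AR1(rho_r)].
m_ar1 <- asreml(fixed = protein ~ 1, random = ~ Rep + Genotype,
                residual = ~ ar1(Col):ar1(Row),
                data = trial_df)

# Estimated variance components and row/column autocorrelations.
summary(m_ar1)$varcomp
```

Comparing the two fits with an information criterion such as BIC, as in the study above, indicates whether the spatial structure is worth keeping; the estimated autocorrelations for `Col` and `Row` appear in the variance component table.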

READ MORE

Dr. Vanessa Cave

2 months ago
Give me confidence! Why I want to see your confidence interval

As an applied statistician working on team research projects, on occasion I’ve battled the misguided view that my key contribution is “rubber stamping” p-values onto research output, and not just any p-value, ideally ones less than 0.05! I don’t dislike p-values; in fact, when used appropriately, I believe they play an important role in inferential statistics. However, simply reporting a p-value is of little value: we also need to understand the strength and direction of the effect and the uncertainty in its estimate. That is: I want to see the confidence interval (CI). Let me explain why. P-values are used to assess whether there is sufficient evidence of an effect. More formally, they provide a measure of the strength of evidence against the null hypothesis (of no effect). The smaller the p-value, the stronger the evidence for rejecting the null hypothesis in favour of an alternative hypothesis. However, the p-value does not convey any information on the size of the effect or the precision of its estimate. This missing information is very important, and it is here that CIs are extremely helpful. <table style="background-color:hsl(0, 0%, 90%);border-bottom:inset hsl(240, 75%, 60%);border-left:inset hsl(240, 75%, 60%);border-right:inset hsl(240, 75%, 60%);border-top:inset hsl(240, 75%, 60%);"><tbody><tr><td><p style="text-align:center;"><span style="color:#333333;">A confidence interval provides a range of plausible values within which we can be reasonably confident the true effect actually falls.</span></p></td></tr></tbody></table> Let’s look at an example. A few years ago I was involved in a research programme studying the effect a particular treatment had on the pH of meat. pH is important because it impacts the juiciness, tenderness and taste of meat. If we could manage pH, we could improve the eating quality. A randomised controlled experiment was conducted, the data analysed, and ta-dah! There was a statistically significant difference between the treatment and control means (p-value = 0.011). But that p-value alone doesn’t tell us: * which mean is lower? * _a lower pH is better_ * how big is the difference? * _a small difference in pH, such as less than 0.1, won’t result in a practical difference in eating quality for consumers_ * how precise is the estimated size of the effect? * _having enough precision is critical for meat producers to make an informed decision about the usefulness of the treatment_ A CI will give us this information. In our example, the 95% CI for the difference between the control and treatment mean is: $$0.013 \\le \\mu\_{control} - \\mu\_{treatment} \\le 0.098$$ This tells us that: * the mean pH for the treatment is indeed statistically lower than for the control * the 95% CI is narrow (i.e., the estimate of the effect is precise enough to be useful) but … * the effect size is likely too small to make any real-world practical difference in meat eating quality. As the above example illustrates, CIs enable us to draw conclusions about the direction and strength of the effect and the uncertainty of its estimation, whereas p-values are useful for indicating the strength of evidence against the null hypothesis. Hence, CIs are important for understanding the practical or biological importance of the estimated effect (a short code illustration is given at the end of this post). #### **About the author** Dr Vanessa Cave is an applied statistician, interested in the application of statistics to the biosciences, in particular agriculture and ecology. 
She is a team leader of Data Science at AgResearch Ltd, New Zealand's government-funded agricultural institute, and is a developer of the Genstat statistical software package. Vanessa is currently President of the Australasian Region of the International Biometric Society, past President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and a member of the Data Science Industry Advisory Group for the University of Auckland. She has a PhD in statistics from the University of St Andrews. Vanessa has over a decade of experience collaborating with scientists, using statistics to solve real-world problems. She provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
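
Returning to the confidence interval example above: the same function call that produces a p-value also produces the CI, so there is no extra work in reporting both. Below is a minimal R sketch with simulated data (not the actual meat pH measurements), assuming two groups of 30 observations.

```r
set.seed(42)

# Simulated pH data: a small true difference between control and treatment.
control   <- rnorm(30, mean = 5.60, sd = 0.08)
treatment <- rnorm(30, mean = 5.55, sd = 0.08)

# Welch two-sample t-test: the p-value measures the evidence against
# "no difference"; the CI shows how large the difference plausibly is
# and how precisely it has been estimated.
res <- t.test(control, treatment, conf.level = 0.95)

res$p.value   # strength of evidence against the null hypothesis
res$conf.int  # 95% CI for (mean of control - mean of treatment)
```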

READ MORE

The VSNi Team

2 months ago
Predicting the Performance of Genotypes by Multi-Environment Trial Analyses

Predicting the performance of novel plant genotypes in different environments is essential for developing cultivars with superior, economically important traits (such as yield) and resilience to environmental stresses (such as drought). <table style="background-color:hsl(0, 0%, 90%);border-bottom:solid hsl(240, 75%, 60%);border-left:solid hsl(240, 75%, 60%);border-right:solid hsl(240, 75%, 60%);border-top:solid hsl(240, 75%, 60%);"><tbody><tr><td><p><strong>Case Study</strong></p><p>The <a href="https://www.rpbc.co.nz/">Radiata Pine Breeding Company Ltd.</a> runs breeding programmes to improve the productivity and quality of wood from radiata pine trees for the Australasian forestry industry.&nbsp; Candidate pine trees (i.e., different genotypes) are tested in a <span style="color:blue;"><strong>multi-environment trial</strong></span><span style="color:black;">&nbsp;</span>(MET), a series of experiments conducted across a range of geographic locations in New Zealand and Australia, over multiple years. The set of experiments within and across years is designed to provide a range of growing conditions or “<span style="color:blue;"><strong>environments</strong></span>” that differ in climate, soil, and sometimes management. From these trials, the genetic and environmental effects on the performance of the candidate pine trees can be assessed and estimated. This information is then used by radiata pine breeders to select trees for future crossing, with the aim of further genetic improvement, and by growers of radiata pine to select cultivars predicted to perform well on their land.</p></td></tr></tbody></table> Plant breeders use MET data to evaluate genotypes across a range of environments. However, the relative performance of the genotypes often varies from environment to environment, a phenomenon known as the **genotype by environment (G×E) interaction**. This lack of consistency is actually a good thing, as it can be exploited to identify genotypes that perform well in all environments (i.e., are suitable for broad use) and those with exceptional performance in specific environments (i.e., are well suited for use in certain growing conditions). These are the concepts of broad and specific adaptability that are common in many breeding programs. <table><tbody><tr><td style="border-bottom:solid hsl(240, 75%, 60%);border-left:solid hsl(240, 75%, 60%);border-right:solid hsl(240, 75%, 60%);border-top:solid hsl(240, 75%, 60%);"><p><strong>A Brief History of MET Analysis</strong></p><ul><li>Original methods used analysis of variance (ANOVA) on the two-way table of genotype by environment means. Here, the total variation is partitioned into sources due to genotype, environment and residual variation (a combination of the G×E interaction and within-trial error). Estimates of the average genotype performance across environments are then obtained.</li></ul><p><span style="color:#FF3300;"><i><strong>Limitation:</strong></i></span> It does not provide information on the nature and architecture of the G×E interaction.</p><ul><li>A greater emphasis on understanding the G×E interaction led to the development of the AMMI method and the use of GGE biplots. 
These descriptive tools are great for visualising the relationships between genotypes and environments.</li></ul><p><span style="color:#FF3300;"><i><strong>Limitation:</strong></i></span> They do not provide simple numerical summaries that are useful for plant selection, and they do not handle unbalanced data well.</p><ul><li>Today, linear mixed models (LMM) are widely used for the analysis of MET data, in particular linear mixed models with a multiplicative factor analytic (FA) structure for the G×E effects and a focus on variance components. LMM have been found to perform extremely well in terms of parsimoniously describing the G×E interaction and predictive accuracy.</li></ul><p><span style="color:#FF3300;"><i><strong>Advantages:</strong></i></span> They help to understand the architecture and dynamics of the G×E interaction and handle unbalanced and messy data very well.</p></td></tr></tbody></table> Today, **linear mixed models (LMM)** are widely used in the analysis of MET data. The linear mixed model framework accommodates the analysis of genetically and/or experimentally correlated data, heterogeneous variances and unbalanced data sets simultaneously, enabling the accurate prediction of genotype performance within all environments in the data set. In addition, the ability to formally test statistical hypotheses provides greater insight into the nature of the G×E interaction. A state-of-the-art method for analysing MET data involves an LMM that adopts factor analytic (FA) variance structures for the G×E effects. The FA model provides a parsimonious, yet flexible, method of describing the G×E interaction. For example, it allows for genetic variance heterogeneity across trials and a different genetic correlation between each pair of trials. It can also be extended to include genetic relationship information (e.g., pedigree data) so that the genetic effects can be partitioned into additive and non-additive components. Furthermore, it typically has higher predictive accuracy than alternative models when there is a substantial G×E interaction. In addition, its parametrization in terms of BLUP effects and variance components is useful for identifying genotypes with broad or specific adaptability. A powerful, efficient and reliable software solution for fitting FA or other complex LMM is ASReml and its R library version [ASReml-R](https://www.vsni.co.uk/software/asreml-r). Both are particularly well suited to the analysis of MET data with a large number of genotypes and trials. ASReml has been widely cited in scientific publications, and it is broadly used by many industries for their commercial operations. ASReml and ASReml-R are popular choices for MET data analysts due to: * Flexible syntax that makes fitting complex linear mixed models possible and easy. * An efficient statistical algorithm for fitting the linear mixed model, which makes it feasible to analyse large and complex data sets. * A strong theoretical and statistical background providing reliable estimation and inference.
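
As a rough illustration of the factor analytic approach described above (not the Radiata Pine Breeding Company analysis), the ASReml-R sketch below fits an order-2 FA structure to hypothetical MET data. The data frame `met_df` and the variables `yield`, `Genotype` and `Trial` are assumptions for illustration, and in practice the residual term would usually also be modelled separately per trial (e.g., with spatial components).

```r
library(asreml)

# Hypothetical multi-environment trial data in long format.
met_df <- within(met_df, {
  Genotype <- factor(Genotype)
  Trial    <- factor(Trial)
})

# Order-2 factor analytic model for the genotype-by-trial effects: each
# trial gets its own genetic variance, and each pair of trials its own
# genetic correlation, described parsimoniously by two common factors.
m_fa2 <- asreml(fixed  = yield ~ Trial,
                random = ~ fa(Trial, 2):Genotype,
                data   = met_df)

# Predicted performance of every genotype in every environment.
pred <- predict(m_fa2, classify = "Genotype:Trial")
head(pred$pvals)
```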

READ MORE

The VSNi Team

2 months ago
Single record animal model in ASReml

A single record animal model is the simplest, and probably most important, linear mixed model (LMM) used in animal breeding. This model is called an **animal model** (or individual model) because we estimate a breeding value for each animal defined in the pedigree. The term **single record** refers to the fact that animals have only one phenotypic observation available (in contrast with repeated measures, where we have multiple records for each individual). **Breeding values** are random effects estimated as part of fitting the LMM, and these correspond to the genetic worth of an individual. In the case of progenitors, this is obtained as the mean deviation of the offspring of the given parent from the population mean. Breeding values are used to select individuals to constitute the next generation. In this blog we will present an analysis from a swine breeding program where the model has several fixed effects and a single additive genetic effect. The base model is: $y\_{ij} = µ + β\_j + a\_{ij} + e\_{ij}$ where $y\_{ij}$ is the phenotypic observation of animal $ij$, $µ$ is the overall mean, $β\_j$ is a fixed effect such as contemporary group $j$, $a\_{ij}$ is the random additive genetic value of animal $ij$, and $e\_{ij}$ is the random residual term. The fixed effects $β\_j$ in this case correspond to **contemporary groups**, but these effects can be any general nuisance factors or variables of interest to control for, such as herd, year and season (often combined into a single factor called hys), or sex, or continuous variables to be used as covariates, such as initial size or weight. In the above model, we have distributional assumptions associated with the random effects, specifically **a** ~ MVN(**0**, $σ\_a^2$**A**) and **e** ~ MVN(**0**, $σ^2$**I**). Here, the vector of breeding values **a** has a mean of zero and a variance-covariance matrix proportional to **A**, the numerator relationship matrix, with $σ\_a^2$ being the additive genetic variance (i.e., $σ\_a^2$ = VA). In addition, the vector of residuals has a mean of zero and an identity variance-covariance matrix (i.e., independent effects), with $σ^2$ being the error or residual variance. The above model is fitted with ASReml v4.1 using residual maximum likelihood (REML) to estimate variance components; fixed and random effects (i.e., BLUEs and BLUPs) are obtained by solving the set of mixed model equations. In ASReml we write a text file (with the extension ‘.as’) with all the specifications of the dataset to be used, together with the details of the model to be fitted. The full code is presented below: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_single_record_animal_model_1_da08ba1856.jpg) There are many elements in the above code, only a few of which we will discuss here, but more information can be found in the manual. The structure of the dataset ‘pig\_production.dat’ and a brief description are presented in lines 2 to 11, and for reference we present a few of the lines of this dataset below: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_single_record_animal_model_2_892cadf5ab.jpg) Some of the descriptions of these variables in the ‘.as’ file are critical. For example, `!P` is used to indicate that the factor `animID` is associated with a pedigree file, which will be the first file read. The qualifiers `!I` and `!A` are used to specify **factors** coded as integers or as alphanumeric values, respectively. 
When no qualifier is present the data column will be considered a continuous real **variable**, as with `BW100d`. In this example, the use of `!P` requires the reading of a pedigree file, which is critical for fitting an animal model, and in our example this file is the same as the data file. This is possible because the first three columns of the dataset correspond to the ones required for the pedigree: _Animal_, _Sire_ and _Dam_. Additional qualifiers are used for reading these files, such as `!SKIP` and `!ALPHA`. In addition, there are two important qualifiers associated with the analysis. These are `!MVINCLUDE`, which is required to keep the missing records associated with some of the factors, and `!DDF 1`, which requests the calculation of an ANOVA table for our model using approximate degrees of freedom. Finally, we find the model specification lines: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_single_record_animal_model_3_cc1b846726.jpg) The first term `BW100d` is the response variable. The following four terms `Breed`, `POB`, `MOB.YOB` and `SEX` are fixed effect factors. Here `MOB.YOB` corresponds to a combined factor of all combinations of `MOB` and `YOB` (recall these are month and year of birth, respectively). The use of `!r` precedes the definition of random effects; here the only term used is `ped(animID)`, which is the animal effect associated with the pedigree. The use of `ped()` is optional here, as the model term `animID` was already read with the qualifier `!P`, but it is good practice. Also note that we added a value `10` after this random effect; this is to assist the software with an initial guess of the additive variance component. The second line is required to define complex **variance structures** for residual terms. However, in this case we have a simple structure based on independent errors with a single variance (i.e., `idv`), and this is defined over the units that correspond to each observation. Other structures for random effects and the residual term are possible, and further details can be found in the manual. After fitting the model, a series of output files are produced with the same basename but different filename extensions. The most important outputs for the animal model above are ‘.asr’, ‘.sln’ and ‘.pvc’. The ‘.asr’ file contains a summary of the data, the iteration sequence, estimates of the variance parameters, and the analysis of variance table together with estimates of the fixed effects, among many other things, and also messages. In our dataset, the additive genetic and residual variances for `BW100d` were estimated to be 11.14 and 79.49 kg$^2$, respectively. Fixed effect tests for this trait show highly significant differences (p \< 0.01) for most factors, as shown below in an excerpt of this file. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_single_record_animal_model_4_02a532c866.jpg) Note that there is additional output in the ‘.asr’ file, probably more than you will normally need. Refer to the manual for additional details and definitions. The file ‘.sln’ has the solutions (BLUEs and BLUPs) from our analysis. 
A few lines of this output are presented below: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_single_record_animal_model_5_cb3fa8e223.jpg) There are columns to identify the factor and its levels, followed by the estimated effect and its associated standard error. For example, for animal 477, we note that its BLUP (in this case its estimated breeding value) is 1.022 kg above the mean. The complete list should help to select the best individuals for this swine breeding study. There was one additional element from the ‘.as’ file that we did not describe, and this corresponds to the command `!VPREDICT`, which is used to request the additional estimation of the narrow-sense heritability; this will be reported in the ‘.pvc’ file. The lines used corresponded to: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_single_record_animal_model_6_a93df7af30.jpg) Here we are requesting ASReml to generate a ‘.pin’ file with our variance prediction function request. In this case, we first take the variance associated with `ped(animID)` and call it `AddVar`, then we sum the variance for `animID` and the residual variance (identified as `idv(units)`), and finally we divide the first by this sum. Hence, we have just defined the expression $h^2 = σ\_a^2/(σ\_a^2 + σ^2)$. In the file ‘.pvc’ you will notice the following output: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_single_record_animal_model_7_2219ecb31b.jpg) The heritability $h^2$ for `BW100d` is 0.123 ± 0.034. Note that, as indicated, the standard errors are approximate, as this calculation uses the Delta method. There is another relevant output, found in the file ‘.aif’, that reports each individual’s inbreeding coefficient. This is relevant for selection and for the control of inbreeding in a program. ASReml produced this additional output because we used the `!AIF` qualifier, but we have not presented the output in this blog. ASReml has many other options and it can handle large datasets and fit many complex linear mixed models. Here we only presented a few of its capabilities, but if you want to learn more about ASReml check the [online resources here](https://asreml.kb.vsni.co.uk/article-categories/asreml-documentation/). You can find more details of this product at [https://www.vsni.co.uk/software/asreml](https://www.vsni.co.uk/software/asreml).
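
The job file above is for standalone ASReml; for readers working in R, the following is a rough ASReml-R equivalent of the same single-record animal model. It is a sketch under assumptions: the pedigree data frame `ped_df` (animal, sire, dam) and the data frame `pig_df` are stand-ins for the files described above, and the variance components are assumed to be reported in the order additive then residual (check `summary(m_an)$varcomp`).

```r
library(asreml)

# Inverse of the numerator relationship matrix A, built from the pedigree
# (the ASReml-R counterpart of reading the pedigree via !P).
ainv <- ainverse(ped_df)

# Ensure the identifiers and contemporary-group variables are factors.
pig_df <- within(pig_df, {
  animID <- factor(animID); Breed <- factor(Breed); POB <- factor(POB)
  MOB    <- factor(MOB);    YOB   <- factor(YOB);   SEX <- factor(SEX)
})

# Single-record animal model: fixed effects plus a random additive genetic
# effect linked to the pedigree through the A-inverse.
m_an <- asreml(fixed  = BW100d ~ Breed + POB + MOB:YOB + SEX,
               random = ~ vm(animID, ainv),
               data   = pig_df)

# Narrow-sense heritability h2 = Va / (Va + Ve), the analogue of !VPREDICT;
# V1 and V2 are assumed to be the additive and residual components.
vpredict(m_an, h2 ~ V1 / (V1 + V2))
```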

READ MORE

The VSNi Team

2 months ago
Should I drop the outliers from my analysis?

Outliers are sample observations that are either much larger or much smaller than the other observations in a dataset. Outliers can skew your dataset, so how should you deal with them? #### An example outlier problem Imagine Jane, the general manager of a chain of computer stores, has asked a statistician, Vanessa, to assist her with the analysis of data on the daily sales at the stores she manages. Vanessa takes a look at the data, and produces a boxplot for each of the stores as shown below. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_outliers_dataset_559514824a.jpg) ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_outliers_boxplot_d7c751f6be.jpg) #### What do you notice about the data? Vanessa pointed out to Jane the presence of outliers in the data from Store 2 on days 10 and 22. Vanessa recommended that Jane checks the accuracy of the data. _Are the outliers due to recording or measurement error?_ If the outliers can’t be attributed to errors in the data, Jane should investigate what might have caused the increased sales on these two particular days. Always investigate outliers - this will help you better understand the data, how it was generated and how to analyse it. #### Should we remove the outliers? Vanessa explained to Jane that we should never drop a data value just because it is an outlier. The nature of the outlier should be investigated before deciding what to do. Whenever there are outliers in the data, we should look for possible causes of error in the data. If you find an error but cannot recover the correct data value, then you should replace the incorrect data value with a missing value. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_outliers_message_ef53e91494.jpg) However, outliers can also be real observations, and sometimes these are the most interesting ones! If your outlier can’t be attributed to an error, you shouldn’t remove it from the dataset. Removing data values unnecessarily, just because they are outliers, introduces bias and may lead you to draw the wrong conclusions from your study. #### What should we do if we need/want to keep the outlier? * Transform the data: if the dataset is not normally distributed, we can try transforming the data to normalize it. For example, if the data set has some high-value outliers (i.e. is right skewed), the log transformation will “pull” the high values in. This often works well for count data. * Try a different model/analysis: different analyses may make different distributional assumptions, and you should pick one that is appropriate for your data. For example, count data are generally assumed to follow a Poisson distribution. Alternatively, the outliers may be able to be modelled using an appropriate explanatory variable. For example, computer sales may increase as we approach the start of a new school year. In our example, Vanessa suggested that since the mean for Store 2 is highly influenced by the outliers, the median, another measure of central tendency, seems more appropriate for summarizing the daily sales at each store. Using the statistical software [Genstat](https://www.vsni.co.uk/software/genstat), Vanessa can easily calculate both the mean and median number of sales per store for Jane. 
![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_outliers_summary_stats_5e7cb80553.jpg) Vanessa also analyses the data assuming the daily sales have Poisson distributions, by fitting a log-linear model. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_outliers_linear_model_1976f60b36.jpg) ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_outliers_linear_model_output_052e2eccb3.jpg) Notice that Vanessa has included “Day” as a blocking factor in the model to allow for variability due to temporal effects.   From this analysis, Vanessa and Jane conclude that the means (of the Poisson distributions) differ between the stores (p-value \< 0.001). Store 3, on average, has the most computer sales per day, whereas Stores 1 and 4, on average, have the least. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_outliers_predictions_5810c7c527.jpg) ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_outliers_predictions_output_60328e1808.jpg) There are other statistical approaches Vanessa might have used to analyse Jane’s sales data, including a one-way ANOVA blocked by Day on the log-transformed sales data and Friedman’s non-parametric ANOVA. Both approaches are available in Genstat’s comprehensive menu system. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_outliers_ANOVA_menus_213fcda00a.jpg) #### What is the best method to deal with outliers? There are many ways to deal with outliers, but no single method will work in every situation. As we have learnt, we can remove an observation if we have evidence it is an error. But, if that is not the case, we can always use alternative summary statistics, or even different statistical approaches, that accommodate them.
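
For readers who prefer code to Genstat’s menus, here is a minimal R sketch of the same kind of log-linear analysis, using made-up sales figures rather than Jane’s data; the variable names are hypothetical.

```r
set.seed(7)

# Hypothetical daily sales for 4 stores over 30 days (long format).
sales_df <- data.frame(
  Day   = factor(rep(1:30, times = 4)),
  Store = factor(rep(paste0("Store", 1:4), each = 30)),
  Sales = rpois(120, lambda = rep(c(10, 14, 20, 9), each = 30))
)

# Log-linear (Poisson) model with Day as a blocking factor, so the store
# comparison allows for day-to-day (temporal) variability.
fit <- glm(Sales ~ Day + Store, family = poisson, data = sales_df)
anova(fit, test = "Chisq")

# Robust summaries per store: the median is far less affected by outliers.
tapply(sales_df$Sales, sales_df$Store, median)
```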

READ MORE

The VSNi Team

3 months ago
Warning! The missing data problem

Missing data are a common problem, even in well-designed and managed research studies. They result in a loss of precision and statistical power, have the potential to cause substantial bias and often complicate the statistical analysis. As data analysts, it is crucial that we understand the reasons for the missing data and apply appropriate methods to account for them. Failure to do so leads to trouble! Let’s consider an example. The following figure illustrates data from a study on the intelligence quotient (IQ) of students living in a city, town, or village. A random selection of students was recruited into the study and their IQs measured. Unfortunately, IQ measurements were not available for all students in the study, resulting in _missing data_ (denoted by NA). To make valid inferences, it is very important that we understand why the data are missing, and what we can do about it. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_missing_data_heads_20f3c4fadd.png) There are three main types of missing data: _**Missing completely at random (MCAR)**_ Here missingness has nothing to do with the subject being studied: instead, the missing values are an entirely random subset of the data. In our example, the missingness could be considered MCAR if the missing IQ test results were accidentally deleted by the researcher. _**Missing at random (MAR)**_ Here missingness is not related to the value of the missing data but is related to other observed data. For example, we might find that younger students were less likely to complete the IQ test due to some factor unrelated to their IQ, such as a shorter attention span than older students. This missing data can be considered MAR, as failure to complete the test had nothing to do with their IQ (after accounting for age). _**Missing not at random (MNAR)**_ When the missing data are neither of the above, they are MNAR. In other words, missingness **is** related to the value of the missing data. For example, if failure to complete the IQ test was related to the student’s IQ, the missingness would be MNAR. When the data are MCAR, missingness doesn’t induce bias. However, this isn’t the case for MAR and MNAR, and great care needs to be taken to ensure that this does not lead to biased and misleading inferences. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_missing_data_warning_bb9c471df5.jpg) Let’s go back to our example. Assuming the MCAR mechanism, we could analyse the data in [Genstat](https://www.vsni.co.uk/software/genstat) using an unbalanced ANOVA. Assuming the data are MAR conditional on age, an unbalanced ANOVA with age as a covariate would be an appropriate analysis. But what to do if the missingness is MNAR? This is much more problematic. Indeed, the only way to reduce potential bias is if we can explicitly model the process that generated the missing data. Challenging indeed! As our example illustrates, when faced with missing data we must apply an appropriate statistical method to accommodate it. The choice of method will depend on the nature of the missingness, in addition to the type of data we have, the aims of our analysis, etc. _**But what about imputation?**_ Imputation replaces missing values with estimated values. That is, the dataset is _completed_ by filling in the missing values. Many different methods of imputation exist, including mean substitution, regression imputation, the EM algorithm, and last observation carried forward. 
Be warned, however: imputation may not overcome bias - indeed, it may also introduce it! In addition, it does not account for the uncertainty about the imputed missing values, resulting in standard errors that are too low. However, if the proportion of missing values in the dataset is small, imputation can be useful. Many statistical methods can handle datasets with missing values, e.g., maximum likelihood, [expectation maximization](https://www.statisticshowto.com/em-algorithm-expectation-maximization/) and [Bayesian models](https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7). Others, such as principal component analysis and some spatial or temporal mixed models, require complete datasets. Imputation allows us to apply statistical methods requiring complete datasets. As a final thought - the best possible solution for missing data is to prevent the problem from occurring. Careful planning and management of your study will help minimize the amount of missing data (or at least ensure it is MCAR or MAR!). When you have missing data, use an appropriate statistical technique to accommodate it. Statistical or computational techniques for imputing missing data should be the last resort.
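
To make the MCAR/MAR distinction above concrete, here is a minimal R sketch of the two analyses mentioned, assuming a hypothetical data frame `students` with columns `IQ` (containing NAs), `location` and `age`.

```r
# R drops rows with a missing IQ automatically (na.action = na.omit).

# If the data are MCAR, the complete cases are a random subset, so an
# unbalanced one-way analysis of location is unbiased.
fit_mcar <- lm(IQ ~ location, data = students)
anova(fit_mcar)

# If the data are MAR given age, include age as a covariate so that the
# location comparison accounts for the mechanism driving the missingness.
fit_mar <- lm(IQ ~ age + location, data = students)
anova(fit_mar)
```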

READ MORE

The VSNi Team

3 months ago
What is a p-value?

A way to decide whether to reject the null hypothesis (H0) against our alternative hypothesis (H1) is to determine the probability of obtaining a test statistic at least as extreme as the one observed under the assumption that H0 is true. This probability is referred to as the “p-value”. It plays an important role in statistics and is critical in most biological research. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_p_value_7e04a8f8c5.png) #### **What is the true meaning of a p-value and how should it be used?** P-values lie on a continuum (between 0 and 1) and provide a measure of the **strength of evidence** against H0. For example, a value of 0.066 indicates that, if H0 were true, we would observe a test statistic as large as or larger than the one calculated with a probability of 6.6%. Note that this p-value is NOT the probability that our alternative hypothesis is correct; it is only a measure of how likely or unlikely we are to observe such extreme events, under repeated sampling, relative to our calculated value. Also note that this p-value is obtained based on an assumed distribution (e.g., the t-distribution for a t-test); hence, the p-value will depend strongly on your (correct or incorrect) assumptions. The smaller the p-value, the stronger the evidence for rejecting H0. However, it is difficult to determine what a small value really is. This has led to the typical guidelines of p \< 0.001 indicating very strong evidence against H0, p \< 0.01 strong evidence, p \< 0.05 moderate evidence, p \< 0.1 weak evidence or a trend, and p ≥ 0.1 insufficient evidence \[1\], and to an ongoing debate on what this threshold should be. But declaring p-values as being either significant or non-significant based on an arbitrary cut-off (e.g. 0.05 or 5%) should be avoided. As [Ronald Fisher](https://mathshistory.st-andrews.ac.uk/Biographies/Fisher/) said: “No scientific worker has a fixed level of significance at which, from year to year, and in all circumstances he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” \[2\]. A very important aspect of the p-value is that it **does not** provide any evidence in support of H0 – it only quantifies evidence against H0. That is, a large p-value does not mean we can accept H0. Take care not to fall into the trap of accepting H0! Similarly, a small p-value tells you that rejecting H0 is plausible, and not that H1 is correct! For useful conclusions to be drawn from a statistical analysis, p-values should be considered alongside the **size of the effect**. Confidence intervals are commonly used to describe the size of the effect and the precision of its estimate. Crucially, statistical significance does not necessarily imply practical (or biological) significance. Small p-values can come from a large sample and a small effect, or a small sample and a large effect. It is also important to understand that the size of a p-value depends critically on the sample size (as this affects the shape of our distribution). With a very large sample size, H0 may always be rejected, even for extremely small differences and even if H0 is nearly (i.e., approximately) true. Conversely, with a very small sample size, it may be nearly impossible to reject H0 even if we observe extremely large differences. Hence, p-values also need to be interpreted in relation to the size of the study (a short simulation sketch is given at the end of this post). #### References \[1\] Ganesh H. and V. Cave. 2018. 
_P-values, P-values everywhere!_ New Zealand Veterinary Journal. 66(2): 55-56. \[2\] Fisher RA. 1956. _Statistical Methods and Scientific Inference_. Oliver and Boyd, Edinburgh, UK.
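
The sample-size point is easy to demonstrate by simulation. The sketch below (not part of the cited references) keeps the true difference fixed at one twentieth of a standard deviation and only changes the sample size.

```r
set.seed(1)

# p-value from a two-sample t-test for a fixed, tiny true difference.
p_for_n <- function(n, delta = 0.05, sd = 1) {
  x <- rnorm(n, mean = 0,     sd = sd)
  y <- rnorm(n, mean = delta, sd = sd)
  t.test(x, y)$p.value
}

p_for_n(n = 20)       # small sample: usually no "significant" result
p_for_n(n = 100000)   # huge sample: usually a tiny p-value, same tiny effect
```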

READ MORE

The VSNi Team

3 months ago
Evolution of statistical computing

It is widely acknowledged that the most fundamental developments in statistics in the past 60+ years have been driven by information technology (IT). We should not underestimate the importance of pen and paper as a form of IT, but it was when people started using computers to do statistical analysis that the role statistics plays in our research, as well as in everyday life, really changed. In this blog we will give a brief historical overview, presenting some of the main general statistics software packages developed from 1957 onwards. Statistical software developed for special purposes will be ignored. We also ignore the most widely used ‘software for statistics’, as Brian Ripley (2002) put it in his famous quote: “Let’s not kid ourselves: the most widely used piece of software for statistics is Excel.” Our focus is some of the packages developed by statisticians for statisticians, which are still evolving to incorporate the latest developments in statistics. ### **Ronald Fisher’s Calculating Machines** Pioneer statisticians like [Ronald Fisher](https://www.britannica.com/biography/Ronald-Aylmer-Fisher) started out doing their statistics on pieces of paper and later upgraded to using calculating machines. Fisher bought the first Millionaire calculating machine when he was heading Rothamsted Research’s statistics department in the early 1920s. It cost about £200 at that time, which is equivalent in purchasing power to about £9,141 in 2020. This mechanical calculator could only calculate direct products, but it was very helpful for the statisticians at that time, as Fisher mentioned: "Most of my statistics has been learned on the machine." The calculator was heavily used by Fisher’s successor [Frank Yates](https://mathshistory.st-andrews.ac.uk/Biographies/Yates/) (Head of Department 1933-1968) and contributed to much of Yates’ research, such as designs with confounding between treatment interactions and blocks, or split plots, or quasi-factorials. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/Frank_Yates_c50a5fbf55.jpg) _Frank Yates_ As the Rothamsted Annual Report for 1952 put it: "The analytical work has again involved a very considerable computing effort." ### **Beginning of the Computer Age** From the early 1950s we entered the computer age. The computer at this time looked little like its modern counterpart, whether it was an Elliott 401 from the UK or an IBM 700/7000 series machine in the US. Although the first documented statistical package, BMDP, was developed starting in 1957 for IBM mainframes at the UCLA Health Computing Facility, on the other side of the Atlantic Ocean statisticians at [Rothamsted Research](https://www.rothamsted.ac.uk/) began their endeavours to program on an Elliott 401 in 1954. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/Elliott_NRDC_401_computer_b39fd1bbe3.jpg) ### **Programming Statistical Software** When we teach statistics in schools or universities, students very often complain about the difficulties of programming. Looking back at programming in the 1950s will give modern students an appreciation of how easy programming today actually is! An Elliott 401 served one user at a time and requested all input on paper tape (forget your keyboard and intelligent IDE editor). It provided the output to an electric typewriter. 
All programming had to be in machine code, with the instructions and data on a rotating disk with a 32-bit word length, 5 "words" of fast-access store, 7 intermediate-access tracks of 128 words, and 16 further tracks selectable one at a time (= 2,949 words, minus 128 reserved for the system). ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/computer_paper_tape_99626ba274.jpg) _Computer paper tape_ The early statistical programs written at Rothamsted on these machines included fitting constants to main effects and interactions in multi-way tables (1957), regression and multiple regression (1956), and fitting many standard curves as well as multivariate analysis for latent roots and vectors (1955). Although the emergence of statistical programs for research sounded very promising, routine statistical analyses also had to be performed, and these still represented a big challenge, at least computationally. For example, in 1963, which was the last year with the [Elliott 401](https://www.ithistory.org/db/hardware/elliott-brothers-london-ltd/elliott-401) and [Elliott 402](https://www.ithistory.org/db/hardware/elliott-brothers-london-ltd/elliott-402) computers, Rothamsted Research statisticians analysed 14,357 data variables, and it took them 4,731 hours to complete the job. It is hard to imagine the energy consumption, as well as the amount of paper tape used for programming. Probably the paper tape (all glued together) would be long enough to circle the equator. ### **Development of Statistical Software: Genstat, SAS, SPSS** The above collection of programs was mainly used for agricultural research at Rothamsted and was not given an umbrella name until John Nelder became Head of the Statistics Department in 1968. The development of Genstat (General Statistics) started in that year and the programming was done in FORTRAN, initially on an IBM machine. In that same year, at North Carolina State University, SAS (Statistical Analysis System) was almost simultaneously developed by computational statisticians, also for analysing agricultural data to improve crop yields. At around the same time, social scientists at the University of Chicago started to develop SPSS (Statistical Package for the Social Sciences). Although the three packages (Genstat, SAS and SPSS) were developed for different purposes and their functions diverged somewhat later, the basic functions covered similar statistical methodologies. The first version of SPSS was released in 1968. In 1970, the first version of Genstat was released, with functions for ANOVA, regression, principal components and principal coordinate analysis, single-linkage cluster analysis and general calculations on vectors, matrices and tables. The first version of SAS, SAS 71, was released and named after the year of its release. The early versions of all three software packages were written in FORTRAN and designed for mainframe computers. From the 1980s, with the breakthrough of personal computers, a second generation of statistical software began to emerge. An MS-DOS version of Genstat (Genstat 4.03) with an interactive command-line interface was released in 1980. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/MSDOS_Genstat_4_03_619aab193a.jpg) _Genstat 4.03 for MSDOS_ Around 1985, SAS and SPSS also released versions for personal computers. In the 1980s more players entered this market: Stata was developed from 1985 and JMP from 1989. JMP was, from the very beginning, for Macintosh computers. 
As a consequence, JMP had a strong focus on visualization as well as graphics from its inception. ### **The Rise of the Statistical Language R** The development of the third generation of statistical computing systems had started before the emergence of software like Genstat 4.03e or SAS 6.01. This development had been led by John Chambers and his group at Bell Laboratories since the 1970s. The outcome of their work is the S language, which was developed into a general-purpose language with implementations for classical as well as modern statistical inference. The S language was freely available, and its audience was mainly sophisticated academic users. After the acquisition of the S language by the Insightful Corporation and its rebranding as S-PLUS, this leading third-generation statistical software package was widely used in both theoretical and practical statistics in the 1990s, especially before the release of a stable beta version of the free and open-source software R in the year 2000. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently widely used by statisticians in academia and industry, together with statistical software developers, data miners and data analysts. Software like Genstat, SAS, SPSS and many other packages had to deal with the challenge from R. Each of these long-standing software packages developed an R interface or even R interpreters to anticipate the change in user behaviour and the ever-increasing adoption of the R computing environment. For example, SAS and SPSS have some R plug-ins to talk to each other. VSNi’s ASReml-R software was developed for ASReml users who want to run mixed model analyses within the R environment, and at the present time there are more ASReml-R users than ASReml standalone users. Users who need reliable and robust mixed effects model fitting adopted ASReml-R as an alternative to other mixed model R packages due to its superior performance and simplified syntax. For Genstat users, msanova was also developed as an R package to provide traditional ANOVA users with an R interface to run their analyses. ### **What’s Next?** We have no clear idea of what will represent the fourth generation of statistical software. R, as open-source software and a platform for prototyping and teaching, has the potential to help drive this change in statistical innovation. An example is the R Shiny package, with which web applications can be easily developed to provide statistical computing as online services. But all open-source and commercial software has to face the same challenges of providing fast, reliable and robust statistical analyses that allow for reproducibility of research and, most importantly, use sound and correct statistical inference and theory, something that Ronald Fisher would have expected from his computing machine!

READ MORE

Dr. Salvador A. Gezan

3 months ago
Why do I hate zeros in my dataset?

It is always good practice to explore the data before you fit a model. A clear understanding of the dataset helps you to select the appropriate statistical approach and, in the case of linear models, to identify the corresponding design and treatment structure by defining relevant variates and factors. So, I have in my hands a dataset from a given study, and I proceed to explore it, maybe to do some data cleaning, but mainly to get familiar with it. Assessing predictors is important, but more critical is to evaluate the single or multiple response variables that need to be analysed. And it is in these columns where I often find surprises. Sometimes they contain not only numbers, as they should for linear model responses, but also non-numeric data. I have found comments (‘missing’ or ‘not found’), symbols (‘?’), and one or more definitions of missing values (‘NA’, ‘NaN’, ‘\*’, ‘.’ or even ‘-9’). But what is most disturbing to me are the ZEROS, and I especially hate them when they come en masse! But why do zeros make me angry?! Because their meaning is not clear, and they can be the cause of large errors, and ultimately incorrect models. Here are some of my reasons… ### **Missing Values** First, it is common to use zero as the definition for missing values. For example, a plant that did not have any fruit has a zero value. But what if the plant died before fruiting? Yes, it will have zero fruits, but here the experimental unit (the plant) no longer exists. In this case, there is a big difference between a true zero that was observed and a zero because of missing data. ### **Default Values** Second, zeros are sometimes used as default values in the columns of spreadsheets. That is, you start with a column of zeros that is replaced by true records. However, for many reasons data points may not be collected (for example, you could skip measuring your last replication), and hence some cells of the spreadsheet are not visited, and their values are unchanged from the zero default. Again, these are true missing values, and therefore they need to be recorded in a way that indicates that they were not observed! ### **Misleading Values** Third, zeros are often values reflecting measurements that are below the detection limit. For example, if the weighing balance cannot detect anything below 0.5 grams, then any seed weight below 0.5 grams will be recorded as a zero. Yes, we do have a range of seed weights reaching 23 grams, and a small portion might be below 1 gram, but in this case the zeros are not really zeros; they approximate a true but unknown record between 0 and 0.5 grams. When, during an initial exploration of the dataset, we discover that there are lots of zeros, we need to question why they are occurring. Of course, conversations with the researcher and the staff doing the data recording will give critical insight. This should help us separate the true zeros from the false ones. If there are no missing values recorded in the data, then we might suspect that some of these zeros are missing values. Here is where I like to explore additional columns (e.g., survival notes) to help ‘recover’ the missing values. However, it might be impossible to discriminate between the true zeros and the missing values if this extra information was not recorded in the dataset. This unfortunate situation, to the misfortune of my collaborators, might mean that the dataset must be completely discarded. In the case of missing values due to detection limits, the best approach is to ask the researcher. 
Here, I like to first make sure that this is really the case, and from there make an educated decision on how to analyse the data. Replacing undetected observations with a zero creates two undesired issues: 1. a bias, as these values are not zero but, as in our previous case, have an average value of, for example, 0.25 grams (i.e., half the detection limit); and 2. reduced background variability, as all undetected observations are recorded with exactly the same value when in fact they are not identical - we just can’t see this variability! Finally, there is another reason for me to hate zeros. Suppose that they are all verified, valid numbers, but that we still have a high proportion of zeros in our dataset. For example, in a study on fruit yield, I might have 20% of live plants producing no fruit, resulting in 20% true zeros in my dataset. This large proportion of zeros creates difficulties for traditional statistical analyses. For example, when fitting a linear model, the assumption of an approximate Normal distribution might no longer hold, and this will be reflected in residual plots with a strange appearance! So, what is the solution for this ‘excess’ of zeros? In some cases, a simple transformation could reduce the influence of these zeros in my analyses. Often, the most logical alternative is to rethink the biological process being modelled, and this might require something different from our typical statistical tools. For example, we could separate the process into two parts (a small code sketch of this idea is given at the end of this post). The first part separates the zeros from the non-zeros using a Binomial model that includes several explanatory variables (e.g., age, size, sex). The second part deals only with the non-zero values and fits another model based on, say, a Normal distribution, which includes the same or other explanatory variables; in this case we model the magnitude of the response. This is the basis of some of the hurdle models, but other statistical approaches, particularly Bayesian, are also available. In summary, I have many reasons to hate zeros, and you might have a few additional ones. However, I believe they are a critical part of data exploration: not only can they be the tip of an iceberg leading to a better understanding and modelling of the process by which the data were obtained, but they also help to identify potentially more adequate models to describe the system. Hence, perhaps I should embrace the zeros in my dataset and not be so angry about them! Salvador A. Gezan April, 2021 [LinkedIn](https://www.linkedin.com/in/salvador-gezan-54768a1a/)
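
As a concrete version of the two-part model described above, here is a minimal R example of a hurdle-type analysis, assuming a hypothetical data frame `d` with columns `fruit`, `age`, `size` and `sex`; dedicated hurdle or Bayesian implementations would be natural next steps.

```r
# Part 1: does a plant produce any fruit at all?  Logistic (Binomial) model.
zero_part <- glm(I(fruit > 0) ~ age + size + sex,
                 family = binomial, data = d)

# Part 2: given that fruit was produced, how much?  Model only the
# non-zero values, here on the log scale to tame the right skew.
pos_part <- lm(log(fruit) ~ age + size + sex,
               data = subset(d, fruit > 0))

summary(zero_part)
summary(pos_part)
```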

READ MORE

Dr. John Rogers

4 months ago
50 years of bioscience statistics

Earlier this year I had an enquiry from Carey Langley of VSNi as to why I had not renewed my Genstat licence. The truth was simple – I had decided to fully retire after 50 years as an agricultural entomologist / applied biologist / consultant. This prompted some reflections about the evolution of bioscience data analysis that I have experienced over that half century, a period during which most of my focus was the interaction between insects and their plant hosts; both how insect feeding impacts on plant growth and crop yield, and how plants impact on the development of the insects that feed on them and on their natural enemies. ### Where it began – paper and post My journey into bioscience data analysis started with undergraduate courses in biometry – yes, it was an agriculture faculty, so it was biometry not statistics. We started doing statistical analyses using full-keyboard Monroe calculators (for those of you who don’t know what I am talking about, you can find them [here](http://www.johnwolff.id.au/calculators/Monroe/Monroe.htm)). It was a simpler time and as undergraduates we thought it was hugely funny to divide 1 by 0 until the blue smoke came out… After leaving university in the early 1970s, I started working for the Agriculture Department of an Australian state government, at a small country research station. Statistical analysis was rudimentary to say the least. If you were motivated, there was always the option of running analyses yourself by hand, given the appearance of the first scientific calculators in the early 1970s. If you wanted a formal statistical analysis of your data, you would mail off a paper copy of the raw data to Biometry Branch… and wait. Some months later, you would get back your ANOVA, regression, or whatever the biometrician thought appropriate to do, on paper with some indication of which treatments were different from which other treatments. Dose-mortality data were dealt with by manually plotting them onto probit paper. ### Enter the mainframe In-house ANOVA programs running on central mainframes were a step forward some years later, as they at least enabled us to run our own analyses, as long as what you wanted was an ANOVA…. However, it also required a 2-hour drive to the nearest card reader, with the actual computer a further 1000 kilometres away.… The first desktop computer I used for statistical analysis was in the early 1980s and was a CP/M machine with two 8-inch floppy discs and, I think, 256k of memory, and booting it required turning a key and pressing the blue button - yes, really! And at about the same time, the local agricultural economist drove us crazy extolling the virtues of a program called Lotus 1-2-3! Having been brought up on a solid diet of the classic texts such as Steel and Torrie, Cochran and Cox, and Sokal and Rohlf, the primary frustration during this period was not having ready access to the statistical analyses you knew were appropriate for your data. Typical modes of operating for agricultural scientists in that era were randomised blocks of various degrees of complexity, hence the emphasis on ANOVA in the software that was available in-house. Those of us who also had less-structured ecological data were less well catered for. My first access to a comprehensive statistics package was during the early to mid-1980s at one of the American Land Grant universities. It was a revelation to be able to run virtually whatever statistical test was deemed necessary. 
Access to non-linear regression was a definite plus, given the non-linear nature of many biological responses. As well, being able to run a series of models to test specific hypotheses opened up new options for more elegant and insightful analyses. Looking back from 2021, such things look very trivial, but compared to where we came from in the 1970s, they were significant steps forward. ### Enter Genstat My first exposure to Genstat, VSNi’s stalwart statistical software package, was Genstat for Windows, Third Edition (1997). Simple things like the availability of residual plots made a difference for us entomologists, given that much of our data had non-normal errors; it took the guesswork out of whether and what transformations to use. The availability of regressions with grouped data also opened some previously closed doors.  After a deviation away from hands-on research, I came back to biological-data analysis in the mid-2000s and found myself working with repeated-measures and survival / mortality data, so ventured into repeated-measures restricted maximum likelihood analyses and generalised linear mixed models for the first time (with assistance from a couple of Roger Payne’s training courses in Hobart and Queenstown). Looking back, it is interesting how quickly I became blasé about such computationally intensive analyses that would run in seconds on my laptop or desktop, forgetting that I was doing ANOVAs by hand 40 years earlier when John Nelder was developing generalised linear models. How the world has changed! ### Partnership and support Of importance to my Genstat experience was the level of support that was available to me as a Genstat licensee. Over the last 15 years or so, as I attempted some of these more complex analyses, my aspirations were somewhat ahead of my abilities, and it was always reassuring to know that Genstat Support was only ever an email away. A couple of examples will flesh this out.  Back in 2008, I was working on the relationship between insect-pest density and crop yield using R2LINES, but had extra linear X’s related to plant vigour in addition to the measure of pest infestation. A support-enquiry email produced an overnight response from Roger Payne that basically said, “Try this”. While I slept, Roger had written an extension to R2LINES to incorporate extra linear X’s. This was later incorporated into the regular releases of Genstat. This work led to the clearer specification of the pest densities that warranted chemical control in soybeans and dry beans ([https://doi.org/10.1016/j.cropro.2009.08.016](https://doi.org/10.1016/j.cropro.2009.08.016) and [https://doi.org/10.1016/j.cropro.2009.08.015](https://doi.org/10.1016/j.cropro.2009.08.015)). More recently, I was attempting to disentangle the effects on caterpillar mortality of the two Cry insecticidal proteins in transgenic cotton and, while I got close, I would not have got the analysis to run properly without Roger’s support. The data was scant in the bottom half of the overall dose-response curves for both Cry proteins, but it was possible to fit asymptotic exponentials that modelled the upper half of each curve. 
The final double-exponential response surface I fitted with Roger’s assistance showed clearly that the dose-mortality response was stronger for one of the Cry proteins than the other, and that there was no synergistic action between the two proteins ([https://doi.org/10.1016/j.cropro.2015.10.013](https://doi.org/10.1016/j.cropro.2015.10.013)).

### The value of a comprehensive statistics package

One thing that I especially appreciate about having access to a comprehensive statistics package such as Genstat is the capacity to tease apart biological data to get at the underlying relationships. About 10 years ago, I was asked to look at some data on the impact of cold stress on the expression of the Cry2Ab insecticidal protein in transgenic cotton. The data set was seemingly simple – two years of pot-trial data where groups of pots were either left out overnight or protected from low overnight temperatures by being moved into a glasshouse, plus temperature data and Cry2Ab protein levels. A REML analysis, together with some correlations and regressions, enabled me to show that cold overnight temperatures did reduce Cry2Ab protein levels, that the effects persisted for up to 6 days after the cold period, and that the threshold for these effects was approximately 14 °C ([https://doi.org/10.1603/EC09369](https://doi.org/10.1603/EC09369)).

What I took from this piece of work is how powerful a comprehensive statistics package can be in teasing important biological insights out of seemingly very simple data. Note that I did not use any cutting-edge statistics, just a combination of REML, correlation and regression analyses, but these techniques guided the dissection of the relationships in the data and led to an elegant and insightful outcome.

### Final reflections

Looking back over 50 years of work, one thing stands out for me: the huge advances in the statistical analysis of biological data have allowed much more insightful analyses which, in turn, have allowed biological scientists to more elegantly pull apart the interactions between insects and their plant hosts. For me, Genstat has played a pivotal role in that process. I shall miss it.

**Dr John Rogers**
Research Connections and Consulting
St Lucia, Queensland 4067, Australia
Phone/Fax: +61 (0)7 3720 9065 | Mobile: 0409 200 701
Email: [john.rogers@rcac.net.au](mailto:john.rogers@rcac.net.au) | Alternate email: [D.John.Rogers@gmail.com](mailto:D.John.Rogers@gmail.com)

READ MORE

Dr. Salvador A. Gezan

5 months ago
Why machine learning is not (yet) working for genomic prediction

In plant and animal breeding, the use of genomic prediction has become widespread; it is currently being implemented in many species, resulting in increased genetic gains. In genomic prediction (GP), thousands of SNP markers are used as input to predict the performance of genotypes. A good model allows the performance of a genotype to be estimated before it is phenotypically measured, allowing cheaper and earlier selections and accelerating breeding programs. At present, most of these predictive models use the SNP marker information to fit linear models, in which each marker is associated with an estimated effect. These linear models reflect our current understanding of how allele effects accumulate, together with the infinitesimal model, in which the phenotypic response of an individual is the result of hundreds or thousands of QTLs, each with a small effect. (A small illustrative sketch of such a marker-effect model is given at the end of this post.)

### Machine Learning – the holy grail?

Machine learning (ML) has become widely used in many areas over the last few years. ML is a methodology in which computers are trained with large amounts of data to make predictions. There are many methods, but some of the most common are neural networks, random forests, and decision trees. In ML you do not need to understand the biological system; briefly, you provide the computer algorithm with huge amounts of training data and you obtain a predictive system that can be used to estimate responses. Of course, its implementation is more complex than this description, and a critical part is evaluating the quality of the predictive system obtained. ML has proven very useful in, for example, image classification tasks such as differentiating pictures of cats from dogs, among many other practical applications. Therefore, ML methods seem the logical tool for GP, particularly as we may have a genomic data set for our crop of interest with up to 200,000 SNPs obtained from hundreds or even thousands of individuals.

There have been several studies on the use of ML in GP, but the results have often been disappointing: our traditional genomic prediction methods (such as BayesB and GBLUP) have consistently been superior to most of the ML algorithms tested. Based on these studies, we are tempted to say that ML is not working for breeding and genomics. Yet this is a surprising result for a tool that is constantly praised in the media as very powerful and that is credited with solving many everyday predictive problems.

### Where Machine Learning is at a disadvantage…for now

So, currently ML is not a good option for use in GP, but… it is my belief that ML is simply still at a disadvantage relative to other GP methods, and that with time it might become as good as other approaches, or even the gold standard. Some of the reasons for this view are detailed below.

* ML requires large, often very large, amounts of data. This is usually not available in most of our current breeding programs. It is true that we have thousands, or even millions, of SNPs, but these are information-poor and highly correlated. In addition, the phenotypic records used to train these ML tools probably number only in the thousands, not the hundreds of thousands or millions reported in other fields where ML has been used successfully.
* We have a pretty good understanding of gene action. Note that ML is often a black box, where our understanding of the biological system is ignored.
However, for our GP models, we have good clarity on how the accumulation of alleles produces additive effects, and this can be extended to dominance effects. This, together with the dynamics of Mendelian and Fisherian genetics – a few QTLs with strong influences, or a large number of QTLs with small influences – has allowed us to use marker-assisted selection and pedigree-based analyses successfully over the last 50 years.
* There is an important gap between the computer scientists developing the ML tools and the breeders or quantitative geneticists who would use them. Most successful breeding programs have a strong statistical component for the design and analysis of experiments, and with the arrival of genomic data we have extended our models from pedigree-based analyses to marker-based analyses, or a combination of the two. However, computationally intensive and rapidly evolving ML methods have remained elusive to most breeding programs, and in some cases this is compounded by a limited understanding of the software that trains the ML models.

The routine implementation of ML in breeding programs will take some time. But as we accumulate information, and as we learn and interact with ML software and its routines, we will slowly see it being used in our crops. This will not mean the end of our more traditional tools or their replacement by ML applications: our understanding of the biology and the specific nature of our crops will keep the current toolbox valuable. At present, machine learning is not ready for breeding, but in due time it will creep up next to us!

Salvador A. Gezan, March 2021 | [LinkedIn](https://www.linkedin.com/in/salvador-gezan-54768a1a/)
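As promised above, here is a minimal, purely illustrative sketch of the kind of linear marker-effect model that underlies most current GP tools. It uses simulated genotypes and phenotypes with plain ridge regression in base R; all object names (M, beta_true, gebv) and the shrinkage value lambda are made up for the example, and it is a toy stand-in for GBLUP or BayesB rather than a recommendation of any particular implementation.

```r
# Toy genomic prediction via ridge regression (RR-BLUP-style shrinkage).
# Purely illustrative: simulated genotypes and phenotypes, made-up names.
set.seed(42)

n <- 200     # individuals
m <- 1000    # SNP markers
M <- matrix(rbinom(n * m, 2, 0.3), nrow = n)   # marker matrix coded 0/1/2

# Infinitesimal-style truth: many small additive marker effects
beta_true <- rnorm(m, mean = 0, sd = 0.05)
y <- as.vector(M %*% beta_true) + rnorm(n, sd = 1)

# Centre the markers and estimate all effects jointly with a common shrinkage penalty
Z <- scale(M, center = TRUE, scale = FALSE)
lambda <- 10   # shrinkage parameter (in practice estimated, e.g. via REML or cross-validation)
beta_hat <- solve(crossprod(Z) + diag(lambda, m), crossprod(Z, y - mean(y)))

# Genomic estimated breeding values; in-sample fit only
# (a real evaluation would use cross-validation on individuals not in the training set)
gebv <- mean(y) + Z %*% beta_hat
cor(gebv, y)
```

GBLUP is an equivalent formulation of this ridge model in which the centred markers are collapsed into a genomic relationship (G) matrix, and the shrinkage parameter corresponds to the ratio of residual to marker variance, normally estimated by REML rather than fixed in advance.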

READ MORE

Kanchana Punyawaew and Dr. Vanessa Cave

5 months ago
Mixed models for repeated measures and longitudinal data

The term “**repeated measures**” refers to experimental designs or observational studies in which each experimental unit (or subject) is measured repeatedly over time or space. “**Longitudinal data**” is a special case of repeated measures in which variables are measured over time (often for a comparatively long period of time) and duration itself is typically a variable of interest. In terms of data analysis, it doesn’t really matter what type of data you have, as you can analyze both using mixed models. Remember, the key feature of both types of data is that the response variable is measured more than once on each experimental unit, and these repeated measurements are likely to be correlated.

### Mixed Model Approaches

To illustrate the use of mixed model approaches for analyzing repeated measures, we’ll examine a data set from Landau and Everitt’s 2004 book, _A Handbook of Statistical Analyses using SPSS_. Here, a double-blind, placebo-controlled clinical trial was conducted to determine whether an estrogen treatment reduces post-natal depression. Sixty-three subjects were randomly assigned to one of two treatment groups: placebo (27 subjects) and estrogen treatment (36 subjects). Depression scores were measured on each subject at baseline, i.e. before randomization (_predep_), and at six two-monthly visits after randomization (_postdep_ at visits 1-6). However, not all the women in the trial had their depression score recorded at all scheduled visits. In this example, the data were measured at fixed, equally spaced, time points. (_Visit_ is time as a factor and _nVisit_ is time as a continuous variable.) There is one between-subject factor (_Group_, i.e. the treatment group, either placebo or estrogen treatment), one within-subject factor (_Visit_ or _nVisit_) and a covariate (_predep_).

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/repeated_measures_data_4f63d505a9_20e39072bf.png)

Using the following plots, we can explore the data. In the first plot below, the depression scores for each subject are plotted against time, including the baseline, separately for each treatment group.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/repeated_measures_1_4149bce2a1_20e3c0f240.png)

In the second plot, the mean depression score for each treatment group is plotted over time. From these plots, we can see that there is variation among subjects within each treatment group, that depression scores generally decrease with time, and that, on average, the depression score at each visit is lower with the estrogen treatment than with the placebo.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/repeated_measures_2_92810e7fc9_da9b1e85ff.png)

### Random effects model

The simplest approach for [analyzing repeated measures data](https://www.theanalysisfactor.com/repeated-measures-approaches/) is to use a random effects model with _**subject**_ fitted as random. It assumes a constant correlation between all observations on the same subject. The analysis objective can be either to measure the average treatment effect over time, or to assess treatment effects at each time point and test whether treatment interacts with time. In this example, the treatment (_Group_), time (_Visit_), treatment by time interaction (_Group:Visit_) and baseline (_predep_) effects can all be fitted as fixed.
The subject effects are fitted as random, allowing for constant correlation between depression scores taken on the same subject over time. The code and output from fitting this model in [ASReml-R 4](https://www.vsni.co.uk/software/asreml-r) follow:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/4_020d75dee9.png)

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/5_ef250deb61.png)

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/6_15e353865d.png)

The output from summary() shows that the estimates of the subject and residual variances are 15.10 and 11.53, respectively, giving a total variance of 15.10 + 11.53 = 26.63. The Wald tests (from the wald.asreml() table) for _predep_, _Group_ and _Visit_ are significant (probability level (Pr) ≤ 0.01). There appears to be no interaction between treatment group and time (_Group:Visit_), i.e. the probability level is greater than 0.05 (Pr = 0.8636).

### Covariance model

In practice, the correlation between observations on the same subject is often not constant: it is common to expect that measurements made closer together in time are more highly correlated than those made further apart. Mixed models can accommodate many different covariance patterns, and ideally we select the pattern that best reflects the true covariance structure of the data. A typical strategy is to start with a simple pattern, such as compound symmetry or first-order autoregressive, and test whether a more complex pattern leads to a significant improvement in the likelihood. Note: using a covariance model with a simple correlation structure (i.e. uniform) will provide the same results as fitting a random effects model with random subject. In ASReml-R 4 we use the corv() function on time (i.e. _Visit_) to specify uniform correlation between depression scores taken on the same subject over time.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/7_3f3a2b825a.png)

Here, the estimate of the correlation among times (_Visit_) is 0.57 and the estimate of the residual variance is 26.63 (identical to the total variance of the random effects model, asr1). This is no coincidence: the uniform correlation is simply the intraclass correlation implied by the random effects model, 15.10 / 26.63 ≈ 0.57. Specifying a heterogeneous first-order autoregressive covariance structure is easily done in ASReml-R 4 by changing the variance-covariance function in the residual term from corv() to ar1h().

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/8_27fce61956.png)

### Random coefficients model

When the relationship of a measurement with time is of interest, a [random coefficients model](https://encyclopediaofmath.org/wiki/Random_coefficient_models) is often appropriate. In a random coefficients model, time is considered a continuous variable, and the subject and subject by time interaction (_Subject:nVisit_) are fitted as random effects. This allows the slopes and intercepts to vary randomly between subjects, resulting in a separate regression line being fitted for each subject. Importantly, the slopes and intercepts are allowed to be correlated.
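The calls themselves appear above only as screenshots, so before turning to the random coefficients fit in detail, here is a hedged sketch of roughly what the model fits described in this post could look like in ASReml-R 4. The response name (dep), the data frame name (dat) and the exact str() specification are assumptions on my part rather than code from the original post, and should be checked against the ASReml-R 4 documentation; the role of str() is explained next.

```r
library(asreml)

# Assumed data layout: a long-format data frame 'dat' with factors Subject, Group
# and Visit, a numeric time variable nVisit, the baseline covariate predep, and a
# response column here called 'dep' (names are illustrative).

# 1. Random effects (random intercept) model: constant correlation within subject
asr1 <- asreml(fixed  = dep ~ predep + Group + Visit + Group:Visit,
               random = ~ Subject,
               data   = dat)

# 2. Equivalent covariance model: uniform correlation on the residuals via corv(),
#    with the data assumed sorted by Subject and then Visit
asr2 <- asreml(fixed    = dep ~ predep + Group + Visit + Group:Visit,
               residual = ~ id(Subject):corv(Visit),
               data     = dat)

# 2b. Heterogeneous first-order autoregressive structure: swap corv() for ar1h()
asr3 <- asreml(fixed    = dep ~ predep + Group + Visit + Group:Visit,
               residual = ~ id(Subject):ar1h(Visit),
               data     = dat)

# 3. Random coefficients model: correlated random intercepts and slopes per subject
#    (str() specification assumed from ASReml-R 4 conventions; see the manual)
asr4 <- asreml(fixed  = dep ~ predep + Group + nVisit,
               random = ~ str(~ Subject + Subject:nVisit, ~ us(2):id(Subject)),
               data   = dat)

summary(asr1)$varcomp   # variance components (subject and residual)
wald.asreml(asr1)       # sequential Wald tests for the fixed effects
```

A useful check on such a sketch is that asr2 reproduces asr1: the uniform residual correlation should equal the intraclass correlation of the random effects model, as noted above.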
The str() function in the asreml() call is used to fit a random coefficients model:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/9_ec27199248.png)

The summary table contains the variance parameters for _Subject_ (the set of intercepts, 23.24) and _Subject:nVisit_ (the set of slopes, 0.89), the estimate of the correlation between the slopes and intercepts (-0.57) and the estimate of the residual variance (8.38).

### References

West, B.T., Welch, K.B. and Galecki, A.T. (2007). _Linear Mixed Models: A Practical Guide Using Statistical Software_. Chapman & Hall/CRC, Taylor & Francis Group, LLC.

Brown, H. and Prescott, R. (2015). _Applied Mixed Models in Medicine_, Third Edition. John Wiley & Sons Ltd, England.

Landau, S. and Everitt, B.S. (2004). _A Handbook of Statistical Analyses using SPSS_. Chapman & Hall/CRC Press LLC.

READ MORE

Kanchana Punyawaew

5 months ago
Linear mixed models: a balanced lattice square

This blog illustrates how to analyze data from a field experiment with a balanced lattice square design using linear mixed models. We’ll consider two models: the balanced lattice square model and a spatial model. The example data are from a field experiment conducted at Slate Hall Farm, UK, in 1976 (Gilmour _et al_., 1995). The experiment was set up to compare the performance of 25 varieties of barley and was designed as a balanced lattice square with six replicates laid out in a 10 x 15 rectangular grid. Each replicate contained exactly one plot for every variety. The variety grown in each plot, and the coding of the replicates and lattice blocks, is shown in the field layout below:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_layout_7f57633d37_892b6cf234.png)

There are seven columns in the data frame: five blocking factors (_Rep, RowRep, ColRep, Row, Column_), one treatment factor, _Variety_, and the response variate, _yield_.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_data_bd9f4ee008_06c8a6e6fc.png)

The six replicates are numbered from 1 to 6 (_Rep_). The lattice block numbering is coded within replicates. That is, within each replicate the lattice rows (_RowRep_) and lattice columns (_ColRep_) are both numbered from 1 to 5. The _Row_ and _Column_ factors define the row and column positions within the field (rather than within each replicate).

### Analysis of a balanced lattice square design

To analyze the response variable, _yield_, we need to identify the two basic components of the experiment: the treatment structure and the blocking (or design) structure. The treatment structure consists of the set of treatments, or treatment combinations, selected to study or to compare. In our example, there is one treatment factor with 25 levels, _Variety_ (i.e. the 25 different varieties of barley). The blocking structure of replicates (_Rep_), lattice rows within replicates (_Rep:RowRep_), and lattice columns within replicates (_Rep:ColRep_) reflects the balanced lattice square design. In a mixed model analysis, the treatment factors are (usually) fitted as fixed effects and the blocking factors as random. The balanced lattice square model is fitted in [ASReml-R4](https://www.vsni.co.uk/software/asreml-r) using the following code:

```r
lattice.asr <- asreml(fixed = yield ~ Variety,
                      random = ~ Rep + Rep:RowRep + Rep:ColRep,
                      data = data1)
```

The REML log-likelihood is -707.786. The model’s BIC is:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_2_ac553eac69_6d6d40e073.jpg)

The estimated variance components are:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_3_69e11e2dff_c34641a3a9.jpg)

The table above contains the estimated variance components for all terms in the random model. The variance component measures the inherent variability of the term, over and above the variability of the sub-units of which it is composed. The variance components for _Rep_, _Rep:RowRep_ and _Rep:ColRep_ are estimated as 4263, 15596, and 14813, respectively. In this experiment, the lattice rows and columns within replicates are therefore more variable than the replicates themselves. The _"units!R"_ component is the residual variance.
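The log-likelihood, BIC and variance components reported above are shown only as images, so here is a short, hedged sketch of how they might be pulled from the fitted object. The summary() component names ($loglik, $bic, $varcomp) are my assumption based on ASReml-R 4 conventions; check them against str(summary(lattice.asr)) in your version.

```r
# Extract fit statistics and variance components from the lattice model
summary(lattice.asr)$loglik    # REML log-likelihood (reported above as -707.786)
summary(lattice.asr)$bic       # Bayesian Information Criterion
summary(lattice.asr)$varcomp   # components for Rep, Rep:RowRep, Rep:ColRep and units!R
```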
By default, fixed effects in ASReml-R4 are tested using sequential Wald tests:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_4_e237aed045_274881533e.jpg)

In this example, there are two terms in the summary table: the overall mean, (_Intercept_), and _Variety_. As the tests are sequential, the effect of _Variety_ is assessed by calculating the change in sums of squares between the two models (_Intercept_)+_Variety_ and (_Intercept_). The p-value (Pr(Chisq)) of \< 2.2 × 10⁻¹⁶ indicates that _Variety_ is highly significant. The predicted means for _Variety_ can be obtained using the predict() function. The standard error of the difference (SED) between any pair of variety means is 62. Note: all pairs of variety means have the same SED because the design is balanced.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_5_575ede3e94_5b9209f7c3.jpg)

Note: the same analysis is obtained when the random model is redefined as replicates (_Rep_), rows within replicates (_Rep:Row_) and columns within replicates (_Rep:Column_).

### Spatial analysis of a field experiment

As the plots are laid out in a grid, the data can also be analyzed using a spatial model. We’ll illustrate spatial analysis by fitting a model with a separable first-order autoregressive process in the field row (_Row_) and field column (_Column_) directions. This is often a useful model with which to start the spatial modeling process. The separable first-order autoregressive spatial model is fitted in ASReml-R4 using the following code:

```r
spatial.asr <- asreml(fixed = yield ~ Variety,
                      residual = ~ ar1(Row):ar1(Column),
                      data = data1)
```

The BIC for this spatial model is:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_6_3b978358f9_e792bcc2bd.jpg)

The estimated variance components and sequential Wald tests are:

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_7_82255b3b94_b5bc40e6ab.jpg)

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_8_544d852c25_53b792377f.jpg)

The residual variance is 38713, the estimated row correlation is 0.458, and the estimated column correlation is 0.684. As for the balanced lattice square model, there is strong evidence of a _Variety_ effect (p-value \< 2.2 × 10⁻¹⁶). A [log-likelihood ratio test](https://www.statisticshowto.com/likelihood-ratio-tests/) cannot be used to compare the balanced lattice square model with the spatial model, as the variance models are not nested. However, the two models can be compared using BIC. As the spatial model has a smaller BIC (1415) than the balanced lattice square model (1435), it is chosen as the preferred model of the two explored in this blog. However, selecting the optimal spatial model can be difficult. The current spatial model could be extended by including measurement error (a nugget effect) or revised by selecting a different variance model for the spatial effects.

### References

Butler, D.G., Cullis, B.R., Gilmour, A.R., Gogel, B.G. and Thompson, R. (2017). _ASReml-R Reference Manual Version 4_. VSN International Ltd, Hemel Hempstead, HP2 4TP, UK.

Gilmour, A.R., Anderson, R.D. and Rae, A.L. (1995). The analysis of binomial data by a generalised linear mixed model. _Biometrika_ 72: 593-599.
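As a closing footnote, here is a hedged sketch of how the two (non-nested) models above might be compared and how the variety predictions could be extracted. The summary() component name ($bic) and the pvals/sed components returned by predict() are assumptions based on ASReml-R 4 conventions and should be verified against the reference manual.

```r
# Compare the balanced lattice square and spatial models by BIC (smaller is better)
summary(lattice.asr)$bic
summary(spatial.asr)$bic

# Predicted variety means and standard errors of differences from the preferred model
pv <- predict(spatial.asr, classify = "Variety", sed = TRUE)
pv$pvals            # predicted means for the 25 varieties
pv$sed[1:5, 1:5]    # corner of the matrix of SEDs between pairs of variety means
```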
