##### Should I drop the outliers from my analysis?

The VSNi Team

a month ago

Outliers are sample observations that are either much larger or much smaller than the other observations in a dataset. Outliers can skew your dataset, so how should you deal with them?

#### An example outlier problem

Imagine Jane, the general manager of a chain of computer stores, has asked a statistician, Vanessa, to assist her with the analysis of data on the daily sales at the stores she manages. Vanessa takes a look at the data, and produces a boxplot for each of the stores as shown below.

#### What do you notice about the data?

Vanessa pointed out to Jane the presence of outliers in the data from Store 2 on days 10 and 22. Vanessa recommended that Jane checks the accuracy of the data. Are the outliers due to recording or measurement error? If the outliers can’t be attributed to errors in the data, Jane should investigate what might have caused the increased sales on these two particular days. Always investigate outliers - this will help you better understand the data, how it was generated and how to analyse it.

#### Should we remove the outliers?

Vanessa explained to Jane that we should never drop a data value just because it is an outlier. The nature of the outlier should be investigated before deciding what to do.

Whenever there are outliers in the data, we should look for possible causes of error in the data. If you find an error but cannot recover the correct data value, then you should replace the incorrect data value with a missing value.

However, outliers can also be real observations, and sometimes these are the most interesting ones! If your outlier can’t be attributed to an error, you shouldn’t remove it from the dataset. Removing data values unnecessarily, just because they are outliers, introduces bias and may lead you to draw the wrong conclusions from your study.

#### What should we do if we need/want to keep the outlier?

• Transform the data: if the dataset is not normally distributed, we can try transforming the data to normalize it. For example, if the data set has some high-value outliers (i.e. is right skewed), the log transformation will “pull” the high values in. This often works well for count data.
• Try a different model/analysis: different analyses may make different distributional assumptions, and you should pick one that is appropriate for your data. For example, count data are generally assumed to follow a Poisson distribution. Alternatively, the outliers may be able to be modelled using an appropriate explanatory variable. For example, computer sales may increase as we approach the start of a new school year.

In our example, Vanessa suggested that since the mean for Store 2 is highly influenced by the outliers, the median, another measure of central tendency, seems more appropriate for summarizing the daily sales at each store. Using the statistical software Genstat, Vanessa can easily calculate both the mean and median number of sales per store for Jane.

Vanessa also analyses the data assuming the daily sales have Poisson distributions, by fitting a log-linear model.

Notice that Vanessa has included “Day” as a blocking factor in the model to allow for variability due to temporal effects.

From this analysis, Vanessa and Jane conclude that the means (of the Poisson distributions) differ between the stores (p-value < 0.001). Store 3, on average, has the most computer sales per day, whereas Stores 1 and 4, on average, have the least.

There are other statistical approaches Vanessa might have used to analyse Jane’s sales data, including a one-way ANOVA blocked by Day on the log-transformed sales data and Friedman’s non-parametric ANOVA. Both approaches are available in Genstat’s comprehensive menu system.

#### What is the best method to deal with outliers?

There are many ways to deal with outliers, but no single method will work in every situation. As we have learnt, we can remove an observation if we have evidence it is an error. But, if that is not the case, we can always use alternative summary statistics, or even different statistical approaches, that accommodate them.

The VSNi Team

a month ago
##### What is a p-value?

A way to decide whether to reject the null hypothesis (H0) against our alternative hypothesis (H1) is to determine the probability of obtaining a test statistic at least as extreme as the one observed under the assumption that H0 is true. This probability is referred to as the “p-value”. It plays an important role in statistics and is critical in most biological research. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_p_value_7e04a8f8c5.png) #### **What is the true meaning of a p-value and how should it be used?** P-values are a continuum (between 0 and 1) that provide a measure of the **strength of evidence** against H0. For example, a value of 0.066, will indicate that there is a probability that we could observe values as large or larger than our critical value with a probability of 6.6%. Note that this p-value is NOT the probability that our alternative hypothesis is correct, it is only a measure of how likely or unlikely we are to observe these extreme events, under repeated sampling, in reference to our calculated value. Also note that this p-value is obtained based on an assumed distribution (e.g., t-distribution for a t-test); hence, p-value will depend strongly on your (correct or incorrect) assumptions. The smaller the p-value, the stronger the evidence for rejecting H0. However, it is difficult to determine what a small value really is. This leads to the typical guidelines of: p \< 0.001 indicating very strong evidence against H0, p \< 0.01 strong evidence, p \< 0.05 moderate evidence, p \< 0.1 weak evidence or a trend, and p ≥ 0.1 indicating insufficient evidence $1$, and a strong debate on what this threshold should be. But declaring p-values as being either significant or non-significant based on an arbitrary cut-off (e.g. 0.05 or 5%) should be avoided. As [Ronald Fisher](https://mathshistory.st-andrews.ac.uk/Biographies/Fisher/) said: “No scientific worker has a fixed level of significance at which, from year to year, and in all circumstances he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” $2$. A very important aspect of the p-value is that it **does not** provide any evidence in support of H0 – it only quantifies evidence against H0. That is, a large p-value does not mean we can accept H0. Take care not to fall into the trap of accepting H0! Similarly, a small p-value tells you that rejecting H0 is plausible, and not that H1 is correct! For useful conclusions to be drawn from a statistical analysis, p-values should be considered alongside the **size of the effect**. Confidence intervals are commonly used to describe the size of the effect and the precision of its estimate. Crucially, statistical significance does not necessarily imply practical (or biological) significance. Small p-values can come from a large sample and a small effect, or a small sample and a large effect. It is also important to understand that the size of a p-value depends critically on the sample size (as this affects the shape of our distribution). Here, with a very very large sample size, H0 may be always rejected even with extremely small differences, even if H0 is nearly (i.e., approximately) true. Conversely, with very small sample size, it may be nearly impossible to reject H0 even if we observed extremely large differences. Hence, p-values need to also be interpreted in relation to the size of the study. #### References $1$ Ganesh H. and V. Cave. 2018. _P-values, P-values everywhere!_ New Zealand Veterinary Journal. 66(2): 55-56. $2$ Fisher RA. 1956. _Statistical Methods and Scientific Inferences_. Oliver and Boyd, Edinburgh, UK.

The VSNi Team

2 months ago

Dr. John Rogers

3 months ago
##### 50 years of bioscience statistics

Kanchana Punyawaew and Dr. Vanessa Cave

3 months ago
##### Mixed models for repeated measures and longitudinal data

The term “**repeated measures**” refers to experimental designs or observational studies in which each experimental unit (or subject) is measured repeatedly over time or space. "**Longitudinal data**" is a special case of repeated measures in which variables are measured over time (often for a comparatively long period of time) and duration itself is typically a variable of interest. In terms of data analysis, it doesn’t really matter what type of data you have, as you can analyze both using mixed models. Remember, the key feature of both types of data is that the response variable is measured more than once on each experimental unit, and these repeated measurements are likely to be correlated. ### Mixed Model Approaches To illustrate the use of mixed model approaches for analyzing repeated measures, we’ll examine a data set from Landau and Everitt’s 2004 book, “_A Handbook of Statistical Analyses using SPSS”. Here, a double-blind, placebo-controlled clinical trial was conducted to determine whether an estrogen treatment reduces post-natal depression. Sixty three subjects were randomly assigned to one of two treatment groups: placebo (27 subjects) and estrogen treatment (36 subjects). Depression scores were measured on each subject at baseline, i.e. before randomization (predep_) and at six two-monthly visits after randomization (_postdep_ at visits 1-6). However, not all the women in the trial had their depression score recorded on all scheduled visits. In this example, the data were measured at fixed, equally spaced, time points. (_Visit_ is time as a factor and _nVisit_ is time as a continuous variable.) There is one between-subject factor (_Group_, i.e. the treatment group, either placebo or estrogen treatment), one within-subject factor (_Visit_ or _nVisit_) and a covariate (_predep_). ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/repeated_measures_data_4f63d505a9_20e39072bf.png) Using the following plots, we can explore the data. In the first plot below, the depression scores for each subject are plotted against time, including the baseline, separately for each treatment group. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/repeated_measures_1_4149bce2a1_20e3c0f240.png) In the second plot, the mean depression score for each treatment group is plotted over time. From these plots, we can see variation among subjects within each treatment group that depression scores for subjects generally decrease with time, and on average the depression score at each visit is lower with the estrogen treatment than the placebo. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/repeated_measures_2_92810e7fc9_da9b1e85ff.png) ### Random effects model The simplest approach for [analyzing repeated measures data](https://www.theanalysisfactor.com/repeated-measures-approaches/) is to use a random effects model with _**subject**_ fitted as random. It assumes a constant correlation between all observations on the same subject. The analysis objectives can either be to measure the average treatment effect over time or to assess treatment effects at each time point and to test whether treatment interacts with time. In this example, the treatment (_Group_), time (_Visit_), treatment by time interaction (_Group:Visit_) and baseline (_predep_) effects can all be fitted as fixed. The subject effects are fitted as random, allowing for constant correlation between depression scores taken on the same subject over time. The code and output from fitting this model in [ASReml-R 4](https://www.vsni.co.uk/software/asreml-r) follows; ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/4_020d75dee9.png) ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/5_ef250deb61.png) ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/6_15e353865d.png) The output from summary() shows that the estimate of subject and residual variance from the model are 15.10 and 11.53, respectively, giving a total variance of 15.10 + 11.53 = 26.63. The Wald test (from the wald.asreml() table) for _predep_, _Group_ and _Visit_ are significant (probability level (Pr) ≤ 0.01). There appears to be no relationship between treatment group and time (_Group:Visit_) i.e. the probability level is greater than 0.05 (Pr = 0.8636). ### Covariance model In practice, often the correlation between observations on the same subject is not constant. It is common to expect that the covariances of measurements made closer together in time are more similar than those at more distant times. Mixed models can accommodate many different covariance patterns. The ideal usage is to select the pattern that best reflects the true covariance structure of the data. A typical strategy is to start with a simple pattern, such as compound symmetry or first-order autoregressive, and test if a more complex pattern leads to a significant improvement in the likelihood. Note: using a covariance model with a simple correlation structure (i.e. uniform) will provide the same results as fitting a random effects model with random subject. In ASReml-R 4 we use the corv() function on time (i.e. _Visit_) to specify uniform correlation between depression scores taken on the same subject over time. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/7_3f3a2b825a.png) Here, the estimate of the correlation among times (_Visit_) is 0.57 and the estimate of the residual variance is 26.63 (identical to the total variance of the random effects model, asr1). Specifying a heterogeneous first-order autoregressive covariance structure is easily done in ASReml-R 4 by changing the variance-covariance function in the residual term from corv() to ar1h(). ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/8_27fce61956.png) ### Random coefficients model When the relationship of a measurement with time is of interest, a [random coefficients model](https://encyclopediaofmath.org/wiki/Random_coefficient_models) is often appropriate. In a random coefficients model, time is considered a continuous variable, and the subject and subject by time interaction (_Subject:nVisit_) are fitted as random effects. This allows the slopes and intercepts to vary randomly between subjects, resulting in a separate regression line to be fitted for each subject. However, importantly, the slopes and intercepts are correlated. The str() function of asreml() call is used for fitting a random coefficient model; ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/9_ec27199248.png) The summary table contains the variance parameter for _Subject_ (the set of intercepts, 23.24) and _Subject:nVisit_ (the set of slopes, 0.89), the estimate of correlation between the slopes and intercepts (-0.57) and the estimate of residual variance (8.38). ### References Brady T. West, Kathleen B. Welch and Andrzej T. Galecki (2007). _Linear Mixed Models: A Practical Guide Using Statistical Software_. Chapman & Hall/CRC, Taylor & Francis Group, LLC. Brown, H. and R. Prescott (2015). _Applied Mixed Models in Medicine_. Third Edition. John Wiley & Sons Ltd, England. Sabine Landau and Brian S. Everitt (2004). _A Handbook of Statistical Analyses using SPSS_. Chapman & Hall/CRC Press LLC.

Kanchana Punyawaew

3 months ago
##### Linear mixed models: a balanced lattice square

This blog illustrates how to analyze data from a field experiment with a balanced lattice square design using linear mixed models. We’ll consider two models: the balanced lattice square model and a spatial model. The example data are from a field experiment conducted at Slate Hall Farm, UK, in 1976 (Gilmour _et al_., 1995). The experiment was set up to compare the performance of 25 varieties of barley and was designed as a balanced lattice square with six replicates laid out in a 10 x 15 rectangular grid. Each replicate contained exactly one plot for every variety. The variety grown in each plot, and the coding of the replicates and lattice blocks, is shown in the field layout below: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_layout_7f57633d37_892b6cf234.png) There are seven columns in the data frame: five blocking factors (_Rep, RowRep, ColRep, Row, Column_), one treatment factor, _Variety_, and the response variate, _yield_. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_data_bd9f4ee008_06c8a6e6fc.png) The six replicates are numbered from 1 to 6 (_Rep_). The lattice block numbering is coded within replicates. That is, within each replicates the lattice rows (_RowRep_) and lattice columns (_ColRep_) are both numbered from 1 to 5. The _Row_ and _Column_ factors define the row and column positions within the field (rather than within each replicate). ### Analysis of a balanced lattice square design To analyze the response variable, _yield_, we need to identify the two basic components of the experiment: the treatment structure and the blocking (or design) structure. The treatment structure consists of the set of treatments, or treatment combinations, selected to study or to compare. In our example, there is one treatment factor with 25 levels, _Variety_ (i.e. the 25 different varieties of barley). The blocking structure of replicates (_Rep_), lattice rows within replicates (_Rep:RowRep_), and lattice columns within replicates (_Rep:ColRep_) reflects the balanced lattice square design. In a mixed model analysis, the treatment factors are (usually) fitted as fixed effects and the blocking factors as random. The balanced lattice square model is fitted in [ASReml-R4](https://www.vsni.co.uk/software/asreml-r) using the following code: plaintext &gt; lattice.asr &lt;- asreml(fixed = yield ~ Variety, random = ~ Rep + Rep:RowRep + Rep:ColRep, data=data1)  The REML log-likelihood is -707.786. The model’s BIC is: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_2_ac553eac69_6d6d40e073.jpg) The estimated variance components are: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_3_69e11e2dff_c34641a3a9.jpg) The table above contains the estimated variance components for all terms in the random model. The variance component measures the inherent variability of the term, over and above the variability of the sub-units of which it is composed. The variance components for _Rep_, _Rep:RowRep_ and _Rep:ColRep_ are estimated as 4263, 15596, and 14813, respectively. As is typical, the largest unit (replicate) is more variable than its sub-units (lattice rows and columns within replicates). The _"units!R"_ component is the residual variance. By default, fixed effects in ASReml-R4 are tested using sequential Wald tests: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_4_e237aed045_274881533e.jpg) In this example, there are two terms in the summary table: the overall mean, (_Intercept_), and _Variety_. As the tests are sequential, the effect of the _Variety_ is assessed by calculating the change in sums of squares between the two models (_Intercept_)+_Variety_ and (_Intercept_). The p-value (Pr(Chisq)) of  \< 2.2 x 10-16 indicates that _Variety_ is a highly significant. The predicted means for the _Variety_ can be obtained using the predict() function. The standard error of the difference between any pair of variety means is 62. Note: all variety means have the same standard error as the design is balanced. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_5_575ede3e94_5b9209f7c3.jpg) Note: the same analysis is obtained when the random model is redefined as replicates (_Rep_), rows within replicates (_Rep:Row_) and columns within replicates (_Rep:Column_). ### Spatial analysis of a field experiment As the plots are laid out in a grid, the data can also be analyzed using a spatial model. We’ll illustrate spatial analysis by fitting a model with a separable first order autoregressive process in the field row (_Row_) and field column (_Column_) directions. This is often a useful model to start the spatial modeling process. The separable first order autoregressive spatial model is fitted in ASReml-R4 using the following code: plaintext &gt; spatial.asr &lt;- asreml(fixed = yield ~ Variety, residual = ~ar1(Row):ar1(Column), data = data1)  The BIC for this spatial model is: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_6_3b978358f9_e792bcc2bd.jpg) The estimated variance components and sequential Wald tests are: ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_7_82255b3b94_b5bc40e6ab.jpg) ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/lattice_8_544d852c25_53b792377f.jpg) The residual variance is 38713, the estimated row correlation is 0.458, and the estimated column correlation is 0.684. As for the balanced lattice square model, there is strong evidence of a _Variety_ effect (p-value \< 2.2 x 10-16). A [log-likelihood ratio test](https://www.statisticshowto.com/likelihood-ratio-tests/) cannot be used to compare the balanced lattice square model with the spatial models, as the variance models are not nested. However, the two models can be compared using BIC. As the spatial model has a smaller BIC (1415) than the balanced lattice square model (1435), of the two models explored in this blog, it is chosen as the preferred model. However, selecting the optimal spatial model can be difficult. The current spatial model can be extended by including measurement error (or nugget effect) or revised by selecting a different variance model for the spatial effects. #### References Butler, D.G., Cullis, B.R., Gilmour, A. R., Gogel, B.G. and Thompson, R. (2017). _ASReml-R Reference Manual Version 4._ VSN International Ltd, Hemel Hempstead, HP2 4TP UK. Gilmour, A. R., Anderson, R. D. and Rae, A. L. (1995). _The analysis of binomial data by a generalised linear mixed model_, Biometrika 72: 593-599..

Arthur Bernardeli

5 days ago
##### Accounting for spatial heterogeneity in plant breeding field trials: a path of no return

Some statistical approaches can cope with spatial heterogeneity in different ways, but special attention must be given to the AR1 x AR1 error modeling. This type of spatial analysis can be performed in the ASReml-R package version 4 (Butler et al., 2017), and it is particularly directed at modeling the residual effect of a genetic/statistical model, by estimating the autoregressive correlation of residuals $(\\xi)$ between  columns and rows in a field. This specific random effect can be defined as  $\\mathbf \\xi = \\{\\xi\_m\\}$ ~ $N(0, R)$ and $R = \\sigma\_\\xi^2$ **$**AR1(\\rho\_c)\\otimes AR1 (\\rho\_r)**$**,  and another effect, such as an independent error or local error $(\\eta)$ can be added as another residual term.  A recent study elaborated by Bernardeli et al. (2021) showed the benefits of performing spatial analysis in plant breeding studies. The authors evaluated seed composition traits (protein, oil, and storage protein) in a set of soybean field trials and compared several statistical models within and across trials. The models use were a baseline randomized complete block design (RCB), which is widely used in this type of studies, and four variants considering different spatial-structured residual terms.  Despite the slightly greater computational needs in fitting the analysis, the spatial approaches resulted in greater genetic gains, heritability and accuracy than the baseline model (RCB), and this can be verified in the table below (adapted from Bernardeli et al., 2021).  ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_spatial_heterogeneity_table_e131ddc01a.jpg) It is important to highlight that the analytical criteria of BIC (Bayesian Information Criteria) was chosen to assist on model selection. In cases where the spatial models were chosen based on BIC, the heritability and accuracy were superior. When the baseline model was the one selected, the above genetic parameters remained unchanged.  In summary, plant breeders should keep in mind that: phenotype-based field trial analyses through the use of AR1 x AR1 spatial models are at least equal, but often better, and never worse than traditional analyses with independent errors. **References** Butler, D. G., Cullis, B.R., A. R. Gilmour, Gogel, B.G. and Thompson, R. 2017. ASReml-R Reference Manual Version 4. VSN International Ltd, Hemel Hempstead, HP1 1ES, UK. Bernardeli A, Rocha JRASdC, Borém A, et al. Modeling spatial trends and enhancing genetic selection: An approach to soybean seed composition breeding. _Crop Science_. 2021;1–13. https://doi.org/10.1002/csc2.20364.

The VSNi Team

10 days ago
##### Predicting the Performance of Genotypes by Multi-Environment Trial Analyses

Predicting the performance of novel plant genotypes in different environments is essential for developing cultivars with superior, economically important traits (such as yield) and resilience to environmental stresses (such as drought).  <table style="background-color:hsl(0, 0%, 90%);border-bottom:solid hsl(240, 75%, 60%);border-left:solid hsl(240, 75%, 60%);border-right:solid hsl(240, 75%, 60%);border-top:solid hsl(240, 75%, 60%);"><tbody><tr><td><p><strong>Case Study</strong></p><p>The <a href="https://www.rpbc.co.nz/">Radiata Pine Breeding Company Ltd.</a> runs breeding programmes to improve the productivity and quality of wood from radiata pine trees for the Australasian forestry industry.&nbsp; Candidate pine trees (i.e., different genotypes) are tested in a <span style="color:blue;"><strong>multi-environment trial</strong></span><span style="color:black;">&nbsp;</span>(MET), a series of experiments conducted across a range of geographic locations in New Zealand and Australia, over multiple years. The set of experiments within and across years is designed to provide a range of growing conditions or “<span style="color:blue;"><strong>environments</strong></span>” that differ on climate, soil, and sometimes management. From these trials, the genetic and environmental effects on the performance of the candidate pine trees can be assessed and estimated. This information is then used by radiata pine breeders to select trees for future crossing, with the aim of further genetic improvement, and by growers of radiata pine to select cultivars predicted to perform well at their land.</p></td></tr></tbody></table> Plant breeders use MET data to evaluate genotypes across a range of environments. However, the relative performance of the genotypes often varies from environment to environment, a phenomenon known as the **genotype by environment (G×E) interaction**. This lack of consistency, is a good thing, as it can be exploited to identify genotypes that perform well in all environments (i.e., are suitable for broad use) and those with exceptional performance in specific environments (i.e., are well suited for use in certain growing conditions). These are the concepts of broad and specific adaptability that are common in many breeding programs. <table><tbody><tr><td style="border-bottom:solid hsl(240, 75%, 60%);border-left:solid hsl(240, 75%, 60%);border-right:solid hsl(240, 75%, 60%);border-top:solid hsl(240, 75%, 60%);"><p><strong>A Brief History of MET Analysis</strong></p><ul><li>Original methods use analysis of variance (ANOVA) on the two-way table of genotype by environment means. Here, the total variation is partitioned into sources due to genotype, environment and residual variation (a combination of the G×E interaction and within-trial error). Later, estimates of the average genotype performance across environments are obtained.</li></ul><p><span style="color:#FF3300;"><i><strong>Limitation:</strong></i></span> It does not provide information on the nature and architecture of the G×E interaction.</p><ul><li>A greater emphasis on understanding the G×E interaction lead to the development of the AMMI method and the use of GGE biplots. These descriptive tools are great for visualising the relationships between genotypes and environments.</li></ul><p><span style="color:#FF3300;"><i><strong>Limitation:</strong></i></span> They do not provide simple numerical summaries that are useful for plant selection and they do not handle unbalanced data well.</p><ul><li>Today, linear mixed models (LMM) are widely used for the analysis of MET data, in particular, a linear mixed model with a multiplicative factor analytic (FA) model for the G×E effects focusing on variance components. LMM have been found to perform extremely well in terms of parsimoniously describing the G×E interaction and predictive accuracy.</li></ul><p><span style="color:#FF3300;"><i><strong>Advantages:</strong></i></span> They help to understand the architecture and dynamics of the GxE interaction and handle unbalanced and messy data very well.</p></td></tr></tbody></table> Today, **linear mixed models (LMM)** are widely used in the analysis of MET data. The linear mixed model framework accommodates the analysis of genetically and/or experimentally correlated data, heterogeneous variances and unbalanced data sets simultaneously, enabling the accurate prediction of genotype performance within all environments in the data set. In addition, the ability to formally test statistical hypotheses provides greater insight into the nature of the G×E interaction.  A state-of-the-art method for analysing MET data involves a LMM that adopts factor analytic (FA) variance structures for the G×E effects. The FA model provides a parsimonious, yet flexible, method of describing the G×E interaction. For example, it allows for genetic variance heterogeneity across trials and different genetic correlation between each pair of trials. It can also be extended to include genetic relationship information (e.g., pedigree data) so that the genetic effects can be partitioned into additive and non-additive components. Furthermore, it typically has higher predictive accuracy than alternative models when there is a substantial G×E interaction. In addition, its parametrization for BLUP effects and variance component is useful to identify those genotypes with broad or specific adaptability.   A powerful, efficient and reliable software solution for fitting FA or other complex LMM is ASReml and its R library version [ASReml-R](https://www.vsni.co.uk/software/asreml-r). Both are particularly well suited to the analysis of MET data with a large number of genotypes and trials. ASReml has been widely cited in scientific publications, and it is broadly used by many industries for their commercial operations. ASReml and ASReml-R are popular choices by MET data analysts due to: * Flexible syntax that makes fitting complex linear mixed models possible and easy. * An efficient statistical algorithm for fitting the linear mixed model, which makes it feasible to analyse large and complex data sets. * Strong theoretical and statistical background providing reliable estimation and inference.

The VSNi Team

a month ago
##### Warning! The missing data problem

Missing data are a common problem, even in well-designed and managed research studies. It results in a loss of precision and statistical power, has the potential to cause substantial bias and often complicates the statistical analysis. As data analysts, it is crucial that we understand the reasons for the missing data and apply appropriate methods to account for it. Failure to do so leads to trouble! Let’s consider an example. The following figure illustrates data from a study on the intelligence quotient (IQ) of students living in a city, town, or village. A random selection of students was recruited into the study and their IQs measured. Unfortunately, IQ measurements were not available for all students in the study, resulting in _missing data_ (denoted by NA). To make valid inferences, it is very important that we understand why the data are missing, and what we can do about it. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_missing_data_heads_20f3c4fadd.png) There are three main types of missing data: _**Missing completely at random (MCAR)**_ Here missingness has nothing to do with the subject being studied: instead the missing values are an entirely random subset of the data. In our example, the missingness could be considered MCAR if the missing IQ test results were accidentally deleted by the researcher. _**Missing at random (MAR)**_ Here missingness is not related to the value of the missing data but is related to other observed data. For example, we might find that younger students were less likely to complete the IQ test due to some factor unrelated to their IQ, such as a shorter attention span than older students. This missing data can be considered MAR, as failure to complete the test had nothing to do with their IQ (after accounting for age). _**Missing not at random (MNAR)**_ When the missing data are neither of the above, they are MNAR. In other words, missingness **is** related to the value of the missing data. For example, if failure to complete the IQ test was related to the student’s IQ the missingness would be MNAR. When the data are MCAR, missingness doesn’t induce bias. However, this isn’t the case for MAR and MNAR, and great care needs to be taken to ensure that this does not lead to biased and misleading inferences. ![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_missing_data_warning_bb9c471df5.jpg) Let’s go back to our example.  Assuming the MCAR mechanism, we could analyse the data in [Genstat](https://www.vsni.co.uk/software/genstat) using an unbalanced ANOVA. Assuming that  MAR is conditional on age, an unbalanced ANOVA with age as a covariate would be an appropriate analysis.  But what to do if the missingness is MNAR? This is much more problematical. Indeed, the only way to reduce potential bias is if we can explicitly model the process that generated the missing data. Challenging indeed! As our example illustrates, when faced with missing data we must apply an appropriate statistical method to accommodate it. The choice of method will depend on the nature of the missingness, in addition to the type of data we have, the aims of our analysis, etc. _**But what about imputation?**_ Imputation replaces missing values with estimated values. That is, the dataset is _completed_ by filling in the missing values. Many different methods of imputation exist, including mean substitution, regression imputation, EM algorithm, and last observation carried forward. Be warned however, imputation may not overcome bias - indeed it may also introduce it! In addition, it does not account for the uncertainty about the imputed missing values, resulting in standard errors that are too low. However, if the proportion of missing values in the dataset is small, imputation can be useful. Many statistical methods can handle datasets with missing values e.g., maximum likelihood, [expected maximization](https://www.statisticshowto.com/em-algorithm-expectation-maximization/), [Bayesian models](https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7). Others, such as principal component analysis and some spatial or temporal mixed models, require complete datasets. Imputation allows us to apply statistical methods requiring complete datasets. As a final thought - the best possible solution for missing data is to prevent the problem from occurring. Carefully planning and managing your study will help minimize the amount of missing data (or at least ensure it is MCAR or MAR!).  When you have missing data, use an appropriate statistical technique to accommodate it. Statistical or computational techniques for imputing missing data should be the last resort.