Should I drop the outliers from my analysis?

Should I drop the outliers from my analysis?

The VSNi Team

17 May 2021

Outliers are sample observations that are either much larger or much smaller than the other observations in a dataset. Outliers can skew your dataset, so how should you deal with them?

An example outlier problem

Imagine Jane, the general manager of a chain of computer stores, has asked a statistician, Vanessa, to assist her with the analysis of data on the daily sales at the stores she manages. Vanessa takes a look at the data, and produces a boxplot for each of the stores as shown below.

What do you notice about the data?

Vanessa pointed out to Jane the presence of outliers in the data from Store 2 on days 10 and 22. Vanessa recommended that Jane checks the accuracy of the data. Are the outliers due to recording or measurement error? If the outliers can’t be attributed to errors in the data, Jane should investigate what might have caused the increased sales on these two particular days. Always investigate outliers - this will help you better understand the data, how it was generated and how to analyse it.

Should we remove the outliers?

Vanessa explained to Jane that we should never drop a data value just because it is an outlier. The nature of the outlier should be investigated before deciding what to do.

Whenever there are outliers in the data, we should look for possible causes of error in the data. If you find an error but cannot recover the correct data value, then you should replace the incorrect data value with a missing value.

However, outliers can also be real observations, and sometimes these are the most interesting ones! If your outlier can’t be attributed to an error, you shouldn’t remove it from the dataset. Removing data values unnecessarily, just because they are outliers, introduces bias and may lead you to draw the wrong conclusions from your study.

What should we do if we need/want to keep the outlier?

• Transform the data: if the dataset is not normally distributed, we can try transforming the data to normalize it. For example, if the data set has some high-value outliers (i.e. is right skewed), the log transformation will “pull” the high values in. This often works well for count data.
• Try a different model/analysis: different analyses may make different distributional assumptions, and you should pick one that is appropriate for your data. For example, count data are generally assumed to follow a Poisson distribution. Alternatively, the outliers may be able to be modelled using an appropriate explanatory variable. For example, computer sales may increase as we approach the start of a new school year.

In our example, Vanessa suggested that since the mean for Store 2 is highly influenced by the outliers, the median, another measure of central tendency, seems more appropriate for summarizing the daily sales at each store. Using the statistical software Genstat, Vanessa can easily calculate both the mean and median number of sales per store for Jane.

Vanessa also analyses the data assuming the daily sales have Poisson distributions, by fitting a log-linear model.

Notice that Vanessa has included “Day” as a blocking factor in the model to allow for variability due to temporal effects.

From this analysis, Vanessa and Jane conclude that the means (of the Poisson distributions) differ between the stores (p-value < 0.001). Store 3, on average, has the most computer sales per day, whereas Stores 1 and 4, on average, have the least.

There are other statistical approaches Vanessa might have used to analyse Jane’s sales data, including a one-way ANOVA blocked by Day on the log-transformed sales data and Friedman’s non-parametric ANOVA. Both approaches are available in Genstat’s comprehensive menu system.

What is the best method to deal with outliers?

There are many ways to deal with outliers, but no single method will work in every situation. As we have learnt, we can remove an observation if we have evidence it is an error. But, if that is not the case, we can always use alternative summary statistics, or even different statistical approaches, that accommodate them.

The VSNi Team

27 April 2021

Evolution of statistical computing

It is widely acknowledged that the most fundamental developments in statistics in the past 60+ years are driven by information technology (IT). We should not underestimate the importance of pen and paper as a form of IT but it is since people start using computers to do statistical analysis that we really changed the role statistics plays in our research as well as normal life.

In this blog we will give a brief historical overview, presenting some of the main general statistics software packages developed from 1957 onwards. Statistical software developed for special purposes will be ignored. We also ignore the most widely used ‘software for statistics’ as Brian Ripley (2002) stated in his famous quote: “Let’s not kid ourselves: the most widely used piece of software for statistics is Excel.” Our focus is some of the packages developed by statisticians for statisticians, which are still evolving to incorporate the latest development of statistics.

Ronald Fisher’s Calculating Machines

Pioneer statisticians like Ronald Fisher started out doing their statistics on pieces of paper and later upgraded to using calculating machines. Fisher bought the first Millionaire calculating machine when he was heading Rothamsted Research’s statistics department in the early 1920s. It cost about £200 at that time, which is equivalent in purchasing power to about £9,141 in 2020. This mechanical calculator could only calculate direct product, but it was very helpful for the statisticians at that time as Fisher mentioned: "Most of my statistics has been learned on the machine." The calculator was heavily used by Fisher’s successor Frank Yates (Head of Department 1933-1968) and contributed to much of Yates’ research, such as designs with confounding between treatment interactions and blocks, or split plots, or quasi-factorials.

Frank Yates

Rothamsted Annual Report for 1952: "The analytical work has again involved a very considerable computing effort."

Beginning of the Computer Age

From the early 1950s we entered the computer age. The computer at this time looked little like its modern counterpart, whether it was an Elliott 401 from the UK or an IBM 700/7000 series in the US. Although the first documented statistical package, BMDP, was developed starting in 1957 for IBM mainframes at the UCLA Health Computing Facility, on the other side of the Atlantic Ocean statisticians at Rothamsted Research began their endeavours to program on an Elliot 401 in 1954.

Programming Statistical Software

When we teach statistics in schools or universities, students very often complain about the difficulties of programming. Looking back at programming in the 1950s will give modern students an appreciation of how easy programming today actually is!

An Elliott 401 served one user at a time and requested all input on paper tape (forget your keyboard and intelligent IDE editor). It provided the output to an electric typewriter. All programming had to be in machine code with the instructions and data on a rotating disk with 32-bit word length, 5 "words" of fast-access store, 7 intermediate access tracks of 128 words, 16 further tracks selectable one at a time (= 2949 words – 128 for system).

Computer paper tape

fitting constants to main effects and interactions in multi-way tables (1957), regression and multiple regression (1956), fitting many standard curves as well as multivariate analysis for latent roots and vectors (1955).

Although it sounds very promising with the emerging of statistical programs for research, routine statistical analyses were also performed and these still represented a big challenge, at least computationally. For example, in 1963, which was the last year with the Elliott 401 and Elliott 402 computers, Rothamsted Research statisticians analysed 14,357 data variables, and this took them 4,731 hours to complete the job. It is hard to imagine the energy consumption as well as the amount of paper tape used for programming. Probably the paper tape (all glued together) would be long enough to circle the equator.

Development of Statistical Software: Genstat, SAS, SPSS

The above collection of programs was mainly used for agricultural research at Rothamsted and was not given an umbrella name until John Nelder became Head of the Statistics Department in 1968. The development of Genstat (General Statistics) started from that year and the programming was done in FORTRAN, initially on an IBM machine. In that same year, at North Carolina State University, SAS (Statistical Analysis Software) was almost simultaneously developed by computational statisticians, also for analysing agricultural data to improve crop yields. At around the same time, social scientists at the University of Chicago started to develop SPSS (Statistical Package for the Social Sciences). Although the three packages (Genstat, SAS and SPSS) were developed for different purposes and their functions diverged somewhat later, the basic functions covered similar statistical methodologies.

The first version of SPSS was released in 1968. In 1970, the first version of Genstat was released with the functions of ANOVA, regression, principal components and principal coordinate analysis, single-linkage cluster analysis and general calculations on vectors, matrices and tables. The first version of SAS, SAS 71, was released and named after the year of its release. The early versions of all three software packages were written in FORTRAN and designed for mainframe computers.

Since the 1980s, with the breakthrough of personal computers, a second generation of statistical software began to emerge. There was an MS-DOS version of Genstat (Genstat 4.03) released with an interactive command line interface in 1980.

Genstat 4.03 for MSDOS

Around 1985, SAS and SPSS also released a version for personal computers. In the 1980s more players entered this market: STATA was developed from 1985 and JMP was developed from 1989. JMP was, from the very beginning, for Macintosh computers. As a consequence, JMP had a strong focus on visualization as well as graphics from its inception.

The Rise of the Statistical Language R

The development of the third generation of statistical computing systems had started before the emergence of software like Genstat 4.03e or SAS 6.01. This development was led by John Chambers and his group in Bell Laboratories since the 1970s. The outcome of their work is the S language. It had been developed into a general purpose language with implementations for classical as well as modern statistical inferences. S language was freely available, and its audience was mainly sophisticated academic users. After the acquisition of S language by the Insightful Corporation and rebranding as S-PLUS, this leading third generation statistical software package was widely used in both theoretical and practical statistics in the 1990s, especially before the release of a stable beta version of the free and open-source software R in the year 2000. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently widely used by statisticians in academia and industry, together with statistical software developers, data miners and data analysts.

Software like Genstat, SAS, SPSS and many other packages had to deal with the challenge from R. Each of these long-standing software packages developed an R interface R or even R interpreters to anticipate the change of user behaviour and ever-increasing adoption of the R computing environment. For example, SAS and SPSS have some R plug-ins to talk to each other. VSNi’s ASReml-R software was developed for ASReml users who want to run mixed model analysis within the R environment, and at the present time there are more ASReml-R users than ASReml standalone users. Users who need reliable and robust mixed effects model fitting adopted ASReml-R as an alternative to other mixed model R packages due to its superior performance and simplified syntax. For Genstat users, msanova was also developed as an R package to provide traditional ANOVA users an R interface to run their analysis.

What’s Next?

We have no clear idea about what will represent the fourth generation of statistical software. R, as an open-source software and a platform for prototyping and teaching has the potential to help this change in statistical innovation. An example is the R Shiny package, where web applications can be easily developed to provide statistical computing as online services. But all open-source and commercial software has to face the same challenges of providing fast, reliable and robust statistical analyses that allow for reproducibility of research and, most importantly, use sound and correct statistical inference and theory, something that Ronald Fisher will have expected from his computing machine!

The VSNi Team

Last Wednesday at 7:16 AM

ASRgenomics: filling the gap on processing molecular data for quantitative genetics

Most breeding programs are supported by an array of genomic information that will provide new options to increase the rates of genetic gain. However, performing statistical analyses with molecular data can be a difficult task. This type of data has to communicate properly with available phenotypic and pedigree data. The overall success of this integration depends on a set of checks, verifications, filters, and careful preparation of all these datasets in order to be able to fit genetic models successfully and to obtain the required output to make correct decisions.

The workflow of molecular data-driven analysis varies based on the source of the datasets and of course, on personal preferences. Nevertheless, regardless of these aspects, an efficient genomics pipeline should rely on answering some of the following questions:

• How to filter out bad quality markers?
• How to remove redundant marker information (e.g., pruning)?
• How to check the genotypes sample for underlying population structure?
• Which algorithms to use for generating a genomic relationship matrix (GRM)?
• Is the quality and reliability of the GRM suitable for an analysis (i.e., are there duplicates or other inconsistencies)?
• How to modify a GRM if there are duplicates or other inconsistencies?
• How to eliminate bias in a GRM using pedigree information?
• How to combine the GRM with the pedigree to obtain the hybrid matrix H used in ssGBLUP?
• How to obtain a well-conditioned inverse of the GRM?
• How to assess if the inverse of a GRM is good enough for genomics modeling?
• How to efficiently subset and match my datasets (phenotypic, molecular, etc.)?

We developed ASRgenomics to help deal with the above questions. This is a free to use R library which can be downloaded from the ASReml knowledgbase. It is a compilation of proven routines developed over several years of study and hands-on experience in the field. ASRgenomics was built with advanced statistical modeling in mind and it fills a gap by helping you make sure your analyses are as efficient and accurate as they can be with several explicit diagnostic tools.

The package is aimed at geneticists and breeders with the purpose of improving their experience with genomic analyses, such as Genomic Selection (GS) and Genome Wide Association Studies (GWAS), in a straightforward and efficient manner. The main capabilities of the package include:

• Preparing and exploring pedigree, phenotypic and genomic data.
• Calculating and evaluating genomic matrices and their inverse.
• Complementing and expanding results from genomic analyses.

The functions included within ASRgenomics are very flexible and can be used for a tailored workflow from raw molecular data to well-behaved model-ready matrices. Additionally, ASRgenomics is capable of seamlessly preparing genomic datasets for integration with ASReml-R to fit linear mixed models (LMMs; e.g., GBLUP or ssGBLUP).

Please try this free library and check out the user guide included withinin the doc folder inside the download package for a walk-though of the features along with details of the methods.

Dr. Vanessa Cave

10 May 2022

The essential role of statistical thinking in animal ethics: dealing with reduction

Having spent over 15 years working as an applied statistician in the biosciences, I’ve come across my fair-share of animal studies. And one of my greatest bugbears is that the full value is rarely extracted from the experimental data collected. This could be because the best statistical approaches haven’t been employed to analyse the data, the findings are selectively or incorrectly reported, other research programmes that could benefit from the data don’t have access to it, or the data aren’t re-analysed following the advent of new statistical methods or tools that have the potential to draw greater insights from it.

An enormous number of scientific research studies involve animals, and with this come many ethical issues and concerns. To help ensure high standards of animal welfare in scientific research, many governments, universities, R&D companies, and individual scientists have adopted the principles of the 3Rs: Replacement, Reduction and Refinement. Indeed, in many countries the tenets of the 3Rs are enshrined in legislation and regulations around the use of animals in scientific research.

Replacement

Use methods or technologies that replace or avoid the use of animals.

Reduction

Limit the number of animals used.

Refinement

Refine methods in order to minimise or eliminate negative animal welfare impacts.

In this blog, I’ll focus on the second principle, Reduction, and argue that statistical expertise is absolutely crucial for achieving reduction.

The aim of reduction is to minimise the number of animals used in scientific research whilst balancing against any additional adverse animal welfare impacts and without compromising the scientific value of the research. This principle demands that before carrying out an experiment (or survey) involving animals, the researchers must consider and implement approaches that both:

1. Minimise their current animal use – the researchers must consider how to minimise the number of animals in their experiment whilst ensuring sufficient data are obtained to answer their research questions, and
2. Minimise future animal use – the researchers need to consider how to maximise the information obtained from their experiment in order to potentially limit, or avoid, the subsequent use of additional animals in future research.

Both these considerations involve statistical thinking. Let’s begin by exploring the important role statistics plays in minimising current animal use.

Statistical aspects to minimise current animal use

Reduction requires that any experiment (or survey) carried out must use as few animals as possible. However, with too few animals the study will lack the statistical power to draw meaningful conclusions, ultimately wasting animals. But how do we determine how many animals are needed for a sufficiently powered experiment? The necessary starting point is to establish clearly defined, specific research questions. These can then be formulated into appropriate statistical hypotheses, for which an experiment (or survey) can be designed.

Statistical expertise in experimental design plays a pivotal role in ensuring enough of the right type of data are collected to answer the research questions as objectively and as efficiently as possible. For example, sophisticated experimental designs involving blocking can be used to reduce random variation, making the experiment more efficient (i.e., increase the statistical power with fewer animals) as well as guarding against bias. Once a suitable experimental design has been decided upon, a power analysis can be used to calculate the required number of animals (i.e., determine the sample size). Indeed, a power analysis is typically needed to obtain animal ethics approval - a formal process in which the benefits of the proposed research is weighed up against the likely harm to the animals.

Researchers also need to investigate whether pre-existing sources of information or data could be integrated into their study, enabling them to reduce the number of animals required. For example, by means of a meta-analysis. At the extreme end, data relevant to the research questions may already be available, eradicating the need for an experiment altogether!

Statistical aspects to minimise future animal use: doing it right the first time

An obvious mechanism for minimising future animal use is to ensure we do it right the first time, avoiding the need for additional experiments. This is easier said than done; there are many statistical and practical considerations at work here. The following paragraphs cover four important steps in experimental research in which statistical expertise plays a major role: data acquisition, data management, data analysis and inference.

Above, I alluded to the validity of the experimental design. If the design is flawed, the data collected will be compromised, if not essentially worthless. Two common mistakes to avoid are pseudo-replication and the lack of (or poor) randomisation. Replication and randomisation are two of the basic principles of good experimental design. Confusing pseudo-replication (either at the design or analysis stage) for genuine replication will lead to invalid statistical inferences. Randomisation is necessary to ensure the statistical inference is valid and for guarding against bias.

Another extremely important consideration when designing an experiment, and setting the sample size, is the risk and impact of missing data due, for example, to animal drop-out or equipment failure. Missing data results in a loss of statistical power, complicates the statistical analysis, and has the potential to cause substantial bias (and potentially invalidate any conclusions). Careful planning and management of an experiment will help minimise the amount of missing data. In addition, safe-guards, controls or contingencies could be built into the experimental design that help mitigate against the impact of missing data. If missing data does result, appropriate statistical methods to account for it must be applied. Failure to do so could invalidate the entire study.

It is also important that the right data are collected to answer the research questions of interest. That is, the right response and explanatory variables measured at the appropriate scale and frequency. There are many statistical related-questions the researchers must answer, including: what population do they want to make inference about? how generalisable do they need their findings to be? what controllable and uncontrollable variables are there? Answers to these questions not only affects enrolment of animals into the study, but also the conditions they are subjected to and the data that should be collected.

It is essential that the data from the experiment (including meta-data) is appropriately managed and stored to protect its integrity and ensure its usability. If the data get messed up (e.g., if different variables measured on the same animal cannot be linked), is undecipherable (e.g., if the attributes of the variables are unknown) or is incomplete (e.g., if the observations aren’t linked to the structural variables associated with the experimental design), the data are likely worthless. Statisticians can offer invaluable expertise in good data management practices, helping to ensure the data are accurately recorded, the downstream results from analysing the data are reproducible and the data itself is reusable at a later date, by possibly a different group of researchers.

Unsurprisingly, it is also vitally important that the data are analysed correctly, using the methods that draw the most value from it. As expected, statistical expertise plays a huge role here! The results and inference are meaningful only if appropriate statistical methods are used. Moreover, often there is a choice of valid statistical approaches; however, some approaches will be more powerful or more precise than others.

Having analysed the data, it is important that the inference (or conclusions) drawn are sound. Again, statistical thinking is crucial here. For example, in my experience, one all too common mistake in animal studies is to accept the null hypothesis and erroneously claim that a non-significant result means there is no difference (say, between treatment means).

Statistical aspects to minimise future animal use: sharing the value from the experiment

The other important mechanism for minimising future animal use is to share the knowledge and information gleaned. The most basic step here is to ensure that all the results are correctly and non-selectively reported. Reporting all aspects of the trial, including the experimental design and statistical analysis, accurately and completely is crucial for the wider interpretation of the findings, reproducibility and repeatability of the research, and for scientific scrutiny. In addition, all results, including null results, are valuable and should be shared.

Sharing the data (or resources, e.g., animal tissues) also contributes to reduction. The data may be able to be re-used for a different purpose, integrated with other sources of data to provide new insights, or re-analysed in the future using a more advanced statistical technique, or for a different hypothesis.

Statistical aspects to minimise future animal use: maximising the information obtained from the experiment

Another avenue that should also be explored is whether additional data or information can be obtained from the experiment, without incurring any further adverse animal welfare impacts, that could benefit other researchers and/or future studies. For example, to help address a different research question now or in the future. At the outset of the study, researchers must consider whether their proposed study could be combined with another one, whether the research animals could be shared with another experiment (e.g., animals euthanized for one experiment may provide suitable tissue for use in another), what additional data could be collected that may (or is!) of future use, etc.

Statistical thinking clearly plays a fundamental role in reducing the number of animals used in scientific research, and in ensuring the most value is drawn from the resulting data. I strongly believe that statistical expertise must be fully utilised through the duration of the project, from design through to analysis and dissemination of results, in all research projects involving animals to achieving reduction. In my experience, most researchers strive for very high standards of animal ethics, and absolutely do not want to cause unnecessary harm to animals. Unfortunately, the role statistical expertise plays here is not always appreciated or taken advantage of. So next time you’re thinking of undertaking research involving animals, ensure you have expert statistical input!