# Using blocking to improve precision and avoid bias

The VSNi Team

13 July 2021

When conducting an experiment, an important consideration is how to even out the variability among the experimental units to make comparisons between the treatments fair and precise. Ideally, we should try to minimize the variability by carefully controlling the conditions under which we conduct the experiment. However, there are many situations where the experimental units are non-uniform. For example:

• in a field experiment laid out on a slope, the plots at the bottom of the slope may be more fertile than the plots at the top,
• in a medical trial, the weight and age of subjects may vary.

When you know there are differences between the experimental units (and these differences may potentially affect your response), you can improve precision and avoid bias by blocking. Blocking involves grouping the experimental units into more-or-less homogeneous groups, so that the experimental units within each block are as alike as possible. For example, in the field experiment described above, plots would be blocked (i.e., grouped) according to slope, and in the medical trial, subjects would be blocked into groups of similar weight and age. Once the blocks are formed, the treatments are then randomized to the experimental units within each block.

 Blocking is used to control nuisance variation by creating homogeneous groups of experimental units, known as blocks.

Blocking can improve precision

Let’s look at an example[1] to see how blocking improves the precision of an experiment by reducing the unexplained variation. In this field trial, the yields (pounds per plot) of four strains of Gallipoli wheat were studied. During the design phase, the 20 experimental plots were grouped into five blocks (each containing 4 plots). Within each block, the four wheat strains (A, B, C and D) were randomly assigned to the plots. This is an example of a randomized complete block design (RCBD).

 In a randomized complete block design (RCBD):

• the experimental units are grouped into homogeneous blocks,
• each block has the same number of experimental units, usually one for each treatment,
• within each block, the treatments are randomly allocated to the experimental units so that each treatment occurs in each block the same number of times.
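The randomization step of an RCBD can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used for the wheat trial: the function name `randomize_rcbd` and the seed are assumptions made for the example, and the layout it prints is just one of many possible randomizations.

```python
import random

def randomize_rcbd(treatments, n_blocks, seed=None):
    """Return one RCBD layout: each block is a random ordering of all treatments."""
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    layout = []
    for _ in range(n_blocks):
        plots = list(treatments)   # every treatment appears once per block
        rng.shuffle(plots)         # randomize the order within the block
        layout.append(plots)
    return layout

strains = ["A", "B", "C", "D"]
layout = randomize_rcbd(strains, n_blocks=5, seed=1)
for block_number, plots in enumerate(layout, start=1):
    print(f"Block {block_number}: {' '.join(plots)}")
```

Because the shuffle is carried out separately within each block, every strain is guaranteed to occur exactly once in every block.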

To demonstrate the advantage of blocking, we’ll analyse the data in Genstat[2] as both a completely randomized design (CRD, which ignores the blocking) and as an RCBD (which takes the blocking into account). One of the assumptions behind a CRD is that the set of experimental units to which the treatments are applied is effectively homogeneous.

[Figures: Genstat ANOVA tables for the CRD analysis and the RCBD analysis]

The ANOVA tables from the two analyses are given above. Notice that the ANOVA table for the RCBD has an additional line, “Blocks stratum”. This records the variation between blocks. Also note that the treatment effects (i.e., strains) are now estimated in the “Blocks.*Units* stratum”, which represents the variation within blocks. As a result:

A: the residual mean square (i.e., the unexplained variation) has decreased from 2.983 to 2.188

B: the standard error of the difference (s.e.d.) has decreased from 1.092 to 0.936.

That is, blocking has improved the precision of the experiment! This increase in precision means that we have a better chance of detecting differences between the wheat strains: the experiment is more efficient and has greater statistical power.
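The two s.e.d. values quoted above are consistent with the usual formula for the standard error of the difference between two treatment means, each based on r replicates: s.e.d. = sqrt(2 × residual mean square / r). The sketch below assumes that formula applies here, with r = 5 (one plot per strain in each of the five blocks), and simply plugs in the residual mean squares reported in the two ANOVA tables.

```python
import math

def sed(residual_mean_square, replicates):
    """Standard error of the difference between two treatment means."""
    return math.sqrt(2 * residual_mean_square / replicates)

replicates = 5  # each strain occurs once in each of the 5 blocks

sed_crd = sed(2.983, replicates)   # residual mean square ignoring blocks
sed_rcbd = sed(2.188, replicates)  # residual mean square after removing block variation

print(f"s.e.d. (CRD):  {sed_crd:.3f}")
print(f"s.e.d. (RCBD): {sed_rcbd:.3f}")
```

A smaller residual mean square feeds directly into a smaller s.e.d., which is why removing the between-block variation sharpens the comparison between strains.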

 If you suspect that certain groups of experimental units may differ from each other, you can always use those groups as a blocking factor. If the differences do appear, your estimated treatment effects will be more precise than if you had not included blocking in the statistical model.
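To see where the gain comes from, it can help to carry out the sums of squares by hand. The sketch below uses a small set of hypothetical yields (three blocks by three treatments, invented for illustration and not the wheat data) and compares the residual mean square with and without a block term.

```python
# Hypothetical yields: rows are blocks, columns are treatments (illustrative values only).
yields = [
    [10.0, 12.0, 11.0],  # block 1
    [14.0, 15.0, 16.0],  # block 2
    [18.0, 21.0, 19.0],  # block 3
]
n_blocks = len(yields)
n_treatments = len(yields[0])
n = n_blocks * n_treatments

grand_mean = sum(sum(row) for row in yields) / n
block_means = [sum(row) / n_treatments for row in yields]
treatment_means = [sum(row[j] for row in yields) / n_blocks
                   for j in range(n_treatments)]

ss_total = sum((y - grand_mean) ** 2 for row in yields for y in row)
ss_treatments = n_blocks * sum((m - grand_mean) ** 2 for m in treatment_means)
ss_blocks = n_treatments * sum((m - grand_mean) ** 2 for m in block_means)

# CRD: everything not explained by treatments is treated as residual.
ss_resid_crd = ss_total - ss_treatments
df_resid_crd = n - n_treatments

# RCBD: block-to-block variation is removed from the residual.
ss_resid_rcbd = ss_total - ss_treatments - ss_blocks
df_resid_rcbd = (n_blocks - 1) * (n_treatments - 1)

ms_resid_crd = ss_resid_crd / df_resid_crd
ms_resid_rcbd = ss_resid_rcbd / df_resid_rcbd

print(f"Residual mean square, CRD:  {ms_resid_crd:.3f}")
print(f"Residual mean square, RCBD: {ms_resid_rcbd:.3f}")
```

In this made-up example the blocks differ a lot, so most of the CRD residual is really block variation. When blocks barely differ, the residual sum of squares changes little while the residual degrees of freedom still drop.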

Blocking can protect against bias

Let’s look at an example to see how blocking can guard against bias by evening out the variability among experimental units.

Imagine you want to test a new manufacturing process at your factory by measuring levels of daily productivity over four weeks. However, experience tells you that production levels tend to be lower on Thursdays and Fridays than earlier in the week, as employees’ thoughts turn to going home for the weekend. Let’s consider what might happen should you simply randomly select 10 days to use the old manufacturing process and 10 days to use the new. The following table represents one possible randomization:

Notice that by not controlling for day of the week, the new manufacturing process is (randomly) over-represented on days where production naturally tends to be higher, whereas the old manufacturing process is (randomly) over-represented on Thursdays and Fridays, where production naturally tends to be lower, resulting in an unfair comparison.

Conversely, had you blocked by day of the week, the inherent differences between days would be evened out and the bias they can potentially cause would no longer be an issue. For example, we can have a randomization like:

Note that every treatment (manufacturing process) occurs the same number of times every day. That is, we have a balanced experiment that controls for bias due to day difference. Hence, any resulting production increase or decrease can be more confidently attributed to the manufacturing process used.
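The contrast between the two randomization schemes can be sketched as follows. The function names, the seed and the 4-week, 5-day calendar are assumptions made for illustration. The sketch counts how many times the new process lands on each weekday under unrestricted randomization and under blocking by day.

```python
import random
from collections import Counter

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
WEEKS = 4

def unrestricted(rng):
    """Assign 10 'old' and 10 'new' days completely at random over the 20 days."""
    labels = ["old"] * 10 + ["new"] * 10
    rng.shuffle(labels)
    slots = [(week, day) for week in range(1, WEEKS + 1) for day in DAYS]
    return dict(zip(slots, labels))

def blocked_by_day(rng):
    """Within each weekday (block), randomly assign 2 'old' and 2 'new' across the weeks."""
    plan = {}
    for day in DAYS:
        labels = ["old", "old", "new", "new"]
        rng.shuffle(labels)
        for week, label in zip(range(1, WEEKS + 1), labels):
            plan[(week, day)] = label
    return plan

def new_process_counts(plan):
    """Count how often the new process falls on each weekday."""
    counts = Counter(day for (week, day), label in plan.items() if label == "new")
    return {day: counts[day] for day in DAYS}

rng = random.Random(2021)  # hypothetical seed for reproducibility
print("Unrestricted:", new_process_counts(unrestricted(rng)))
print("Blocked:     ", new_process_counts(blocked_by_day(rng)))
```

Under blocking, the count is 2 for every weekday by construction. Under unrestricted randomization the counts vary from run to run, which is how the new process can end up concentrated on the more productive days.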

As shown above, blocking and randomization are critical aspects of good experimental design, providing us with increased precision and protection against bias.

[1] Snedecor, G.W. (1946). Statistical methods. The Iowa State College Press, Ames, Iowa, USA.

[2] This data set can be accessed from within Genstat. From the menu select File | Open Example Data Sets then type “Wheatstrains.gsh” and click Open.

The VSNi Team

04 May 2021

What is a p-value?

A way to decide whether to reject the null hypothesis (H0) against our alternative hypothesis (H1) is to determine the probability of obtaining a test statistic at least as extreme as the one observed under the assumption that H0 is true. This probability is referred to as the “p-value”. It plays an important role in statistics and is critical in most biological research.

#### What is the true meaning of a p-value and how should it be used?

P-values are a continuum (between 0 and 1) that provide a measure of the strength of evidence against H0. For example, a p-value of 0.066 indicates that, if H0 were true, there would be a 6.6% probability of observing a test statistic as large as or larger than the one we calculated. Note that this p-value is NOT the probability that our alternative hypothesis is correct; it is only a measure of how likely or unlikely we are to observe such extreme values, under repeated sampling, relative to our calculated value. Also note that the p-value is obtained from an assumed distribution (e.g., the t-distribution for a t-test); hence, the p-value depends strongly on your (correct or incorrect) assumptions.
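One way to make the definition concrete is an exact permutation test, which computes the proportion of outcomes at least as extreme as the observed one directly instead of relying on an assumed distribution such as the t-distribution. This is a different technique from the t-test mentioned above, used here only because it exposes the definition; the two groups of measurements are hypothetical.

```python
from itertools import combinations

# Hypothetical measurements for two groups (illustrative values only).
group_a = [12.1, 14.3, 13.8, 15.2]
group_b = [10.4, 11.9, 12.5, 11.1]

pooled = group_a + group_b
n_a = len(group_a)
observed = sum(group_a) / n_a - sum(group_b) / len(group_b)

# Under H0 the group labels are exchangeable, so every way of choosing
# n_a of the pooled values to act as "group A" is equally likely.
extreme = 0
total = 0
for chosen in combinations(range(len(pooled)), n_a):
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    difference = sum(a) / len(a) - sum(b) / len(b)
    # Two-sided: count relabellings at least as extreme as what we observed
    # (the small tolerance guards against floating-point ties).
    if abs(difference) >= abs(observed) - 1e-12:
        extreme += 1
    total += 1

p_value = extreme / total
print(f"{extreme} of {total} relabellings are at least as extreme; p = {p_value:.4f}")
```

The p-value is the fraction of relabellings, all equally likely under H0, that give a difference in means at least as large in magnitude as the observed one. It says nothing about the probability that H1 is true.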

“No scientific worker has a fixed level of significance at which, from year to year, and in all circumstances he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” [2].

A very important aspect of the p-value is that it does not provide any evidence in support of H0 – it only quantifies evidence against H0. That is, a large p-value does not mean we can accept H0. Take care not to fall into the trap of accepting H0! Similarly, a small p-value tells you that rejecting H0 is plausible, and not that H1 is correct!

For useful conclusions to be drawn from a statistical analysis, p-values should be considered alongside the size of the effect. Confidence intervals are commonly used to describe the size of the effect and the precision of its estimate. Crucially, statistical significance does not necessarily imply practical (or biological) significance. Small p-values can come from a large sample and a small effect, or a small sample and a large effect.

It is also important to understand that the size of a p-value depends critically on the sample size (as this affects the shape of our distribution). With a very large sample size, H0 may almost always be rejected, even when the differences are extremely small and H0 is nearly (i.e., approximately) true. Conversely, with a very small sample size, it may be nearly impossible to reject H0 even if we observe extremely large differences. Hence, p-values also need to be interpreted in relation to the size of the study.
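The dependence on sample size is easy to demonstrate numerically. The sketch below assumes a two-sample z-test with a known standard deviation, which is a simplification chosen so that only the standard library is needed, and holds a small hypothetical difference fixed while the number of observations per group grows.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def p_value_two_sample(mean_difference, sd, n_per_group):
    """p-value for a difference between two group means, assuming a known common sd."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    return two_sided_p_from_z(mean_difference / standard_error)

# Hypothetical effect: a difference of 0.1 standard deviations.
for n in (10, 100, 1000, 10000):
    p = p_value_two_sample(mean_difference=0.1, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>5}: p = {p:.3g}")
```

The observed difference is identical in every row; only the sample size changes, yet the p-value moves from clearly non-significant to extremely small.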

#### References

[1] Ganesh H. and V. Cave. 2018. P-values, P-values everywhere! New Zealand Veterinary Journal. 66(2): 55-56.

[2] Fisher RA. 1956. Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh, UK.

Dr. Vanessa Cave

10 May 2022

The essential role of statistical thinking in animal ethics: dealing with reduction

Having spent over 15 years working as an applied statistician in the biosciences, I’ve come across my fair share of animal studies. And one of my greatest bugbears is that the full value is rarely extracted from the experimental data collected. This could be because the best statistical approaches haven’t been employed to analyse the data, the findings are selectively or incorrectly reported, other research programmes that could benefit from the data don’t have access to it, or the data aren’t re-analysed following the advent of new statistical methods or tools that have the potential to draw greater insights from it.

An enormous number of scientific research studies involve animals, and with this come many ethical issues and concerns. To help ensure high standards of animal welfare in scientific research, many governments, universities, R&D companies, and individual scientists have adopted the principles of the 3Rs: Replacement, Reduction and Refinement. Indeed, in many countries the tenets of the 3Rs are enshrined in legislation and regulations around the use of animals in scientific research.

#### Replacement

Use methods or technologies that replace or avoid the use of animals.

#### Reduction

Limit the number of animals used.

#### Refinement

Refine methods in order to minimise or eliminate negative animal welfare impacts.

In this blog, I’ll focus on the second principle, Reduction, and argue that statistical expertise is absolutely crucial for achieving reduction.

The aim of reduction is to minimise the number of animals used in scientific research whilst balancing against any additional adverse animal welfare impacts and without compromising the scientific value of the research. This principle demands that before carrying out an experiment (or survey) involving animals, the researchers must consider and implement approaches that both:

1. Minimise their current animal use – the researchers must consider how to minimise the number of animals in their experiment whilst ensuring sufficient data are obtained to answer their research questions, and
2. Minimise future animal use – the researchers need to consider how to maximise the information obtained from their experiment in order to potentially limit, or avoid, the subsequent use of additional animals in future research.

Both these considerations involve statistical thinking. Let’s begin by exploring the important role statistics plays in minimising current animal use.

### Statistical aspects to minimise current animal use

Reduction requires that any experiment (or survey) carried out must use as few animals as possible. However, with too few animals the study will lack the statistical power to draw meaningful conclusions, ultimately wasting animals. But how do we determine how many animals are needed for a sufficiently powered experiment? The necessary starting point is to establish clearly defined, specific research questions. These can then be formulated into appropriate statistical hypotheses, for which an experiment (or survey) can be designed.

Statistical expertise in experimental design plays a pivotal role in ensuring enough of the right type of data are collected to answer the research questions as objectively and as efficiently as possible. For example, sophisticated experimental designs involving blocking can be used to reduce random variation, making the experiment more efficient (i.e., increasing the statistical power with fewer animals) as well as guarding against bias. Once a suitable experimental design has been decided upon, a power analysis can be used to calculate the required number of animals (i.e., determine the sample size). Indeed, a power analysis is typically needed to obtain animal ethics approval, a formal process in which the benefits of the proposed research are weighed up against the likely harm to the animals.
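As a rough illustration of a power analysis, the sketch below uses the standard normal-approximation formula for comparing two group means, n per group = 2 × ((z for 1 − α/2 + z for power) × σ / δ)². The function name and all input values are hypothetical, and the normal approximation tends to give slightly smaller numbers than a calculation based on the t-distribution, so treat the result as a starting point, not a final answer.

```python
import math
from statistics import NormalDist

def animals_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) * sd / difference) ** 2
    return math.ceil(n)

# Hypothetical scenario: detect a difference of 1 unit.
print("sd = 1.0:", animals_per_group(difference=1.0, sd=1.0))
# If a better design (e.g., blocking) reduced the unexplained sd to 0.8:
print("sd = 0.8:", animals_per_group(difference=1.0, sd=0.8))
```

The second call shows why design matters for reduction: lowering the unexplained variation lowers the number of animals needed to achieve the same power.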

Researchers also need to investigate whether pre-existing sources of information or data could be integrated into their study (for example, by means of a meta-analysis), enabling them to reduce the number of animals required. At the extreme end, data relevant to the research questions may already be available, eliminating the need for an experiment altogether!

### Statistical aspects to minimise future animal use: doing it right the first time

An obvious mechanism for minimising future animal use is to ensure we do it right the first time, avoiding the need for additional experiments. This is easier said than done; there are many statistical and practical considerations at work here. The following paragraphs cover four important steps in experimental research in which statistical expertise plays a major role: data acquisition, data management, data analysis and inference.

Above, I alluded to the validity of the experimental design. If the design is flawed, the data collected will be compromised, if not essentially worthless. Two common mistakes to avoid are pseudo-replication and the lack of (or poor) randomisation. Replication and randomisation are two of the basic principles of good experimental design. Confusing pseudo-replication (either at the design or analysis stage) for genuine replication will lead to invalid statistical inferences. Randomisation is necessary to ensure the statistical inference is valid and for guarding against bias.

Another extremely important consideration when designing an experiment, and setting the sample size, is the risk and impact of missing data due, for example, to animal drop-out or equipment failure. Missing data result in a loss of statistical power, complicate the statistical analysis, and have the potential to cause substantial bias (and potentially invalidate any conclusions). Careful planning and management of an experiment will help minimise the amount of missing data. In addition, safeguards, controls or contingencies could be built into the experimental design to help mitigate the impact of missing data. If missing data do occur, appropriate statistical methods to account for them must be applied. Failure to do so could invalidate the entire study.

It is also important that the right data are collected to answer the research questions of interest. That is, the right response and explanatory variables must be measured at the appropriate scale and frequency. There are many statistics-related questions the researchers must answer, including: what population do they want to make inference about? how generalisable do they need their findings to be? what controllable and uncontrollable variables are there? Answers to these questions affect not only the enrolment of animals into the study, but also the conditions they are subjected to and the data that should be collected.

It is essential that the data from the experiment (including meta-data) are appropriately managed and stored to protect their integrity and ensure their usability. If the data get messed up (e.g., if different variables measured on the same animal cannot be linked), are undecipherable (e.g., if the attributes of the variables are unknown) or are incomplete (e.g., if the observations aren’t linked to the structural variables associated with the experimental design), they are likely worthless. Statisticians can offer invaluable expertise in good data management practices, helping to ensure the data are accurately recorded, the downstream results from analysing the data are reproducible and the data themselves are reusable at a later date, possibly by a different group of researchers.

Unsurprisingly, it is also vitally important that the data are analysed correctly, using the methods that draw the most value from it. As expected, statistical expertise plays a huge role here! The results and inference are meaningful only if appropriate statistical methods are used. Moreover, often there is a choice of valid statistical approaches; however, some approaches will be more powerful or more precise than others.

Having analysed the data, it is important that the inference (or conclusions) drawn are sound. Again, statistical thinking is crucial here. For example, in my experience, one all too common mistake in animal studies is to accept the null hypothesis and erroneously claim that a non-significant result means there is no difference (say, between treatment means).

### Statistical aspects to minimise future animal use: sharing the value from the experiment

The other important mechanism for minimising future animal use is to share the knowledge and information gleaned. The most basic step here is to ensure that all the results are correctly and non-selectively reported. Reporting all aspects of the trial, including the experimental design and statistical analysis, accurately and completely is crucial for the wider interpretation of the findings, reproducibility and repeatability of the research, and for scientific scrutiny. In addition, all results, including null results, are valuable and should be shared.

Sharing the data (or resources, e.g., animal tissues) also contributes to reduction. The data may be able to be re-used for a different purpose, integrated with other sources of data to provide new insights, or re-analysed in the future using a more advanced statistical technique, or for a different hypothesis.

### Statistical aspects to minimise future animal use: maximising the information obtained from the experiment

Another avenue that should also be explored is whether additional data or information can be obtained from the experiment, without incurring any further adverse animal welfare impacts, that could benefit other researchers and/or future studies (for example, by helping to address a different research question now or in the future). At the outset of the study, researchers must consider whether their proposed study could be combined with another one, whether the research animals could be shared with another experiment (e.g., animals euthanized for one experiment may provide suitable tissue for use in another), what additional data could be collected that may be (or are!) of future use, etc.

Statistical thinking clearly plays a fundamental role in reducing the number of animals used in scientific research, and in ensuring the most value is drawn from the resulting data. I strongly believe that, to achieve reduction, statistical expertise must be fully utilised throughout the duration of every research project involving animals, from design through to analysis and dissemination of results. In my experience, most researchers strive for very high standards of animal ethics, and absolutely do not want to cause unnecessary harm to animals. Unfortunately, the role statistical expertise plays here is not always appreciated or taken advantage of. So next time you’re thinking of undertaking research involving animals, ensure you have expert statistical input!

Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems.  Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.

Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.

Kanchana Punyawaew and Dr. Vanessa Cave

01 March 2021

Mixed models for repeated measures and longitudinal data

The term "repeated measures" refers to experimental designs or observational studies in which each experimental unit (or subject) is measured repeatedly over time or space. "Longitudinal data" is a special case of repeated measures in which variables are measured over time (often for a comparatively long period of time) and duration itself is typically a variable of interest.

In terms of data analysis, it doesn’t really matter what type of data you have, as you can analyze both using mixed models. Remember, the key feature of both types of data is that the response variable is measured more than once on each experimental unit, and these repeated measurements are likely to be correlated.

### Mixed Model Approaches

To illustrate the use of mixed model approaches for analyzing repeated measures, we’ll examine a data set from Landau and Everitt’s 2004 book, “A Handbook of Statistical Analyses using SPSS”. Here, a double-blind, placebo-controlled clinical trial was conducted to determine whether an estrogen treatment reduces post-natal depression. Sixty-three subjects were randomly assigned to one of two treatment groups: placebo (27 subjects) and estrogen treatment (36 subjects). Depression scores were measured on each subject at baseline, i.e. before randomization (predep), and at six two-monthly visits after randomization (postdep at visits 1-6). However, not all the women in the trial had their depression score recorded on all scheduled visits.

In this example, the data were measured at fixed, equally spaced, time points. (Visit is time as a factor and nVisit is time as a continuous variable.) There is one between-subject factor (Group, i.e. the treatment group, either placebo or estrogen treatment), one within-subject factor (Visit or nVisit) and a covariate (predep).

Using the following plots, we can explore the data. In the first plot below, the depression scores for each subject are plotted against time, including the baseline, separately for each treatment group.

In the second plot, the mean depression score for each treatment group is plotted over time. From these plots, we can see that there is variation among subjects within each treatment group, that depression scores generally decrease with time, and that on average the depression score at each visit is lower with the estrogen treatment than with the placebo.

### Random effects model

The simplest approach for analyzing repeated measures data is to use a random effects model with subject fitted as random. It assumes a constant correlation between all observations on the same subject. The analysis objectives can either be to measure the average treatment effect over time or to assess treatment effects at each time point and to test whether treatment interacts with time.

In this example, the treatment (Group), time (Visit), treatment by time interaction (Group:Visit) and baseline (predep) effects can all be fitted as fixed. The subject effects are fitted as random, allowing for constant correlation between depression scores taken on the same subject over time.

The code and output from fitting this model in ASReml-R 4 follow:

The output from summary() shows that the estimates of the subject and residual variances from the model are 15.10 and 11.53, respectively, giving a total variance of 15.10 + 11.53 = 26.63. The Wald tests (from the wald.asreml() table) for predep, Group and Visit are significant (probability level (Pr) ≤ 0.01). There appears to be no interaction between treatment group and time (Group:Visit), i.e. the probability level is greater than 0.05 (Pr = 0.8636).

### Covariance model

In practice, the correlation between observations on the same subject is often not constant. It is common to expect that measurements made closer together in time are more strongly correlated than those made further apart. Mixed models can accommodate many different covariance patterns. Ideally, we select the pattern that best reflects the true covariance structure of the data. A typical strategy is to start with a simple pattern, such as compound symmetry or first-order autoregressive, and test whether a more complex pattern leads to a significant improvement in the likelihood.

Note: using a covariance model with a simple correlation structure (i.e. uniform) will provide the same results as fitting a random effects model with random subject.

In ASReml-R 4 we use the corv() function on time (i.e. Visit) to specify uniform correlation between depression scores taken on the same subject over time.

Here, the estimate of the correlation among times (Visit) is 0.57 and the estimate of the residual variance is 26.63 (identical to the total variance of the random effects model, asr1).
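The link between the two parameterisations can be checked with a little arithmetic. Under a random effects model with random subject, the implied correlation between any two observations on the same subject is the subject variance divided by the total variance. The sketch below plugs in the estimates reported above and builds the implied uniform (compound symmetry) covariance matrix for the six visits; the variable names are illustrative.

```python
# Variance component estimates reported for the random effects model.
subject_variance = 15.10
residual_variance = 11.53

total_variance = subject_variance + residual_variance
correlation = subject_variance / total_variance  # implied within-subject correlation

print(f"Total variance:           {total_variance:.2f}")
print(f"Within-subject correlation: {correlation:.2f}")

# Implied covariance matrix for one subject's six visits:
# total variance on the diagonal, subject variance everywhere else.
n_visits = 6
covariance = [
    [total_variance if i == j else subject_variance for j in range(n_visits)]
    for i in range(n_visits)
]
for row in covariance:
    print("  ".join(f"{value:6.2f}" for value in row))
```

The ratio reproduces the 0.57 reported by the uniform correlation model, and the diagonal reproduces its residual variance of 26.63.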

Specifying a heterogeneous first-order autoregressive covariance structure is easily done in ASReml-R 4 by changing the variance-covariance function in the residual term from corv() to ar1h().
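To show what a heterogeneous first-order autoregressive structure implies, the sketch below builds such a covariance matrix by hand: each visit has its own variance, and the correlation between two visits decays as a power of the number of visits separating them. The variances and the correlation parameter are hypothetical values chosen for illustration, not estimates from the depression data, and the helper `ar1h_covariance` is not part of ASReml-R.

```python
def ar1h_covariance(variances, rho):
    """Covariance matrix with cov[i][j] = sd_i * sd_j * rho ** |i - j|."""
    sds = [variance ** 0.5 for variance in variances]
    n = len(sds)
    return [
        [sds[i] * sds[j] * rho ** abs(i - j) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical per-visit variances and lag-one correlation.
visit_variances = [30.0, 28.0, 25.0, 24.0, 22.0, 20.0]
rho = 0.6

covariance = ar1h_covariance(visit_variances, rho)
for row in covariance:
    print("  ".join(f"{value:6.2f}" for value in row))
```

Unlike the uniform structure, the off-diagonal entries shrink as visits get further apart, and the diagonal is allowed to change from visit to visit.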

### Random coefficients model

When the relationship of a measurement with time is of interest, a random coefficients model is often appropriate. In a random coefficients model, time is considered a continuous variable, and the subject and subject by time interaction (Subject:nVisit) are fitted as random effects. This allows the slopes and intercepts to vary randomly between subjects, so that a separate regression line is fitted for each subject. Importantly, however, the slopes and intercepts are correlated.

The str() function in the asreml() call is used to fit a random coefficients model:

The summary table contains the variance parameter for Subject (the set of intercepts, 23.24) and Subject:nVisit (the set of slopes, 0.89), the estimate of correlation between the slopes and intercepts (-0.57) and the estimate of residual variance (8.38).
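These estimates can be turned into the variances and covariances the model implies for the observations themselves. The sketch below assumes the usual random intercept and slope formulation, in which the variance of an observation at time t is the intercept variance, plus 2t times the intercept-slope covariance, plus t² times the slope variance, plus the residual variance. It also assumes nVisit is coded 1 to 6, which the text does not state explicitly.

```python
import math

# Estimates reported in the summary table.
intercept_variance = 23.24
slope_variance = 0.89
intercept_slope_correlation = -0.57
residual_variance = 8.38

# Convert the reported correlation into a covariance.
intercept_slope_covariance = intercept_slope_correlation * math.sqrt(
    intercept_variance * slope_variance
)

def implied_variance(t):
    """Variance of an observation at time t under the random coefficients model."""
    return (intercept_variance
            + 2 * t * intercept_slope_covariance
            + t ** 2 * slope_variance
            + residual_variance)

def implied_covariance(s, t):
    """Covariance between observations at two different times on the same subject."""
    return (intercept_variance
            + (s + t) * intercept_slope_covariance
            + s * t * slope_variance)

print(f"Intercept-slope covariance: {intercept_slope_covariance:.3f}")
for visit in range(1, 7):  # assumes nVisit takes the values 1 to 6
    print(f"Visit {visit}: implied variance = {implied_variance(visit):.2f}")
```

Unlike the uniform correlation model, both the variance and the between-visit covariance now depend on time, which is the practical consequence of letting each subject have its own slope.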

### References

Brady T. West, Kathleen B. Welch and Andrzej T. Galecki (2007). Linear Mixed Models: A Practical Guide Using Statistical Software. Chapman & Hall/CRC, Taylor & Francis Group, LLC.

Brown, H. and R. Prescott (2015). Applied Mixed Models in Medicine. Third Edition. John Wiley & Sons Ltd, England.

Sabine Landau and Brian S. Everitt (2004). A Handbook of Statistical Analyses using SPSS. Chapman & Hall/CRC Press LLC.