Navigating the Multiple Comparison problem in statistics: false positives and solutions

Dr. Salvador A. Gezan

28 September 2022

In the majority of statistical analyses, we run into the ‘Multiple Comparison Problem’. This is the case where we draw statistical inferences about several hypotheses simultaneously. In this blog, we discuss the multiple comparison problem in depth with examples.

So what is the multiple comparison problem in statistics? When a set of statistical inferences is considered simultaneously, the multiple comparison problem occurs. Due to this, there is a higher chance of finding a false positive.

Read on to find out more about the multiple comparison problem and how it can impact your results.

How does the multiple comparison problem impact statistics?

Here is an example of the multiple comparison problem: if we have t = 5 treatments to compare against each other, then there are a total of k = 10 pairwise tests to perform. With so many tests, even if nothing is really different, we are likely to find by chance at least one significant result (a false positive). We can correct our testing procedure for multiple comparisons by using a stricter (lower) significance level, but this has the disadvantage of reducing our statistical power to detect real differences! Hence, what do we do? In these notes, we will discuss some of the elements associated with this problem and make recommendations on how to address these issues.

What is the meaning of a significance level in statistics?

First, let’s refresh the meaning of a significance level. If you want more details, this was discussed previously in another blog. A significance level refers to a value, say α = 0.05, that we use to decide if we should reject a null hypothesis (e.g., no difference in the means between two treatments). This α value represents the proportion of times (here 5%) that we expect to reject the null hypothesis when in fact it is true. The issue with multiple comparisons is that we are now performing, say, k comparisons (10 in our case, calculated as k = t × (t - 1)/2), so the chance of at least one mistake is much higher. Assuming the k tests are independent, the probability of at least one such mistake, written $\alpha_f$ (the experiment-wise significance level), is:

$$\alpha_f = 1 - (1 - \alpha)^k$$

$$\alpha_f = 1 - (1 - 0.05)^{10} = 0.401$$

Hence, with k = 10 tests, there is a 40% chance that we will incorrectly declare one or more pairs of means to be different. If we had a total of t = 15 treatments to compare, then we would need to do k = 105 comparisons, for which the probability of at least one mistake is 0.995!
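The two calculations above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `n_pairwise` and `experiment_wise_alpha` are made up for illustration, and the formula assumes the k tests are independent, which is only approximately true for pairwise comparisons that share treatments.

```python
from math import comb


def n_pairwise(t: int) -> int:
    """Number of all-pairs comparisons among t treatments: t * (t - 1) / 2."""
    return comb(t, 2)


def experiment_wise_alpha(alpha: float, k: int) -> float:
    """Chance of at least one false positive in k independent tests,
    each carried out at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** k


for t in (5, 15):
    k = n_pairwise(t)
    alpha_f = experiment_wise_alpha(0.05, k)
    print(f"t = {t}, k = {k}, alpha_f = {alpha_f:.3f}")
```

For t = 5 and t = 15 this gives k = 10 and k = 105, with experiment-wise levels of 0.401 and 0.995, matching the figures in the text.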

What are some solutions for the multiple comparison problem?

As you can see, this problem is concerning. In the following sections we will describe the range of options, from doing nothing to full experiment-wise protection. But keep in mind that these are not mutually exclusive, as some analyses might be approached with more than one option.

1. Continue with our original significance level

One option is to continue using our original significance level, say α = 0.05, for each of the k comparisons (that is, the per-comparison level is left unadjusted, and the experiment-wise level $\alpha_f$ is allowed to grow with k). Yet, we have to recognise that this is likely to result in false positives. Therefore, depending on the magnitude of k, we should expect to report some excess of significant results.

As long as we are aware of this, and we make this explicit when reporting our analysis, then this shouldn’t be much of a problem. Note that if the objective of the study is discovery, then having some false positives that will later be re-assessed in further studies/analyses (i.e., additional testing with fewer treatments) is not a bad thing. 
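To get a feel for how large this excess might be, here is a small simulation sketch. It rests on two simplifying assumptions that are not part of the original discussion: when every null hypothesis is true each p-value is uniform on [0, 1], and the k tests are independent. The function name, the number of simulated experiments and the seed are all arbitrary illustrative choices.

```python
import random


def simulate_false_positives(k, alpha=0.05, n_experiments=20_000, seed=1):
    """Simulate experiments in which all k null hypotheses are true.

    Returns (average number of false positives per experiment,
             share of experiments with at least one false positive).
    """
    rng = random.Random(seed)
    total_hits = 0
    experiments_with_hit = 0
    for _ in range(n_experiments):
        # Under a true null, a p-value falls below alpha with probability alpha.
        hits = sum(rng.random() < alpha for _ in range(k))
        total_hits += hits
        experiments_with_hit += hits > 0
    return total_hits / n_experiments, experiments_with_hit / n_experiments


mean_fp, share_any = simulate_false_positives(k=10)
print(f"average false positives per experiment: {mean_fp:.2f}")
print(f"share of experiments with at least one: {share_any:.2f}")
```

With k = 10 and α = 0.05 the average should settle near k × α = 0.5 false positives per experiment, and the share of experiments with at least one false positive near 0.40, up to simulation noise.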

2. Use an experiment-wise significance level

Let’s move to the other extreme. In this case we fix $\alpha_f$, the experiment-wise significance level across all k tests. This means that for ALL tests we are going to perform, we will set a significance level for an individual test ($\alpha_b$) that will result in an error rate of $\alpha_f$ × 100% (e.g., 5%) in the complete study. This is thinking of the experiment as a whole event, for which we do not want to report excessive false positive rates. The calculation of this $\alpha_b$ is based on the famous Bonferroni correction, and its expression is:

$$\alpha_b = 1 - (1 - \alpha_f)^{1/k} \approx \alpha_f / k$$

If we set our $\alpha_f$ to be 0.05 (5%) then the $\alpha_b$ for a single test changes to 0.0050 and 0.0005 for multiple comparisons on 5 and 15 treatments, respectively.
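As a quick check of these numbers, the sketch below computes the per-test level in two ways: by solving $\alpha_f = 1 - (1 - \alpha_b)^k$ exactly for $\alpha_b$, and by the simpler Bonferroni division $\alpha_f / k$. The function name `per_test_alpha` is hypothetical, and the exact form again assumes independent tests.

```python
from math import comb


def per_test_alpha(alpha_f: float, k: int) -> tuple[float, float]:
    """Return (exact, bonferroni) per-test significance levels.

    exact:      1 - (1 - alpha_f) ** (1 / k), assuming k independent tests
    bonferroni: alpha_f / k, the usual simple approximation
    """
    exact = 1.0 - (1.0 - alpha_f) ** (1.0 / k)
    bonferroni = alpha_f / k
    return exact, bonferroni


for t in (5, 15):
    k = comb(t, 2)  # all-pairs comparisons among t treatments
    exact, bonf = per_test_alpha(0.05, k)
    print(f"t = {t}, k = {k}, exact = {exact:.5f}, bonferroni = {bonf:.5f}")
```

The two versions agree closely (roughly 0.0051 versus 0.0050 for k = 10), and the Bonferroni value is always the slightly stricter of the two.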

Now it is a lot harder to report significant results for a single test, but on the other hand we are being careful not to report too many false positives in the complete experiment. This is a conservative approach, but one of its main criticisms is that when the number of treatments, t, is large then in many cases $\alpha_b$ is so small that nothing is significant! This may be what we want for protection against false positives, but it also might result in no true differences (or effects) being identified as we are limiting our statistical power greatly.

3. Implement one of the suggested post-hoc tests

In order to avoid the above extremes, there are several control methods available, which are often known as post-hoc tests. You’ve probably heard of Duncan, Tukey, Scheffé, Dunnett and many other multiple comparison methods that are often reported. They all try to achieve a compromise between no control and full control. This is not the place to describe each of them, but they have interesting assumptions and are worth checking out in detail.

The issue then is which test to use (unless this is chosen for you). This is a difficult decision, but most statistical software gives you options to request many of them simultaneously (for good or bad). However, the arbitrariness of the decision on which test to use is always going to be present, and this can lead to conflict with other scientists (and particularly reviewers!). In any case, the use of these tests should give you a reasonable compromise and some peace of mind!

4. Focus on planned comparisons

When planning your study, if you clearly define your hypotheses of interest and these correspond to a relatively small number, then it is generally understood that you do not need any correction of your significance level. The key here is that:

  1. your focus is on just those pre-planned comparisons, 
  2. these are an integral part of your experimental design, and 
  3. you will not perform additional comparisons (or hypothesis tests) after exploring your data.

This will ensure that you maintain the levels of false positive rates stated at the beginning of your study (for which you decided a reasonable level of risk beforehand).

Let’s illustrate this with an example. If the main objective of your study is to compare your treatment formulations against a commercial control, then you have only t - 1 comparisons of interest. For example, with t = 5, we will do 4 comparisons instead of 10. As each of these pre-planned comparisons represents one of your original hypotheses of interest, you can always use your pre-defined significance level with no adjustment. Here, you are saying that you are willing to report the comparison of each new formulation against the commercial control with only α × 100% (e.g., 5%) expected false positives. This logic is based on a probabilistic bonus of performing a carefully planned experiment (including good replication and power) with only a few reasonable comparisons.

In the above example, there is no pre-planned interest in comparing each treatment against each other, so why do that and risk reporting many false positives? Here, you have to think of gambling: the more times you bet, the more often you will lose!
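The difference between the two sets of comparisons is easy to see by listing them. In the sketch below the treatment labels are hypothetical placeholders for one commercial control and four new formulations (t = 5).

```python
from itertools import combinations

# Hypothetical labels: one commercial control plus four new formulations.
treatments = ["control", "A", "B", "C", "D"]

# Every treatment against every other treatment: t * (t - 1) / 2 comparisons.
all_pairs = list(combinations(treatments, 2))

# Only the pre-planned comparisons: each formulation against the control.
planned = [(trt, "control") for trt in treatments if trt != "control"]

print(f"all-pairs comparisons: {len(all_pairs)}")
print(f"planned comparisons:   {len(planned)}")
for new_formulation, control in planned:
    print(f"  {new_formulation} vs {control}")
```

Fixing the shorter list at the planning stage, and sticking to it, is what keeps the number of bets small.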

How can the multiple comparison problem impact structured treatments?

An important aspect that we have not mentioned is when the treatments have a structure (for example, a two-way factorial arrangement). In this case, multiple comparisons are inappropriate, as we should focus on evaluating the main effects and their interactions instead of brute-force comparisons. Something similar happens when we have a factor that is better dealt with as a continuous covariate, which provides a different (and often better) insight into the dynamics of this effect.

One element to be careful of is that most statistical software will allow you to run, in just seconds, several of the multiple comparison tests. Running them all and then choosing is bad practice, as you end up picking the method that best fits your preconceptions about the results of the study (and this can lead to scientific fraud). In order to avoid this, it is much better to decide on the testing method during the planning stage of your study.

A few last words

As you have seen, the key for any of these studies is the specification of the hypothesis or hypotheses of interest before you do the analyses, and even better before you start the experiment. It is at that moment that the level of risk you are willing to accept is decided upon (and note that it does not have to be one of the traditional 0.01, 0.05 or 0.10 levels). Here, you determine this risk according to the study’s objective, the stage of research and the current and potential future uses of the study results.

Finally, you also have to consider that your decision to control or not to control the multiple comparison experiment-wise significance level will likely create some conflict or disagreement between you, other scientists, reviewers, and/or your target audience. In order to avoid this, it is always recommended you report all results. This includes:

  • treatment averages (or differences)
  • the size of the effects, a measure of variability, and 
  • the full set of p-values for the respective tests.

This will help the reader to decide if those numbers are sufficient evidence for them, and not just for you!

About the author

Dr. Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world. 

Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook Statistical Methods in Biology: Design and Analysis of Experiments and Regression.