Dr. Vanessa Cave
10 August 2021

Read any form of scientific document (a journal paper, client report, media article, etc.) and it won’t take you long to find an incorrect statement based on a p-value. A frighteningly common mistake is the misinterpretation of non-significant p-values. Both researchers and their audiences too frequently make the grave error of accepting the null hypothesis, and this can lead to potentially dangerous outcomes.
Consider the following scenario:
A pharmaceutical company has developed a new drug for treating hypertension (i.e., high blood pressure) and they want to prove it is better than the routinely used standard drug. In particular, they want to demonstrate that, compared with the standard drug, their new drug is: (1) more effective at lowering the patient’s blood pressure to the target level, and (2) the same with respect to unwanted side-effects.
To do this, their team of researchers designed and conducted medical trials, then performed hypothesis tests on the resulting data to generate p-values.
Let’s begin by considering the first research aim: to demonstrate that the new drug is more effective at lowering the patient’s blood pressure to the target level than the standard drug. The researchers construct an appropriate statistical test based on the following null (H0) and alternative (H1) hypotheses:
H0: the new drug is no more effective than the standard drug
H1: the new drug is more effective than the standard drug
The resulting p-value measures the strength of evidence against the null hypothesis provided by the data. The smaller the p-value, the stronger the evidence[1]. When the p-value is sufficiently small, the null hypothesis can be rejected in favour of the alternative hypothesis. In other words, with a small p-value the research team may claim that there is statistically significant evidence the new drug is more effective than the standard drug.
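As an illustration, a one-sided two-sample t-test of this kind takes only a few lines of Python. The blood-pressure reductions below are simulated purely for illustration (not real trial data), and scipy’s `ttest_ind` is just one of several ways to carry out such a test:

```python
import numpy as np
from scipy import stats

# Hypothetical blood-pressure reductions (mmHg); simulated for
# illustration only, not real trial data.
rng = np.random.default_rng(1)
standard = rng.normal(loc=10.0, scale=5.0, size=60)  # standard drug
new = rng.normal(loc=13.0, scale=5.0, size=60)       # new drug

# One-sided test: H1 is that the new drug gives a LARGER mean reduction.
result = stats.ttest_ind(new, standard, alternative="greater")
print(f"p-value = {result.pvalue:.4f}")
```

A small p-value here would let the team reject H0 in favour of H1; a large one would tell them only that they have failed to demonstrate superiority.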
So, what can the researchers conclude if the p-value is large (i.e., statistically non-significant)? Essentially nothing!
All the researchers can say is that they have failed to reject the null hypothesis. That is, they haven’t been able to demonstrate that the new drug is more effective than the standard drug.
So, should the pharmaceutical company decide that the new drug isn’t as effective as the standard drug and assign it to the bin? Certainly not!
Just because a p-value is large doesn’t mean that the null hypothesis is true. All a hypothesis test does is measure the strength of evidence against the null hypothesis. That is, we assume the null hypothesis is true until we have enough evidence to reject it. Crucially, we never actually claim that the null hypothesis is true - it is just an assumption!
Let’s explore this important concept in relation to the second research aim: to demonstrate that the new drug is the same as the standard drug with respect to unwanted side-effects.
Our imaginary pharmaceutical research team now construct a significance test with the following null and alternative hypotheses:
H0: the unwanted side-effects of the new drug are the same as the standard drug
H1: the unwanted side-effects of the new and standard drugs are different
Would you be happy to use the new drug based on a large (i.e., statistically non-significant) p-value? I hope not!
The distinction between demonstrating a difference and failing to demonstrate a difference is subtle but very important. Remember, failing to demonstrate a difference is not the same as proving no difference! As discussed above, we never accept the null hypothesis.
So, how can the researchers prove that the side-effects are the same? They must reverse the onus of proof and perform a bio-equivalence test[2].
When we wish to prove a difference, we take the stance of assuming no difference and require proof that this assumption is wrong. Conversely, when we wish to prove things are the same, we must reverse the onus of proof. That is, we take the stance of assuming a difference until we see proof that they are effectively the same[3].
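To make the reversed onus of proof concrete, here is a minimal sketch of the two one-sided tests (TOST) procedure that underlies equivalence testing. The simulated side-effect scores, the equivalence margin `delta`, and the helper name `tost_ind` are all illustrative assumptions; a real bio-equivalence analysis would follow the relevant regulatory guidance:

```python
import numpy as np
from scipy import stats

def tost_ind(x, y, delta):
    """Two one-sided tests (TOST) for equivalence of two means,
    assuming equal variances. `delta` is the equivalence margin:
    we try to show -delta < mean(x) - mean(y) < +delta."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    sp2 = ((nx - 1) * np.var(x, ddof=1) +
           (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    # H0a: diff <= -delta  vs  H1a: diff > -delta
    p_lower = stats.t.sf((diff + delta) / se, df)
    # H0b: diff >= +delta  vs  H1b: diff < +delta
    p_upper = stats.t.cdf((diff - delta) / se, df)
    # Equivalence is claimed only if BOTH one-sided nulls are rejected,
    # so the overall TOST p-value is the larger of the two.
    return max(p_lower, p_upper)

rng = np.random.default_rng(7)
new = rng.normal(2.0, 1.0, 200)       # side-effect score, new drug (illustrative)
standard = rng.normal(2.0, 1.0, 200)  # side-effect score, standard drug
p_tost = tost_ind(new, standard, delta=0.5)
print(f"TOST p-value = {p_tost:.4f}")
```

Note how the roles are reversed: here the null hypothesis is that the drugs differ by at least the margin, and a small p-value is evidence that they are effectively the same.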
Indeed, failing to reverse the onus of proof leads to illogical results. The smaller your sample size, the more likely it is that you will get a non-significant p-value. So “proof” by absence of a statistically significant effect would favour low-powered experiments … which is clearly nonsensical!
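A quick simulation illustrates this point. In the sketch below the new drug truly is more effective, yet with a small sample most simulated trials fail to reach significance; the effect size, sample sizes, and number of simulations are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def nonsignificant_rate(n, n_sims=2000, effect=2.5, sd=5.0, alpha=0.05):
    """Fraction of simulated trials with a one-sided p >= alpha,
    even though the new drug truly is better (effect > 0)."""
    count = 0
    for _ in range(n_sims):
        standard = rng.normal(10.0, sd, n)      # mmHg reduction, standard drug
        new = rng.normal(10.0 + effect, sd, n)  # new drug genuinely better
        p = stats.ttest_ind(new, standard, alternative="greater").pvalue
        if p >= alpha:
            count += 1
    return count / n_sims

rate_small = nonsignificant_rate(n=10)
rate_large = nonsignificant_rate(n=100)
print(f"non-significant: n=10 -> {rate_small:.0%}, n=100 -> {rate_large:.0%}")
```

The small trials return non-significant p-values far more often than the large ones, despite a genuine effect being present in every simulation: exactly why absence of significance cannot be read as evidence of no difference.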
Take home message: It is both wrong and dangerous to accept the null hypothesis
In all forms of hypothesis testing, the null hypothesis is not demonstrated by a non-significant result. Accepting the null hypothesis is not just methodologically wrong, in the real world it leads to poor decision making and potentially very dangerous consequences. Let’s consider the potential impact for patients should the pharmaceutical company erroneously accept the following null hypotheses:
H0: the new drug is no more effective than the standard drug
→ Erroneously accepting the null hypothesis may lead to a new drug, which in reality is indeed better than the standard drug, being disregarded by the pharmaceutical company. Patients would then miss out on the benefits of a superior treatment.
H0: the unwanted side-effects of the new drug are the same as the current drug
→ Erroneously accepting the null hypothesis could result in the pharmaceutical company releasing a drug with terrible side-effects for the patients.
Worryingly, all too often researchers report no statistically significant effect and then discuss their results as if they had proven the null hypothesis. As our examples attempt to illustrate, this fundamental mistake leads to potentially dangerous outcomes. I strongly believe that everyone calculating a p-value, or reporting results based on one, has a responsibility to interpret it correctly, and in particular never to treat a non-significant result as proof of the null hypothesis. Failure to do so may have dire consequences.
I share more of my views and cautionary messages on p-values in a co-authored Guest Editorial for the New Zealand Veterinary Journal, available here: https://www.tandfonline.com/doi/full/10.1080/00480169.2018.1415604.
[1] A commonly used guideline for describing the strength of evidence against the null hypothesis, as provided by the p-value, is: p < 0.001, very strong evidence; 0.001 ≤ p < 0.01, strong evidence; 0.01 ≤ p < 0.05, moderate evidence; 0.05 ≤ p < 0.1, weak evidence; p ≥ 0.1, insufficient evidence.
[2] A bio-equivalence test is required when the aim of the study is to show that two treatments are effectively the same. A superiority hypothesis test is required when the aim is to show that one treatment is superior to another. A non-inferiority hypothesis test is required when the aim is to show that one treatment is not inferior to another. See, for example, Piaggio et al. 2006 for more information.
[3] Exact equality cannot be proven. Instead, we must define a range of values, known as the bio-equivalence range, that is deemed to be effectively the same.
Dr Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems. Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.