The threat of p-value hacking to the integrity of plant and animal breeding

The VSNi Team

17 January 2024

P-values have become an indispensable tool in statistical inference, playing a crucial role in plant and animal breeding research. They provide a quantitative measure of the strength of evidence against a null hypothesis, typically the assumption that there is no effect of interest. However, the use of p-values has been the subject of ongoing debate and controversy, including in the context of analysing the complex data associated with plant and animal breeding programmes.

The Issue of P-Value Hacking

P-value hacking, also known as data dredging or significance fishing, refers to a set of problematic data analysis practices that focus excessively on obtaining statistically significant results (often a p-value < 0.05). These practices typically involve running multiple iterations of an analysis, selectively including or excluding variables, or modifying the methodology until a desired outcome is achieved. In essence, it is a subtle and often unintentional form of manipulation that inflates apparent significance, undermines the integrity of the research and results in misleading conclusions.
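To see how quickly this kind of selective reporting inflates significance, the following minimal Python sketch simulates a trial in which there is truly no difference between two groups. The set-up is entirely hypothetical: 10 traits, 20 individuals per group, normally distributed data with known unit variance, and a simple z-test. It compares an analyst who tests one pre-specified trait with one who reports whichever trait gives the smallest p-value.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_false_positive_rates(n_traits=10, n_per_group=20,
                                  n_sims=2000, alpha=0.05, seed=1):
    """Simulate null data (no true effect) and count 'significant' findings.

    All settings are illustrative assumptions, not recommendations.
    Returns (honest_rate, hacked_rate).
    """
    rng = random.Random(seed)
    # Standard error of a difference in means when the variance is known to be 1.
    se = math.sqrt(2 / n_per_group)
    honest_hits = 0
    hacked_hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_traits):
            group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            z = (sum(group_a) / n_per_group - sum(group_b) / n_per_group) / se
            p_values.append(two_sided_p(z))
        # Honest analysis: only the single, pre-specified trait is tested.
        honest_hits += p_values[0] < alpha
        # Hacked analysis: every trait is tested and the best result is reported.
        hacked_hits += min(p_values) < alpha
    return honest_hits / n_sims, hacked_hits / n_sims


honest_rate, hacked_rate = simulate_false_positive_rates()
print(f"Pre-specified trait: {honest_rate:.1%} false positives")
print(f"Best of 10 traits:   {hacked_rate:.1%} false positives")
```

Under these assumptions the pre-specified analysis should give a false positive rate close to the nominal 5%, while the "best of 10" analysis should be close to 1 - 0.95^10, or about 40%, even though no real effect exists.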

Implications for Plant and Animal Breeding

In plant and animal breeding, scientific progress relies heavily on accurate statistical analyses to make informed decisions. The quest to identify significant associations, correlations or genetic markers that influence desirable traits is a fundamental aspect of any breeding programme. However, manipulating the data analysis until a statistically significant result is achieved (that is, p-value hacking) has serious implications:

False Positives and Misleading Conclusions: P-value hacking increases the risk of producing false positives. These false signals of significance can mislead breeders into believing there's a genuine effect when, in reality, the observed result is merely a statistical artefact (of p-value hacking!). This leads to misleading conclusions about the associations between traits or markers, potentially influencing subsequent breeding strategies. 

Resource Misallocation: Breeding programmes often involve extensive resources, including time, finances and people power. When breeding strategies hinge on misleading findings derived from p-value manipulation or 'hacking', the allocation of resources toward incorrect pathways becomes a serious risk. This misallocation of resources will delay progress and may even prevent the aims of the breeding programme from being achieved. 

Stifling Innovation and Progress: Inaccurate or misleading findings resulting from p-value hacking can perpetuate incorrect assumptions. This can stifle innovation and impede genuine progress by steering researchers away from pursuing more promising avenues or novel genetic markers. 

Reputational Risks: P-value hacking not only undermines the integrity of the research but also poses reputational risks for individuals, research institutions and breeding companies. It can diminish trust in the reliability of research outcomes and hinder collaboration and knowledge sharing. 

Strategies to Minimise P-Value Hacking

Several strategies can help minimise p-value hacking in plant and animal breeding:

Transparent Methodologies: Fully document and freely disclose all analytical methodologies. Clear and complete documentation of the analyses performed, including variable selection criteria and data exclusion decisions, enhances research reproducibility and accountability.

Rigorous Peer Review: Engage in rigorous peer review that critically evaluates the research for potential p-value hacking. Peer reviewers should assess the experimental design, data analysis and interpretation of results to identify any red flags that may indicate questionable practices.

Robust Experimental Design: Employ rigorous experimental designs that minimise confounding factors and provide adequate replication to ensure valid and generalisable results. Well-designed experiments reduce the likelihood of spurious associations, increase the reliability of statistical inferences and improve the precision of estimated effects. Reputable software tools, such as CycDesigN, play a pivotal role in this process, providing breeders with a comprehensive platform for creating robust, reliable and efficient experimental designs. CycDesigN's advanced features empower breeders to implement meticulous methodologies, facilitating the creation of experiments that yield high-quality, dependable results.
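As a small illustration of replication and randomisation, the Python sketch below generates a randomised complete block design in which every treatment appears once in each block. It does not reproduce CycDesigN or its output. The function name, the genotype labels and the seed are hypothetical, and real trials often call for more sophisticated designs than this one.

```python
import random


def randomised_complete_block_design(treatments, n_blocks, seed=2024):
    """Return a list of plots, each treatment appearing once per block.

    The seed is recorded so the randomisation can be reproduced and audited.
    """
    rng = random.Random(seed)
    layout = []
    for block in range(1, n_blocks + 1):
        order = list(treatments)
        rng.shuffle(order)  # independent randomisation within each block
        for plot, treatment in enumerate(order, start=1):
            layout.append({"block": block, "plot": plot, "treatment": treatment})
    return layout


# Hypothetical example: four genotypes replicated across three blocks.
design = randomised_complete_block_design(["G1", "G2", "G3", "G4"], n_blocks=3)
for row in design:
    print(row)
```

Fixing and documenting the seed before the trial is laid out means the allocation of treatments to plots is decided in advance and can be checked later.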

Thorough Data Management: Implement meticulous data management practices, including proper documentation, version control and error checking. Consistent and transparent data management practices prevent data manipulation and ensure the integrity of the data used for analysis. For more on this topic, please check out our blog "Managing datasets for multi-environmental trials: best practices".

Exploratory Data Analysis: Conduct exploratory data analysis (EDA) to gain an understanding of the data and explore patterns before conducting formal statistical tests. EDA provides context and helps prevent misinterpretations that may arise from solely relying on significance tests.

Good Statistical Practices: Leverage the power of robust statistical methods to extract maximum value from your breeding data and generate meaningful, reliable results. Employ software tools like ASReml-R and Genstat, renowned for their reliability and reputation in breeding data analysis. Pre-planning the analysis strategy, including the specification of the hypothesis or hypotheses of interest, is also recommended.
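When a pre-planned analysis involves many tests, for example one test per marker, one widely used safeguard is to adjust for multiple comparisons. The sketch below implements the Benjamini-Hochberg procedure in plain Python as one possible choice. It is not a method prescribed here, it does not use the ASReml-R or Genstat interfaces, and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag which hypotheses are rejected at the given false discovery rate.

    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at most (k / m) * fdr.
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            largest_rank = rank
    # Reject every hypothesis ranked at or below that position.
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


# Hypothetical p-values from ten marker tests (illustrative only).
marker_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                   0.060, 0.074, 0.205, 0.212, 0.216]

naive_hits = sum(p < 0.05 for p in marker_p_values)
adjusted_hits = sum(benjamini_hochberg(marker_p_values, fdr=0.05))
print(f"Unadjusted p < 0.05: {naive_hits} markers")
print(f"After adjustment:    {adjusted_hits} markers")
```

With these invented values, five markers fall below 0.05 before adjustment but only two remain once the false discovery rate is controlled at 5%.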

Education and Training: Engage in training and education on responsible research practices, including the proper interpretation and application of p-values.

Emphasise Effect Size and Confidence Intervals: Focus on effect sizes and confidence intervals alongside p-values. Understanding the magnitude and practical relevance of observed effects provides a clearer perspective on the importance of findings and reduces sole reliance on p-values.
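A minimal Python sketch of this kind of reporting is shown below. It summarises a two-group comparison by the difference in means, an approximate confidence interval and a standardised effect size (Cohen's d). The function name and the trait values are hypothetical, and the interval uses a normal approximation to stay within the standard library. For samples as small as this one, a critical value from the t-distribution would be more appropriate.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_difference_summary(group_a, group_b, confidence=0.95):
    """Difference in means with an approximate CI and Cohen's d."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    difference = mean(group_a) - mean(group_b)
    # Standard error of the difference (unequal variances allowed).
    se = math.sqrt(var_a / n_a + var_b / n_b)
    # Normal critical value: an assumption that suits large samples best.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Pooled standard deviation for the standardised effect size.
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                          / (n_a + n_b - 2))
    return {
        "difference": difference,
        "ci": (difference - z * se, difference + z * se),
        "cohens_d": difference / pooled_sd,
    }


# Hypothetical trait measurements for two lines (illustrative only).
line_a = [5.1, 5.4, 4.9, 5.6, 5.3, 5.2]
line_b = [4.8, 5.0, 4.7, 5.1, 4.9, 4.6]

summary = mean_difference_summary(line_a, line_b)
low, high = summary["ci"]
print(f"Difference in means: {summary['difference']:.2f}")
print(f"Approximate 95% CI:  ({low:.2f}, {high:.2f})")
print(f"Cohen's d:           {summary['cohens_d']:.2f}")
```

Reporting the interval and the effect size alongside any p-value shows how large the difference is likely to be, and lets a breeder judge whether it matters in practice.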

Conclusion

The integrity of research findings is paramount to the success of plant and animal breeding programmes and scientific progress more broadly. However, p-value hacking poses a significant threat, potentially distorting outcomes and steering breeding efforts in erroneous directions. By adhering to transparent methodologies, emphasising robust statistical practices and fostering a culture of ethical research conduct, the breeding community can safeguard against the pitfalls of p-value hacking, ensuring the reliability and validity of scientific advancements in the field.

At VSNi, we are committed to providing researchers and breeders with the tools and expertise they need to conduct high-quality research. Our team of experts can assist you in:

  • Developing and implementing data management plans
  • Conducting exploratory data analysis
  • Employing robust statistical methods
  • Interpreting and communicating research findings accurately and effectively

If you are interested in learning more about how to improve the reliability of your research findings, we encourage you to contact us. We are here to help you achieve your research goals and contribute to the advancement of plant and animal breeding.

Further reading:

Demystifying the p-value: Understanding its meaning and application

Significance levels: a love-hate relationship

Never accept the null hypothesis: It’s both wrong and dangerous!

Decoding hypothesis testing: investigating type I and type II errors

What is the multiple comparison problem in statistics?

The power of confidence intervals: beyond p-values

From data to insights: improving plant breeding outcomes