Managing datasets for multi-environmental trials: best practices

The VSNi Team

19 May 2023

Multi-environmental trials (MET) are an essential part of plant breeding programmes to select the best cultivars for a zone or for specific environments. However, managing the large datasets generated by these trials can be a challenging task. In this blog, we will discuss some of the best practices for managing datasets for MET analyses.

Flexibility to manage a variety of data

MET data can come from a variety of sources, including different crops, markets or breeding zones, which makes it a challenge to manage. Flexibility in the data management system is therefore essential: it must handle dozens of traits (or measurements) and different data types, including phenotypic and molecular data, and accommodate an array of experimental designs and breeding objectives. It also needs to handle different levels of data granularity, such as measurements at the plot level, plant level or even leaf level. Data management software without this flexibility can lead to inconsistencies and a less accurate picture of crop performance.

Value in accumulated data

Plant breeding programmes often collect data over several years, or even decades, providing insights into the performance of crops across varied environments and under different management conditions. This MET data can have many additional applications, such as identifying trends, making predictions, and developing new breeding strategies. Data accumulated over time can be incredibly valuable for future research and applications, including machine learning (ML) and artificial intelligence (AI). Our consulting services specialise in incorporating historical and accumulated data into your analysis pipeline. Investing in robust data management that can accumulate data over generations is wise for plant breeding programmes, as it can save time, increase efficiency, and ultimately lead to better performance and faster breeding outcomes.

Metadata planning

In plant breeding trials, it is essential to consider and manage several types of metadata. One critical type is experimental design information, which describes the layout of the trial, the types of data that will be collected, and the statistical methods that will be used to analyse the results. Other metadata to consider include measurement protocols and trial management details such as fertilisation, irrigation, and pest control. CycDesigN is a tool that helps generate experimental designs for your trials and supports gains in the efficiency and effectiveness of your tests. Planning metadata storage and management beforehand ensures that the most relevant and accurate data is collected, saving time and resources.
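As a rough illustration of planning metadata up front, the sketch below defines a small record for one trial and reports which management details have not been recorded yet. The class name, field names and example values are all assumptions made for this example; they are not a standard schema and are not taken from CycDesigN.

```python
from dataclasses import dataclass, field, asdict

# All field names and values below are illustrative assumptions, not a standard schema.
@dataclass
class TrialMetadata:
    trial_id: str
    design: str                      # e.g. "row-column" or "alpha-lattice"
    n_replicates: int
    planned_analysis: str            # statistical method intended for the analysis
    measurement_protocols: dict = field(default_factory=dict)  # trait -> protocol
    management: dict = field(default_factory=dict)             # practice -> details

    def missing_management(self, required=("fertilisation", "irrigation", "pest_control")):
        """Return the required management entries that have not been recorded yet."""
        return [name for name in required if not self.management.get(name)]

meta = TrialMetadata(
    trial_id="TRIAL-2023-01",
    design="row-column",
    n_replicates=2,
    planned_analysis="linear mixed model",
    measurement_protocols={"height_cm": "ruler, soil surface to flag leaf tip"},
    management={"fertilisation": "N 120 kg/ha", "irrigation": "rainfed"},
)

gaps = meta.missing_management()   # management details still to be planned
record = asdict(meta)              # plain dict, ready to store alongside the data
```

Here `gaps` lists `pest_control`, a reminder that this detail still needs to be recorded before the trial data is considered complete.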

Handling repeated measures

It is often necessary to measure crops multiple times over the growing season. This is particularly important when studying traits that change over time, such as plant height, yield or susceptibility to diseases. However, handling repeated measures can be a lot of work, requiring additional checks and slightly different types of data storage and manipulation protocols. One of the main issues is ensuring that the same plant or plot is measured consistently over the trial duration. It may also be necessary to take additional measurements over time, such as soil moisture or temperature, to account for environmental variability between measurements. Additionally, repeated measures may produce large amounts of data that can be complex to analyse. Therefore, it is important to have a well-designed data management strategy, which may involve storing data in a structured format that allows for easy manipulation and analysis or using analysis software designed specifically for handling repeated measures data. All of this ultimately leads to increased precision and better use of the available information.
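One possible structured format for repeated measures is a "long" layout, with one row per plot, trait and measurement date. The sketch below assumes that layout and checks whether every plot was measured on every date. The record fields (`plot_id`, `trait`, `date`, `value`) and the data are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical long-format records: one row per plot, trait and measurement date.
records = [
    {"plot_id": "P001", "trait": "height_cm", "date": date(2023, 5, 1), "value": 31.0},
    {"plot_id": "P001", "trait": "height_cm", "date": date(2023, 6, 1), "value": 58.5},
    {"plot_id": "P002", "trait": "height_cm", "date": date(2023, 5, 1), "value": 29.4},
    {"plot_id": "P002", "trait": "height_cm", "date": date(2023, 6, 1), "value": 55.1},
    {"plot_id": "P003", "trait": "height_cm", "date": date(2023, 5, 1), "value": 30.2},
]

def find_incomplete_series(rows):
    """Return {(plot_id, trait): missing dates} for plots not measured on every date."""
    dates_by_trait = defaultdict(set)   # every date on which a trait was measured
    seen = defaultdict(set)             # dates actually recorded for each plot and trait
    for row in rows:
        dates_by_trait[row["trait"]].add(row["date"])
        seen[(row["plot_id"], row["trait"])].add(row["date"])
    missing = {}
    for (plot, trait), dates in seen.items():
        gap = dates_by_trait[trait] - dates
        if gap:
            missing[(plot, trait)] = sorted(gap)
    return missing

incomplete = find_incomplete_series(records)
```

In this example plot `P003` has no June height measurement, so it is reported for follow-up. The long layout also makes it easy to add new measurement dates or environmental covariates as extra rows without changing the structure.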

Molecular data management

Molecular data is important in MET analysis and plays a crucial role in the identification of genes associated with desirable traits and in the use of predictive tools such as genomic selection. However, molecular data requires different management from that traditionally applied to phenotypic data. This is because molecular data is often high throughput, meaning that large volumes of data are generated in a short amount of time. It is also crucial to ensure that the data is well organised and properly annotated, with detailed information about the genotypes, markers, and experimental conditions (i.e., good metadata). Although it is often an expensive investment, effective management of molecular data is essential for the success of plant breeding programmes, as it allows breeders to identify and select plants with desirable traits more accurately and efficiently.

Data consistency and definitions

To ensure the accuracy of statistical analyses in plant breeding programmes, it is important to check the data for consistency before performing any analysis. This involves addressing issues such as outliers and large effects of diseases, stressors, or other sources of variability that could affect the results. It is also essential to keep the definitions of factors and covariates consistent when combining data from current and historical trials. Redundant or duplicated records should be identified and resolved so that the combined dataset is well formed. By ensuring consistency in data definitions and protocols, plant breeders can be confident they are making informed selections based on accurate and reliable data.
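Two of these checks can be sketched in a few lines. The example below maps historical site labels onto one agreed set of factor levels and flags unusual values with a robust z-score based on the median and the median absolute deviation. The alias table, the 3.5 threshold and the yield values are assumptions chosen for illustration; flagged values should be reviewed, not removed automatically.

```python
import statistics

# Hypothetical mapping from legacy site labels to the currently agreed factor levels.
SITE_ALIASES = {"north farm": "NORTH", "n. farm": "NORTH", "north": "NORTH", "south": "SOUTH"}

def harmonise_site(label):
    """Map a raw site label onto one agreed level; raise if the label is unknown."""
    key = label.strip().lower()
    if key not in SITE_ALIASES:
        raise ValueError(f"Unmapped site label: {label!r}")
    return SITE_ALIASES[key]

def flag_outliers(values, threshold=3.5):
    """Flag values whose robust z-score (median and MAD based) exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median, so nothing can be flagged on this basis.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

sites = [harmonise_site(s) for s in ["North Farm", " N. farm", "south"]]
yields = [5.1, 4.9, 5.3, 5.0, 5.2, 12.8]   # t/ha, the last value looks suspicious
flags = flag_outliers(yields)
```

Raising an error on an unknown label is a deliberate choice here: it forces a new or misspelled level to be resolved explicitly instead of silently becoming an extra factor level in the combined dataset.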

Performing statistical analyses

After data cleaning and filtering, statistical analyses can be performed using statistical software. ASReml-R is a powerful and trustworthy data analysis software package commonly used for this purpose in plant breeding programmes. This package offers a wide range of linear mixed models that can be fitted to analyse data from multi-environmental trials. The models fitted can be as simple or as complex as required, depending on the nature of the data and the objectives of the breeding programme. In some cases, one-stage models can be used to analyse data from all environments simultaneously. In other cases, two-stage models may be more appropriate, where each environment is analysed separately in the first stage and the results are combined in a second stage.
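To make the two-stage idea concrete, here is a deliberately simplified sketch that uses plain arithmetic means in place of linear mixed models. It is not ASReml-R code and does not reproduce what that package does; a real second stage would usually also weight each environment's estimates by their precision. The data and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical plot-level yields: (environment, genotype, yield in t/ha).
plots = [
    ("ENV1", "G1", 5.0), ("ENV1", "G1", 5.4),
    ("ENV1", "G2", 4.2), ("ENV1", "G2", 4.6),
    ("ENV2", "G1", 6.1), ("ENV2", "G1", 5.9),
    ("ENV2", "G2", 6.4), ("ENV2", "G2", 6.8),
]

def stage_one(rows):
    """Stage 1: summarise each environment separately as genotype means."""
    grouped = defaultdict(list)
    for env, geno, value in rows:
        grouped[(env, geno)].append(value)
    return {key: mean(values) for key, values in grouped.items()}

def stage_two(env_means):
    """Stage 2: combine the per-environment means into one value per genotype."""
    grouped = defaultdict(list)
    for (env, geno), value in env_means.items():
        grouped[geno].append(value)
    # Unweighted average across environments (a simplifying assumption).
    return {geno: mean(values) for geno, values in grouped.items()}

env_means = stage_one(plots)
overall = stage_two(env_means)
```

In this toy dataset G1 performs better in ENV1 while G2 performs better in ENV2, so the per-environment table from the first stage carries information that the overall means alone would hide.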

The choice of model and analytical strategy depends on several elements, including the number of environments, the size of the dataset, and the complexity of the genetic and environmental effects. The results of statistical analyses provide valuable information on the performance of genotypes across different environments, the heritability of traits, and the genetic correlations between environments, all of which help us understand genotype-by-environment interaction. Such information aids plant breeders in making informed decisions about which genotypes to advance in their breeding programme for which zone, and which traits to prioritise for improvement.

Data preservation for future analyses

Once statistical analyses have been performed, the resulting breeding values or total genetic values should be stored for future use. Keeping this data is essential for future decision-making, particularly selection. It is also crucial to store metadata related to these analyses, including the type of model and data used, the software version, and the criteria for filtering. Once more, properly managing this metadata is important for making the analysis process repeatable and for ensuring that future analyses are conducted with the same references and methods. By storing both the results and the metadata from statistical analyses, plant breeders can access and utilise valuable data that can inform future breeding strategies.
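One simple way to keep results and their metadata together is to store them in a single record. The sketch below assumes a JSON layout; the key names, file name, software name and version string are placeholders, not recommendations or real values.

```python
import json

def build_analysis_record(results, model, data_file, software, version, filters):
    """Bundle genetic values with the metadata needed to repeat the analysis."""
    return {
        "results": results,
        "metadata": {
            "model": model,
            "data_file": data_file,
            "software": {"name": software, "version": version},
            "filters": filters,
        },
    }

# Every value below is a placeholder for illustration only.
record = build_analysis_record(
    results={"G1": 5.6, "G2": 5.5},
    model="two-stage analysis, genotype means combined across environments",
    data_file="met_yield_2023.csv",
    software="example-software",
    version="0.0.0",
    filters=["plots flagged as outliers removed", "trials with fewer than 2 replicates removed"],
)

serialised = json.dumps(record, indent=2, sort_keys=True)   # text to write to disk
restored = json.loads(serialised)                           # what a later analysis reads back
```

Because the record survives a round trip through plain text unchanged, it can be archived next to the dataset and read back years later without depending on the software that produced it.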

Final thoughts

Managing data for MET can be challenging, but it is an essential investment. A clean, reliable, and easily accessible dataset can provide more reliable and quicker results, leading to greater genetic gains and faster release of varieties. Moreover, we never know what else we might want to use this data for in the future. VSNi consultancy services are here to support researchers in effectively managing data for plant breeding trials. With our expertise, we can assist in implementing robust data management practices that keep your datasets clean, reliable, and accessible. An investment in your data today is also an investment in the future of your breeding programme.

For more information on how VSNi consultancy services can assist you in managing your data for plant breeding trials, please don't hesitate to contact us.
