Big Data and technology: How can smallholder farmers benefit?

Big Data and technology: How can smallholder farmers benefit?

Dr. Jean-Marcel Ribaut

02 February 2022

The Integrated Breeding Platform

Hello, my name is Jean-Marcel Ribaut and I am Director of the Integrated Breeding Platform (IBP). We have been working with VSNi for almost 10 years on the development and distribution of data management software known as the Breeding Management System (BMS), a software suite specifically designed to help plant breeders manage day-to-day activities through all phases of their breeding programmes. The BMS helps breeders conduct their routine activities with more efficiency, so that they can develop improved cultivars faster and at lower costs; it combines information management, data analysis and decision-support tools that accommodate common breeding schemes. The BMS is important in the demand-led breeding approach, however, it is also part of a broader landscape that includes farming practices and seed distribution that are common bottlenecks along the crop value chain; it is this unifying approach that brings me to the subject of today’s blog: How can smallholder farmers benefit from advances in technology and access to big data?

Before going further, it is important to clarify that farmers are not going to access big data directly and nor do they have an interest in it; they want the knowledge and information that may be extrapolated from the big data and, thanks to the ready availability and low cost of smart phones and other digital technologies, they have the means by which they can use the output of big data analysis and machine learning approaches.

Why smallholder farmers?

Smallholder agriculture is the backbone of many developing country economies, both in terms of food production and employment. Recent studies have shown that about 500 million smallholder farming households globally account for one-third of total food production. This is significant at several levels and good reason to apply the latest tools in data and technology to support smallholder producers. There is considerable potential for smallholder producers to increase their production and incomes while maintaining their environment and meeting the demands of newly emerging markets. A perfect storm, but in a nice way!

An holistic approach is needed

The IBP has focused on data management for the development of new crop cultivars; plant breeding has been our take-off point. However, we have come to realise that new cultivars, per se, are not going to change the livelihoods of smallholder farmers in the absence of an enabling environment that will include data and information about new cultivars, the availability and cost of seed, good agricultural practices and access to markets. Therefore, we are broadening the original focus on plant breeders to adopt a more holistic or value chain approach to the development, distribution and utilisation of new cultivars. In doing so, we place smallholder farmers at the centre of this new paradigm: their engagement and ownership of the process is vital. With it comes the foundations of a sustainable business model that may be scaled out geographically and across commodities. 

The inputs needed by smallholder farmers include not just seed, fertiliser and other physical inputs but also advice, and it is here that the power of large databases and information and communication technologies becomes more obvious. Increasingly, farmers need to make predictions about their crops and inputs and this is where new and powerful algorithms and artificial intelligence can be combined, together with vast amounts of data to answer the questions they want to ask. 

It is a paradigm shift to go from posing a question and then collecting the data to accessing huge datasets with modern tools and formulating the question.

For example, geo-located data about crop development and weather conditions, collected during the crop cycle, can be used immediately to inform the future development of varieties and optimize management practices to maximize harvest. Climate change is happening and climate resilience will take many shapes and forms, including the ability for smallholder farmers to pre-empt future rainfall and temperature patterns and to optimise their requirements for the judicious use of pesticides and fertiliser = Green Agriculture. Similarly, with urbanisation and other demographic changes across developing countries there is an associated shift in market demands. The ability of smallholder farmers to respond to these demands, armed with the best available knowledge and information will be paramount in their quest for improved livelihoods and sustainability.

About the Author

Jean-Marcel holds a PhD in Plant Physiology from Lausanne University, Switzerland. His scientific background is in plant physiology and genetics. Jean-Marcel was first the Director of the CGIAR Generation Challenge Program (GCP), where he was responsible for leading and coordinating a large international network of partners in modern crop breeding for food security (2005-2015). Born of the GCP, the IBP went on to become an in dependent non-profit, and Jean-Marcel has taken on the role of IBP Director ever since. Having accumulated experience in agriculture biotechnology, bioinformatics, and plant science, as well as leadership skills for dispersed global R&D teams, he has a particular interest in promoting modern technologies to hasten crop improvement and seed production in the developing world. Jean-Marcel's underlying motivation is to help  weave effective and interactive communities of crop researchers to bridge the gap between basic and applied agricultural science. He believes in true partnership and solid capacity building to overcome the bottlenecks in R4D, working with developing-country partners as key actors to lead the research arena and empower a culture of change.

IBP October 2021 newsletter

Related Reads


Dr. Andrew Illius and Dr. Nick Savill

20 July 2022

Quantification holds the key to controlling disease


Andrew Illius with Nick Savill have, since 2018, studied the epidemiology and control of maedi-visna virus (MV) in sheep and have been looking at understanding and finding ways of controlling this incurable disease. Accessing published data and with the use of Genstat, they aimed to find ways of controlling MV.

When one of your sheep gets diagnosed with an incurable disease, you have to worry. How quickly will symptoms develop, and welfare and productivity suffer? And how soon will it spread throughout the rest of the flock? The disease in question is maedi-visna (MV, see Box 1), notorious for its impact in Iceland, where the disease was first described; extreme measures over 20 years were required before it was finally eliminated. Culling seropositive animals is the main means of control. For the farmer, the crucial question is whether living with the disease would be more expensive than trying to eradicate it. We are addressing such questions by analysing data from long-term experiments. 

1            MV – the tip of an iceberg?

Putting aside for a moment MV’s fearsome reputation, the way the pathogen works is fascinating. The small ruminant lentiviruses (SRLV, family retroviridae) are recognised as a heterogeneous group of viruses that infect sheep, goats and wild ruminants. Lentiviruses target the immune system, but SRLV does not target T-cells in the manner of immune deficiency lentiviruses such as HIV. Instead, SRLV infects monocytes (a type of white blood cell) which infiltrate the interstitial spaces of target organs (such as the lungs, mammary glands, or the synovial tissue of joints) carrying proviral DNA integrated into the host cell genome and hence invisible to the immune system. Virus replication commences following maturation of monocytes into macrophages, and the ensuing immune response eventually shows up as circulating antibodies (termed seroconversion). But it also causes inflammation that attracts further macrophages, slowly and progressively building into chronic inflammatory lesions and gross pathology. These take years to present clinical symptoms, hence the name lentivirus (from the Latin lentus, slow). By the time clinical signs become evident in a flock, the disease will have become well-established, with perhaps 30-70% of the flock infected. That is why MV is called one of the iceberg diseases of sheep – for every obviously affected individual, there are many others infected, but without apparent symptoms. 

A large body of research into the pathology, immunology and molecular biology of small ruminant lentiviruses (SRLV) exists, as might be expected given its economic significance, welfare implications and its interest as a model for HIV. The main route of transmission of the virus is thought to be horizontal, via exhaled droplets of the highly infectious fluid from deep in the lungs of infected animals, suggesting a risk from prolonged close contact, for example in a sheep shed. But despite all the research into disease mechanisms, we were surprised to find that there has been almost no quantitative analysis of SRLV epidemiology, nor even an estimation of the rate of SRLV transmission under any management regime. So, our first foray into the data aimed to rectify this

Quantification to the rescue

We found an experiment published in 1987 with excellent detail on a five-year timecourse of seroconversions in a small infected sheep flock, and a further trawl of the Internet netted a PhD thesis that built on this with a focussed experiment. Karianne Lievaart-Peterson, its author, runs the Dutch sheep health scheme and became a collaborator. We also worked with Tom McNeilly, an immunologist at the Moredun Research Institute.

Nick Savill, a mathematical epidemiologist at Edinburgh University, did the hard work of developing and parameterising a mathematical model based on infectious disease epidemiology and a priori known and unknown aspects of SRLV biology. The model determines the probability of a susceptible ewe seroconverting when it did, and of a susceptible ewe not seroconverting before it was removed from the flock or the experiment ended. The product of these probabilities gives the likelihood of the data given the model. The model was prototyped in Python and then written in C for speed. 

The striking result of this research is that MV is a disease of housing. Even brief periods of housing allow the virus to spread rapidly, but transmission is negligible between sheep kept on pasture So, although individual sheep never recover from the disease, it could be eliminated from flocks over time by exploiting the fact that transmission of the virus is too slow between grazing sheep to sustain the disease.

It’s about timing

Our second striking result suggests the disease is unlikely to be spread by newly-infected animals, contrary to general expectation. We estimated that the time between an animal being infected and becoming infectious is about a year. This delay, termed epidemiological latency, is actually longer than the estimated time delay between infection and seroconversion.

We can now begin to see more clearly how disease processes occurring in the infected individual shape what happens at the flock, or epidemiological, level. It seems that, after a sheep becomes infected, the disease slowly progresses to the point when there is sufficient free virus to be recognised by the immune system, but then further development of inflammatory lesions in the lungs has to take place before there are sufficient quantities of infective alveolar macrophages and free virus for transmission by the respiratory route. There follows a further delay, perhaps of some years, before the disease has advanced to the stage at which symptoms such as chronic pneumonia and hardening of the udder become apparent. 

Infectiousness is expected to be a function of viral load, and although we do not know the timecourse of viral load, it seems most likely that it continues to increase throughout the development of chronic disease. This suggests to us that the infectiousness of an individual is not constant, but is likely to increase as the disease progresses and symptoms emerge. 

We are interested in learning how infectiousness changes over the course of an individual’s infection because of the implications at the epidemiological level. Time delays in seroconversion merely make the disease more difficult to detect and control, but the epidemiological significance of a time delay in the development of infectiousness is that it acts to slow the spread of the virus. And if ewes with long-standing infections are the most infectious, they pose the greatest risk to uninfected sheep. This would present an opportunity for the management of exposure to slow the spread of disease. For example, if ewes in their last year of production are the most infectious, then young ewes should be kept away from them when housed – an idea supported by preliminary analysis using individual-based modelling (IBM – see Box 2). Separation of younger animals from older ones may reduce the prevalence of infection to the point where the costs of disease, in terms of lost production and poor welfare, are not overwhelming or at least are less than the cost of attempting to eliminate the disease – we discuss this later in this blog.

So far, there is only very limited and tentative evidence of increasing infectiousness in the literature, and direct experimental evidence would be very hard to come by. But it is plausible that disease severity, viral load and impaired productivity are all related to the extent of inflammatory lesions in the lungs. This suggests that measurably-impaired productivity in infected sheep could be used as a proxy for viral load, and hence infectiousness. And that brings us to our current project.

2            Individual-based modelling

This is a technique to explore the consequences of probabilistic events, such as becoming infected by SRLV. The flock of ewes is modelled as individuals, and their progress through life is followed. Flock replacements are taken from the ewe lambs born to the flock; all other lambs being sold.

alt text

The figure shows the mean results (green line) of 1000 iterations of a stochastic simulation of SRLV prevalence in a flock of 400 ewes housed in groups of 100 for one month per year. The probability that an infected ewe will transmit the virus is modelled as rising exponentially with time since infection. The management regime we modelled was to segregate ewes during housing into each of their four age groups (2, 3, 4 and 5 years old) in separate pens, and to sell all the lambs of the oldest ewes, rather than retain any as flock replacements. From an initial starting prevalence of 275 infected ewes, the virus is virtually eliminated from the flock.

Using Genstat to model the cost of MV

Eliminating SRLV from an infected flock involves either repeated testing of the whole flock, culling reactors and perhaps also artificially rearing lambs removed at birth, or entirely replacing the flock with accredited disease-free stock. So, the cost of eliminating the virus from a flock can be huge. But what about the costs of living with it? These costs arise from poor welfare leading to lost production: lactation failure, reduced lamb growth and excess lamb and ewe mortality. But under what conditions are they so substantial as to warrant an elimination strategy? That depends, again, on answers at two levels: what are the production losses for each infected ewe, and how prevalent is the disease in the flock?

We have a number of reasons to want to quantify how the costs of being SRLV+ vary over the time-course of the disease. First, it is reasonable to assume that production losses will be related to the emergence of symptoms in affected sheep, but this has never been adequately quantified. Second, if production losses are a function of the duration of infection, and we can regard them as a proxy for viral load, then it would support the idea that infectiousness also increases as the disease progresses. And third, if production losses are only apparent in sheep with long-standing infections, which is therefore restricted to older sheep, then management could focus on early detection of symptoms and culling of older ewes. 

We are quantifying these processes using a large dataset from colleagues at the Lublin University of Life Sciences. Their six-year study was designed to assess the response of production parameters to SRLV infection in a flock of breeding sheep kept under standard Polish husbandry conditions. They published results suggesting that infection with SRLV was associated with higher rates of age-specific ewe mortality, especially in older ewes. 

The data comprise lambing records for about 800 ewes from three breeds, with over 300 ewes being present each year and a few being present every year. There are also records from about 2800 lambs born during the trial. Ewes were blood-tested in November and June each year, and all SRLV+ ewes were housed together following the November test until the lambs were weaned in April. SRLV- ewes were housed in the same shed, but segregated from the SRLV+ group. We were able to group the ewes on the basis of the series of blood test results as: (1) seroconverted before the trial began, (2) had not seroconverted by the end, and (3) seroconverted during the trial and for whom a time since seroconversion can be estimated to within about six months.

Given the nature of the data - unbalanced design, multiple observations from individual ewes and rams over several years, and different breeds – we used Genstat to fit mixed models to distinguish random and fixed effects. We were given access to Genstat release 22 beta, which adds greater functionality for displaying and saving output, producing predictions and visualising the fit of the model.

The example below addresses pre-weaning lamb mortality (mort, either 0 or 1). We are using a generalized linear mixed model where significant fixed terms were added stepwise. The ewes and rams used to produce these lambs are obvious random terms because they can be regarded as being drawn at random from a large population. There also appears to be a strong ewe.ram interaction, with some combinations faring differently from others. We included ‘year’ as a random term because, over the six years in which data were collected, factors such as flock size and forage quality varied somewhat randomly.

The fixed terms show that the probability of mortality is strongly affected by lamb birthweight (lambbirthwt). A quadratic term (lb2) models the well-known reduction in lamb survival in very large lambs - a consequence of birth difficulties. The age of the ewe, fitted as a factor (eweageF), is the next most significant fixed effect, followed by the SRLV status of the ewe tested in November prior to lambing (ewetestNov). The interaction term of ewe age with SRLV status is highly significant, showing that the way the ageing process in ewes affects the probability of their lambs’ mortality differs according to SRLV status. From the table of back-transformed means, we see that the probability of lamb mortality ranges between about 0.02 to 0.04 in SRLV- ewes aged from 2 to 5 years, perhaps declining in older ewes. SRLV+ ewes show similar lamb mortality in ages 2-4, but a progressive increase as ewes age further, almost doubling each year.

This preliminary analysis provides some evidence that the costs of being infected by SRLV are, indeed, progressive with age. There is some way to go yet to show whether sheep with longer-standing SRLV infection have higher viral loads and are more infectious, but our current research does point to a way to potential better disease control by targeting older animals. Maedi-visna isn’t called a progressive disease for anything, and we should be able to exploit that.

Example output from Genstat

alt text

We finally submitted our paper for publication in November 2019, just before the Covid 19 pandemic. One might have thought that a paper on the epidemiology and control of a respiratory virus spread by aerosol, especially indoors in close proximity and with a recommendation for social distancing, would have seemed quite topical. Ironically, early 2020 saw the subeditors and reviewers of such work all being urgently re-allocated to analysing information on the burgeoning pandemic. But we got there eventually and by October we were proudly writing a press release recommending social distancing … in sheep.

Reflecting on Genstat

Andrew Illius writes, “My experience of Genstat dates back to the early 1980s when I think Release 3 was current. It was hard going, and we queued up to have our fault codes diagnosed at the Genstat Clinic. But having learnt Fortran programming in the punched cards era, I was used to it taking several days to get a job to run. Genstat’s exacting requirements were reassuring and it became indispensable over the following years of agricultural and ecological research. By the 1990s we had learnt that mixed models were required to account separately for random and fixed effects in unbalanced data, and I’d been on a REML course. I was especially proud to use REML as my main analytical procedure thereafter because Robin Thompson invented it just down the corridor where we work in the Zoology building at Edinburgh University, and where he worked with the Animal Breeding group. It’s been a tremendous pleasure to get back to Genstat recently after many years away – like greeting an old friend. In the past, I wrote and submitted batch jobs on a UNIX mainframe before collecting some line-printer output on my way home. Now things have really speeded up, with the menu-driven environment of the Windows version. It’s a fantastic improvement, and a pleasure to use.”

About the authors

Andrew Illius is Emeritus Prof of Animal Ecology in the Institute of Evolutionary Biology, University of Edinburgh, where he taught animal production and animal ecology from 1978 to 2008 and was latterly Head of the School of Biological Sciences. Most of his work has been on the ecology and management of grazing systems and the ecophysiology and behaviour of grazing animals. He retired in 2008 to spend more time with his sheep, keeping about 400 breeding ewes. Familiarity with sheep diseases led to collaboration with Nick Savill since 2018 on the epidemiology and control of MV.

Nick Savill is a Senior Lecturer at the Institute of Immunology and Infection Research, University of Edinburgh. He teaches a range of quantitative skills to undergraduate biological science students including maths, stats, data analysis and coding. His research interests are in mathematical modelling of infectious disease epidemiology. He has worked on foot and mouth disease, avian influenza, malaria, trypanosomiasis and, most recently, maedi-visna with Andrew Illius.


Illius AW, Lievaart-Peterson K, McNeilly TN, Savill NJ (2020) Epidemiology and control of maedi-visna virus: Curing the flock. PLoS ONE 15 (9): e0238781.


Kanchana Punyawaew

01 March 2021

Linear mixed models: a balanced lattice square

This blog illustrates how to analyze data from a field experiment with a balanced lattice square design using linear mixed models. We’ll consider two models: the balanced lattice square model and a spatial model.

The example data are from a field experiment conducted at Slate Hall Farm, UK, in 1976 (Gilmour et al., 1995). The experiment was set up to compare the performance of 25 varieties of barley and was designed as a balanced lattice square with six replicates laid out in a 10 x 15 rectangular grid. Each replicate contained exactly one plot for every variety. The variety grown in each plot, and the coding of the replicates and lattice blocks, is shown in the field layout below:

alt text

There are seven columns in the data frame: five blocking factors (Rep, RowRep, ColRep, Row, Column), one treatment factor, Variety, and the response variate, yield.

alt text

The six replicates are numbered from 1 to 6 (Rep). The lattice block numbering is coded within replicates. That is, within each replicates the lattice rows (RowRep) and lattice columns (ColRep) are both numbered from 1 to 5. The Row and Column factors define the row and column positions within the field (rather than within each replicate).

Analysis of a balanced lattice square design

To analyze the response variable, yield, we need to identify the two basic components of the experiment: the treatment structure and the blocking (or design) structure. The treatment structure consists of the set of treatments, or treatment combinations, selected to study or to compare. In our example, there is one treatment factor with 25 levels, Variety (i.e. the 25 different varieties of barley). The blocking structure of replicates (Rep), lattice rows within replicates (Rep:RowRep), and lattice columns within replicates (Rep:ColRep) reflects the balanced lattice square design. In a mixed model analysis, the treatment factors are (usually) fitted as fixed effects and the blocking factors as random.

The balanced lattice square model is fitted in ASReml-R4 using the following code:

> lattice.asr <- asreml(fixed = yield ~ Variety,
                        random = ~ Rep + Rep:RowRep + Rep:ColRep,

The REML log-likelihood is -707.786.

The model’s BIC is:

alt text

The estimated variance components are:

alt text

The table above contains the estimated variance components for all terms in the random model. The variance component measures the inherent variability of the term, over and above the variability of the sub-units of which it is composed. The variance components for Rep, Rep:RowRep and Rep:ColRep are estimated as 4263, 15596, and 14813, respectively. As is typical, the largest unit (replicate) is more variable than its sub-units (lattice rows and columns within replicates). The "units!R" component is the residual variance.

By default, fixed effects in ASReml-R4 are tested using sequential Wald tests:

alt text

In this example, there are two terms in the summary table: the overall mean, (Intercept), and Variety. As the tests are sequential, the effect of the Variety is assessed by calculating the change in sums of squares between the two models (Intercept)+Variety and (Intercept). The p-value (Pr(Chisq)) of  < 2.2 x 10-16 indicates that Variety is a highly significant.

The predicted means for the Variety can be obtained using the predict() function. The standard error of the difference between any pair of variety means is 62. Note: all variety means have the same standard error as the design is balanced.

alt text

Note: the same analysis is obtained when the random model is redefined as replicates (Rep), rows within replicates (Rep:Row) and columns within replicates (Rep:Column).

Spatial analysis of a field experiment

As the plots are laid out in a grid, the data can also be analyzed using a spatial model. We’ll illustrate spatial analysis by fitting a model with a separable first order autoregressive process in the field row (Row) and field column (Column) directions. This is often a useful model to start the spatial modeling process.

The separable first order autoregressive spatial model is fitted in ASReml-R4 using the following code:

> spatial.asr <- asreml(fixed = yield ~ Variety,
                        residual = ~ar1(Row):ar1(Column),
                        data = data1)

The BIC for this spatial model is:

alt text

The estimated variance components and sequential Wald tests are:

alt text

alt text

The residual variance is 38713, the estimated row correlation is 0.458, and the estimated column correlation is 0.684. As for the balanced lattice square model, there is strong evidence of a Variety effect (p-value < 2.2 x 10-16).

A log-likelihood ratio test cannot be used to compare the balanced lattice square model with the spatial models, as the variance models are not nested. However, the two models can be compared using BIC. As the spatial model has a smaller BIC (1415) than the balanced lattice square model (1435), of the two models explored in this blog, it is chosen as the preferred model. However, selecting the optimal spatial model can be difficult. The current spatial model can be extended by including measurement error (or nugget effect) or revised by selecting a different variance model for the spatial effects.


Butler, D.G., Cullis, B.R., Gilmour, A. R., Gogel, B.G. and Thompson, R. (2017). ASReml-R Reference Manual Version 4. VSN International Ltd, Hemel Hempstead, HP2 4TP UK.

Gilmour, A.R., Thompson, R. & Cullis, B.R. (1995). Average Information REML, an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51, 1440-1450.


Dr. Vanessa Cave

13 December 2021

ANOVA, LM, LMM, GLM, GLMM, HGLM? Which statistical method should I use?

Unsure which statistical method is appropriate for your data set? Want to know how the different methods relate to each one another? 

The simple diagram below may help you.

alt text

Treatment factorCategorical explanatory variable defining the treatment groups. In an experiment, the experimental units are randomly assigned to the different treatment groups (i.e., the levels of the treatment factor).
Blocking variableFactor created during the design of the experiment whereby the experimental units are arranged in groups (i.e., blocks) that are similar to one another. You can learn more about blocking in the blog Using blocking to improve precision and avoid bias.
Continuous predictorA numeric explanatory variable (x) used to predict changes in a response variable (y). Check out the blog Pearson correlation vs simple linear regression to learn more.
Unbalanced designAn experimental design is unbalanced if there are unequal sample sizes for the different treatments. Genstat provides users with a tool to automatically determine whether ANOVA, LM (i.e., regression) or LMM (i.e., a REML analysis) is most appropriate for a given data set. Watch this YouTube video  to learn more.
Temporal correlationOccurs when repeated measurements have been taken on the same experimental unit over time, and thus measurements closer in time are more similar to one another than those further apart. To learn more, check out our blog A brief introduction to modelling the correlation structure of repeated measures data.
Spatial correlationOccurs when experimental units are laid out in a grid, for example in a field trial or greenhouse, and experimental units that are closer together experience more similar environmental conditions than those which are further apart. For more information, read our blog A brief look at spatial modelling.
Random effectsRepresents the effect of a sample of conditions observed from some wider population, and it is the variability of the population that is of interest. The blog FAQ: Is it a fixed or random effect? can help you understand the difference between fixed and random effects.

About the author

Dr Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems.  Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.

Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrew.