Dr. Salvador A. Gezan

a month ago

I recently read a very interesting opinion piece published in The Guardian. The author discusses the impact that the **Human Genome Project** (HGP) has had in the 20 years since the first draft of the human genome was published. Of course, this has been a great accomplishment, and today it is possible to have whole genome sequencing done in less than one week and for a fraction of the original cost. Full genomes are now available for many more animal and plant species. These constitute great scientific and technological advancements, and one cannot help wondering what will be possible 20 years from now!

Going back to that article, the author raises a critical point about the HGP, which I quote below:

“The HGP has huge potential benefits for medicine and our understanding of human diversity and origins. But a blizzard of misleading rhetoric surrounded the project, contributing to the widespread and sometimes dangerous misunderstandings about genes that now bedevils the genomic age.” Philip Ball

This project has been surrounded by lots of media attention and, as with many scientific communications, one of the things that concerns me is the future benefits that were promised for this sequencing project. I am not going into detail about all the promises made (see the original article for more details), but what is alarming is that this genome was sold as a ‘_book of instructions_’ and ‘_nature’s complete genetic blueprint for building a human being_’.

“Misleading rhetoric has fuelled the belief that our genetic code is an ‘instruction book’ – but it’s far more interesting than that…” Philip Ball

It is true that genomic information is critical for understanding things such as gene diversity, predisposition to disease and deleterious mutations. Moreover, nowadays genomic information is used to make genomic predictions, for example **polygenic risk scores** for humans and **genomic breeding values** for plants and animals. But the big fallacy is that a genotype is a ‘**blueprint**’. Here I disagree, as a genotype without a phenotype does not go very far!

All achievements in genomics, current and future, come from a close connection between **phenotypic** data and **genotypic** data. For example, genome-wide association studies (GWAS) used to find markers for the early detection of cancer rely on having thousands of individuals (or samples) identified at the different stages of the disease. Likewise, for vegetables, identifying SNP markers for increased supermarket shelf-life requires phenotypic data recorded as the produce ages.
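To make the dependence on phenotypes concrete, here is a minimal sketch of a single-marker association scan on simulated data. It is not a real GWAS pipeline: the sample size, allele frequency, effect size and the index of the "causal" marker are all hypothetical values chosen for illustration, and a plain least-squares slope stands in for the mixed models used in practice.

```python
import random

rng = random.Random(42)

N_INDIVIDUALS = 500
N_MARKERS = 20
CAUSAL_MARKER = 7      # hypothetical: the only marker given a real effect
ALLELE_FREQ = 0.3      # hypothetical minor-allele frequency


def simulate_genotype():
    """Number of minor-allele copies (0, 1 or 2) at one marker."""
    return sum(rng.random() < ALLELE_FREQ for _ in range(2))


# Simulated genotype matrix: one row per individual, one column per marker.
genotypes = [[simulate_genotype() for _ in range(N_MARKERS)]
             for _ in range(N_INDIVIDUALS)]

# The phenotype is what links markers to the trait: marker effect plus noise.
phenotypes = [1.0 * row[CAUSAL_MARKER] + rng.gauss(0.0, 1.0) for row in genotypes]


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Regress the phenotype on each marker in turn and keep the estimated effects.
slopes = [slope([row[j] for row in genotypes], phenotypes) for j in range(N_MARKERS)]
best_marker = max(range(N_MARKERS), key=lambda j: abs(slopes[j]))
print("marker with the largest estimated effect:", best_marker)
```

Every step after the simulation needs the `phenotypes` list; with the genotype matrix alone there is nothing to regress on, which is the point of the paragraph above.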

In my view, the main fallacy in many over-promising genomic projects, including the HGP, is the belief that genomics is all you need, reflecting a lack of understanding of how critical phenotypic data is. I have even encountered, among breeding managers, the statement that ‘_phenotypic data and field testing are no longer required if we have genomic data_’! Genes alone are thousands of small pieces of information, and there are so many complex aspects to consider, such as genes interacting with the environment and other high-order interactions at the gene level (such as dominance and epistasis), that can only be identified and understood with the use of complex computational tools paired with phenotypic data on each genotype.

“Life is not a readout of genes – it’s a far more interesting, subtle and contingent process than that.” Philip Ball

Genomic data will stay with us for a long time. It has become, and will continue to become, cheaper to obtain, and at some point it will be treated as a commodity. But good records that evaluate hundreds, thousands or even millions of unique individuals are expensive and slow to obtain. To highlight a couple of things about this: firstly, increasing the precision of phenotypic data will increase heritability; larger heritability values translate into better models, and therefore a higher chance of finding the true causal marker of a disease. Secondly, many breeding programs have large quantities of historical data. Often, for this data, it is easier today to invest in genotyping than in phenotyping, especially if DNA samples have been stored (as with semen or milt in animal breeding), or if DNA is directly available from field trials (as with forest breeding programs); in these cases, the investment in phenotyping has already been made!
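The first point can be illustrated with a small worked example. It assumes the textbook definition of heritability as the ratio of genetic variance to total (genetic plus error) variance, which is not spelled out in this post, and the variance values are hypothetical.

```python
def heritability(var_genetic, var_error):
    """Heritability as genetic variance over total phenotypic variance."""
    return var_genetic / (var_genetic + var_error)


VAR_GENETIC = 4.0  # hypothetical genetic variance, held fixed

# More precise phenotyping means a smaller error variance for the same trait.
for var_error in (16.0, 8.0, 4.0):
    h2 = heritability(VAR_GENETIC, var_error)
    print(f"error variance {var_error:4.1f} -> heritability {h2:.3f}")
```

With the genetic variance held fixed, halving the error variance twice moves heritability from 0.2 to 0.5, so the same genetic signal becomes much easier to detect.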

Interestingly, the statistical tools that focus on phenotypic data are not as ‘sexy’ as the genotyping tools. Here we talk about boring aspects such as replication, blocking and randomization, and then regression analysis, linear (mixed) models, logistic regression, etc. But all these tools are well known and understood, and there is no excuse to ignore our statistical heritage.

Furthermore, statistical tools such as ASReml-R or Genstat are critical for understanding the relationship between genotype and phenotype. In a project such as the HGP, only some doors were opened 20 years ago. As we collect more and more information, there will be many statistical (and computational) challenges, and we will need to develop new techniques that will make all of those promises from the HGP possible. These must always be closely connected to good phenotypic data; otherwise, this will be a waste of our time!

Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world.

Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook “_Statistical Methods in Biology: Design and Analysis of Experiments and Regression_”.

Related Reads

The VSNi Team

3 months ago

A way to decide whether to reject the null hypothesis (H0) in favour of our alternative hypothesis (H1) is to determine the probability of obtaining a test statistic at least as extreme as the one observed, under the assumption that H0 is true. This probability is referred to as the “p-value”. It plays an important role in statistics and is critical in most biological research.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/blog_p_value_7e04a8f8c5.png)

#### **What is the true meaning of a p-value and how should it be used?**

P-values are a continuum (between 0 and 1) that provide a measure of the **strength of evidence** against H0. For example, a p-value of 0.066 indicates that, if H0 were true, there would be a 6.6% probability of observing a test statistic as large as or larger than the one we calculated. Note that this p-value is NOT the probability that our alternative hypothesis is correct; it is only a measure of how likely or unlikely we are to observe such extreme events, under repeated sampling, in reference to our calculated value. Also note that this p-value is obtained based on an assumed distribution (e.g., t-distribution for a t-test); hence, the p-value will depend strongly on your (correct or incorrect) assumptions.

The smaller the p-value, the stronger the evidence for rejecting H0. However, it is difficult to determine what a small value really is. This leads to the typical guidelines of p \< 0.001 indicating very strong evidence against H0, p \< 0.01 strong evidence, p \< 0.05 moderate evidence, p \< 0.1 weak evidence or a trend, and p ≥ 0.1 indicating insufficient evidence \[1\], and to a strong debate on what this threshold should be. But declaring p-values as being either significant or non-significant based on an arbitrary cut-off (e.g. 0.05 or 5%) should be avoided.
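As a rough illustration of the definition and the guideline bands above, the sketch below computes a two-sided p-value for a test statistic and attaches the corresponding evidence label. To stay within the Python standard library it assumes a standard normal reference distribution (a z-test) rather than the t-distribution mentioned above; the function names and the example statistic of 1.84 are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability, under H0, of a statistic at least as extreme as z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def evidence_label(p):
    """Map a p-value onto the guideline bands quoted in the text."""
    if p < 0.001:
        return "very strong evidence against H0"
    if p < 0.01:
        return "strong evidence against H0"
    if p < 0.05:
        return "moderate evidence against H0"
    if p < 0.1:
        return "weak evidence or a trend"
    return "insufficient evidence"


p = two_sided_p_value(1.84)  # hypothetical observed test statistic
print(round(p, 3), "->", evidence_label(p))
```

The label is a reading aid for a continuous quantity, not a verdict, and a different assumed distribution would give a different p-value for the same statistic.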
As [Ronald Fisher](https://mathshistory.st-andrews.ac.uk/Biographies/Fisher/) said: “No scientific worker has a fixed level of significance at which, from year to year, and in all circumstances he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” \[2\].

A very important aspect of the p-value is that it **does not** provide any evidence in support of H0 – it only quantifies evidence against H0. That is, a large p-value does not mean we can accept H0. Take care not to fall into the trap of accepting H0! Similarly, a small p-value tells you that rejecting H0 is plausible, and not that H1 is correct!

For useful conclusions to be drawn from a statistical analysis, p-values should be considered alongside the **size of the effect**. Confidence intervals are commonly used to describe the size of the effect and the precision of its estimate. Crucially, statistical significance does not necessarily imply practical (or biological) significance. Small p-values can come from a large sample and a small effect, or a small sample and a large effect.

It is also important to understand that the size of a p-value depends critically on the sample size (as this affects the shape of our distribution). With a very large sample size, H0 may always be rejected, even when the differences are extremely small and H0 is nearly (i.e., approximately) true. Conversely, with a very small sample size, it may be nearly impossible to reject H0 even if we observe extremely large differences. Hence, p-values also need to be interpreted in relation to the size of the study.

#### References

\[1\] Ganesh H. and V. Cave. 2018. _P-values, P-values everywhere!_ New Zealand Veterinary Journal. 66(2): 55-56.

\[2\] Fisher RA. 1956. _Statistical Methods and Scientific Inference_. Oliver and Boyd, Edinburgh, UK.
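The dependence of the p-value on sample size discussed in this post can be sketched numerically. The example assumes a two-group comparison of means with a known common standard deviation and a normal reference distribution; the differences, standard deviation and group sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def p_value_two_means(difference, sd, n_per_group):
    """Two-sided p-value for a difference between two group means (known sd)."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = difference / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same tiny difference (0.1 standard deviations) at two sample sizes.
tiny_effect_small_sample = p_value_two_means(0.1, 1.0, 20)
tiny_effect_huge_sample = p_value_two_means(0.1, 1.0, 100_000)

# A large difference (1.5 standard deviations) with only 3 units per group.
large_effect_tiny_sample = p_value_two_means(1.5, 1.0, 3)

print(tiny_effect_small_sample, tiny_effect_huge_sample, large_effect_tiny_sample)
```

The tiny difference is overwhelmingly "significant" with 100,000 units per group, while the large difference with 3 units per group does not reach the conventional 0.05 level, which is why the p-value needs to be read alongside the effect size and the size of the study.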

The VSNi Team

3 months ago

It is widely acknowledged that the most fundamental developments in statistics in the past 60+ years have been driven by information technology (IT). We should not underestimate the importance of pen and paper as a form of IT, but it is since people started using computers for statistical analysis that the role statistics plays in our research, as well as in everyday life, has really changed.

In this blog we will give a brief historical overview, presenting some of the main general statistics software packages developed from 1957 onwards. Statistical software developed for special purposes will be ignored. We also ignore the most widely used ‘software for statistics’, as Brian Ripley (2002) stated in his famous quote: “Let’s not kid ourselves: the most widely used piece of software for statistics is Excel.” Our focus is on some of the packages developed by statisticians for statisticians, which are still evolving to incorporate the latest developments in statistics.

### **Ronald Fisher’s Calculating Machines**

Pioneer statisticians like [Ronald Fisher](https://www.britannica.com/biography/Ronald-Aylmer-Fisher) started out doing their statistics on pieces of paper and later upgraded to using calculating machines. Fisher bought the first Millionaire calculating machine when he was heading Rothamsted Research’s statistics department in the early 1920s. It cost about £200 at that time, which is equivalent in purchasing power to about £9,141 in 2020. This mechanical calculator could only calculate direct products, but it was very helpful for the statisticians of that time; as Fisher mentioned: "Most of my statistics has been learned on the machine." The calculator was heavily used by Fisher’s successor [Frank Yates](https://mathshistory.st-andrews.ac.uk/Biographies/Yates/) (Head of Department 1933-1968) and contributed to much of Yates’ research, such as designs with confounding between treatment interactions and blocks, or split plots, or quasi-factorials.
![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/Frank_Yates_c50a5fbf55.jpg)

_Frank Yates_

Rothamsted Annual Report for 1952: "The analytical work has again involved a very considerable computing effort."

### **Beginning of the Computer Age**

From the early 1950s we entered the computer age. The computer at this time looked little like its modern counterpart, whether it was an Elliott 401 from the UK or an IBM 700/7000 series in the US. Although the first documented statistical package, BMDP, was developed starting in 1957 for IBM mainframes at the UCLA Health Computing Facility, on the other side of the Atlantic Ocean statisticians at [Rothamsted Research](https://www.rothamsted.ac.uk/) had begun their endeavours to program on an Elliott 401 in 1954.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/Elliott_NRDC_401_computer_b39fd1bbe3.jpg)

### **Programming Statistical Software**

When we teach statistics in schools or universities, students very often complain about the difficulties of programming. Looking back at programming in the 1950s will give modern students an appreciation of how easy programming today actually is! An Elliott 401 served one user at a time and required all input on paper tape (forget your keyboard and intelligent IDE editor). It sent its output to an electric typewriter. All programming had to be in machine code, with the instructions and data held on a rotating disk with 32-bit word length, 5 "words" of fast-access store, 7 intermediate access tracks of 128 words, and 16 further tracks selectable one at a time (= 2949 words – 128 for system).
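The bracketed total can be checked with a few lines of arithmetic. The sketch assumes that the 16 selectable tracks also hold 128 words each, which the text does not state explicitly but which reproduces the quoted figure; the variable names are illustrative.

```python
WORDS_PER_TRACK = 128       # stated for the intermediate tracks, assumed for the rest

fast_access_words = 5
intermediate_tracks = 7
selectable_tracks = 16

total_words = fast_access_words + (intermediate_tracks + selectable_tracks) * WORDS_PER_TRACK
available_to_user = total_words - WORDS_PER_TRACK   # 128 words reserved for the system

print(total_words, available_to_user)
```

Under that assumption the whole store comes to 2949 words of 32 bits, of which 2821 were left for the programmer's instructions and data.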
![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/computer_paper_tape_99626ba274.jpg)

_Computer paper tape_

Programs written in this period included fitting constants to main effects and interactions in multi-way tables (1957), regression and multiple regression (1956), and fitting many standard curves as well as multivariate analysis for latent roots and vectors (1955).

Although the emergence of statistical programs for research sounds very promising, routine statistical analyses also had to be performed, and these still represented a big challenge, at least computationally. For example, in 1963, which was the last year with the [Elliott 401](https://www.ithistory.org/db/hardware/elliott-brothers-london-ltd/elliott-401) and [Elliott 402](https://www.ithistory.org/db/hardware/elliott-brothers-london-ltd/elliott-402) computers, Rothamsted Research statisticians analysed 14,357 data variables, which took 4,731 hours. It is hard to imagine the energy consumption as well as the amount of paper tape used for programming. Probably the paper tape (all glued together) would be long enough to circle the equator.

### **Development of Statistical Software: Genstat, SAS, SPSS**

The above collection of programs was mainly used for agricultural research at Rothamsted and was not given an umbrella name until John Nelder became Head of the Statistics Department in 1968. The development of Genstat (General Statistics) started from that year and the programming was done in FORTRAN, initially on an IBM machine. In that same year, at North Carolina State University, SAS (Statistical Analysis System) was developed by computational statisticians, also for analysing agricultural data to improve crop yields. At around the same time, social scientists at the University of Chicago started to develop SPSS (Statistical Package for the Social Sciences).
Although the three packages (Genstat, SAS and SPSS) were developed for different purposes and their functions diverged somewhat later, the basic functions covered similar statistical methodologies.

The first version of SPSS was released in 1968. In 1970, the first version of Genstat was released with the functions of ANOVA, regression, principal components and principal coordinate analysis, single-linkage cluster analysis and general calculations on vectors, matrices and tables. The first version of SAS, SAS 71, was named after the year of its release. The early versions of all three software packages were written in FORTRAN and designed for mainframe computers.

From the 1980s, with the breakthrough of personal computers, a second generation of statistical software began to emerge. An MS-DOS version of Genstat (Genstat 4.03) was released with an interactive command line interface in 1980.

![alt text](https://web-global-media-storage-production.s3.eu-west-2.amazonaws.com/MSDOS_Genstat_4_03_619aab193a.jpg)

_Genstat 4.03 for MSDOS_

Around 1985, SAS and SPSS also released versions for personal computers. In the 1980s more players entered this market: STATA was developed from 1985 and JMP was developed from 1989. JMP was, from the very beginning, built for Macintosh computers. As a consequence, JMP had a strong focus on visualization and graphics from its inception.

### **The Rise of the Statistical Language R**

The development of the third generation of statistical computing systems had started before the emergence of software like Genstat 4.03e or SAS 6.01. This development was led by John Chambers and his group at Bell Laboratories from the 1970s. The outcome of their work is the S language. It was developed into a general purpose language with implementations for classical as well as modern statistical inference. The S language was freely available, and its audience was mainly sophisticated academic users.
After the acquisition of the S language by the Insightful Corporation and its rebranding as S-PLUS, this leading third generation statistical software package was widely used in both theoretical and practical statistics in the 1990s, especially before the release of a stable beta version of the free and open-source software R in the year 2000. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently widely used by statisticians in academia and industry, together with statistical software developers, data miners and data analysts.

Software like Genstat, SAS, SPSS and many other packages had to deal with the challenge from R. Each of these long-standing software packages developed an R interface, or even R interpreters, to anticipate the change in user behaviour and the ever-increasing adoption of the R computing environment. For example, SAS and SPSS each have R plug-ins that let the package and R talk to each other. VSNi’s ASReml-R software was developed for ASReml users who want to run mixed model analysis within the R environment, and at the present time there are more ASReml-R users than ASReml standalone users. Users who need reliable and robust mixed effects model fitting adopted ASReml-R as an alternative to other mixed model R packages due to its superior performance and simplified syntax. For Genstat users, msanova was also developed as an R package to provide traditional ANOVA users with an R interface to run their analysis.

### **What’s Next?**

We have no clear idea about what will represent the fourth generation of statistical software. R, as open-source software and a platform for prototyping and teaching, has the potential to help this change in statistical innovation. An example is the R Shiny package, with which web applications can be easily developed to provide statistical computing as online services.
But all open-source and commercial software has to face the same challenges of providing fast, reliable and robust statistical analyses that allow for reproducibility of research and, most importantly, use sound and correct statistical inference and theory, something that Ronald Fisher would have expected from his computing machine!

Dr. John Rogers

4 months ago

Earlier this year I had an enquiry from Carey Langley of VSNi as to why I had not renewed my Genstat licence. The truth was simple – I had decided to fully retire after 50 years as an agricultural entomologist / applied biologist / consultant. This prompted some reflections about the evolution of bioscience data analysis that I have experienced over that half century, a period during which most of my focus was the interaction between insects and their plant hosts: both how insect feeding impacts on plant growth and crop yield, and how plants impact on the development of the insects that feed on them and on their natural enemies.

### Where it began – paper and post

My journey into bioscience data analysis started with undergraduate courses in biometry – yes, it was an agriculture faculty, so it was biometry not statistics. We started doing statistical analyses using full keyboard Monroe calculators (for those of you who don’t know what I am talking about, you can find them [here](http://www.johnwolff.id.au/calculators/Monroe/Monroe.htm)). It was a simpler time and as undergraduates we thought it was hugely funny to divide 1 by 0 until the blue smoke came out…

After leaving university in the early 1970s, I started working for the Agriculture Department of an Australian state government, at a small country research station. Statistical analysis was rudimentary to say the least. If you were motivated, there was always the option of running analyses yourself by hand, given the appearance of the first scientific calculators in the early 1970s. If you wanted a formal statistical analysis of your data, you would mail off a paper copy of the raw data to Biometry Branch… and wait. Some months later, you would get back your ANOVA, regression, or whatever the biometrician thought appropriate to do, on paper, with some indication of which treatments were different from which other treatments.
Dose-mortality data was dealt with by manually plotting data onto probit paper.

### Enter the mainframe

In-house ANOVA programs running on central mainframes were a step forward some years later, as they at least enabled us to run our own analyses, as long as you wanted to do an ANOVA… However, this also required a two-hour drive to the nearest card reader, with the actual computer a further 1000 kilometres away…

The first desktop computer I used for statistical analysis was in the early 1980s and was a CP/M machine with two 8-inch floppy discs and, I think, 256k of memory. Booting it required turning a key and pressing the blue button - yes, really! And about the same time, the local agricultural economist drove us crazy extolling the virtues of a program called Lotus 1-2-3!

Having been brought up on a solid diet of the classic texts such as Steel and Torrie, Cochran and Cox, and Sokal and Rohlf, the primary frustration during this period was not having ready access to the statistical analyses you knew were appropriate for your data. Typical modes of operating for agricultural scientists in that era were randomised blocks of various degrees of complexity, thus the emphasis on ANOVA in the software that was available in-house. Those of us who also had less-structured ecological data were less well catered for.

My first access to a comprehensive statistics package was during the early to mid-1980s at one of the American Land Grant universities. It was a revelation to be able to run virtually whatever statistical test was deemed necessary. Access to non-linear regression was a definite plus, given the non-linear nature of many biological responses. As well, being able to run a series of models to test specific hypotheses opened up new options for more elegant and insightful analyses. Looking back from 2021, such things look very trivial, but compared to where we came from in the 1970s, they were significant steps forward.
### Enter Genstat

My first exposure to Genstat, VSNi’s stalwart statistical software package, was Genstat for Windows, Third Edition (1997). Simple things like the availability of residual plots made a difference for us entomologists, given that much of our data had non-normal errors; it took the guesswork out of whether and what transformations to use. The availability of regressions with grouped data also opened some previously closed doors.

After a deviation away from hands-on research, I came back to biological-data analysis in the mid-2000s and found myself working with repeated-measures and survival / mortality data, so ventured into repeated-measures restricted maximum likelihood analyses and generalised linear mixed models for the first time (with assistance from a couple of Roger Payne’s training courses in Hobart and Queenstown). Looking back, it is interesting how quickly I became blasé about such computationally intensive analyses that would run in seconds on my laptop or desktop, forgetting that I was doing ANOVAs by hand 40 years earlier when John Nelder was developing generalised linear models. How the world has changed!

### Partnership and support

Of importance to my Genstat experience was the level of support that was available to me as a Genstat licensee. Over the last 15 years or so, as I attempted some of these more complex analyses, my aspirations were somewhat ahead of my abilities, and it was always reassuring to know that Genstat Support was only ever an email away. A couple of examples will flesh this out.

Back in 2008, I was working on the relationship between insect-pest density and crop yield using R2LINES, but had extra linear X’s related to plant vigour in addition to the measure of pest infestation. A support-enquiry email produced an overnight response from Roger Payne that basically said, “Try this”. While I slept, Roger had written an extension to R2LINES to incorporate extra linear X’s.
This was later incorporated into the regular releases of Genstat. This work led to the clearer specification of the pest densities that warranted chemical control in soybeans and dry beans ([https://doi.org/10.1016/j.cropro.2009.08.016](https://doi.org/10.1016/j.cropro.2009.08.016) and [https://doi.org/10.1016/j.cropro.2009.08.015](https://doi.org/10.1016/j.cropro.2009.08.015)).

More recently, I was attempting to disentangle the effects on caterpillar mortality of the two Cry insecticidal proteins in transgenic cotton and, while I got close, I would not have got the analysis to run properly without Roger’s support. The data was scant in the bottom half of the overall dose-response curves for both Cry proteins, but it was possible to fit asymptotic exponentials that modelled the upper half of each curve. The final double-exponential response surface I fitted with Roger’s assistance showed clearly that the dose-mortality response was stronger for one of the Cry proteins than the other, and that there was no synergistic action between the two proteins ([https://doi.org/10.1016/j.cropro.2015.10.013](https://doi.org/10.1016/j.cropro.2015.10.013)).

### The value of a comprehensive statistics package

One thing that I especially appreciate about having access to a comprehensive statistics package such as Genstat is having the capacity to tease apart biological data to get at the underlying relationships. About 10 years ago, I was asked to look at some data on the impact of cold stress on the expression of the Cry2Ab insecticidal protein in transgenic cotton. The data set was seemingly simple - two years of pot-trial data where groups of pots were either left out overnight or protected from low overnight temperatures by being moved into a glasshouse, plus temperature data and Cry2Ab protein levels.
A REML analysis, and some correlations and regressions, enabled me to show that cold overnight temperatures did reduce Cry2Ab protein levels, that the effects occurred for up to 6 days after the cold period and that the threshold for these effects was approximately 14 °C ([https://doi.org/10.1603/EC09369](https://doi.org/10.1603/EC09369)). What I took from this piece of work is how powerful a comprehensive statistics package can be in teasing apart important biological insights from what was seemingly very simple data. Note that I did not use any statistics that were cutting edge, just a combination of REML, correlation and regression analyses, but used these techniques to guide the dissection of the relationships in the data to end up with an elegant and insightful outcome.

### Final reflections

Looking back over 50 years of work, one thing stands out for me: the huge advances that have occurred in the statistical analysis of biological data have allowed much more insightful statistical analyses that have, in turn, allowed biological scientists to more elegantly pull apart the interactions between insects and their plant hosts. For me, Genstat has played a pivotal role in that process. I shall miss it.

**Dr John Rogers**

Research Connections and Consulting

St Lucia, Queensland 4067, Australia

Phone/Fax: +61 (0)7 3720 9065

Mobile: 0409 200 701

Email: [john.rogers@rcac.net.au](mailto:john.rogers@rcac.net.au)

Alternate email: [D.John.Rogers@gmail.com](mailto:D.John.Rogers@gmail.com)