Ruth Butler

07 December 2023

Data transformation and generalized linear models (GLMs) are both used when data do not conform to the assumptions required for a valid analysis. The two main problems they address are (i) variances that differ between treatments (the data are not homoscedastic), and (ii) data that are not normally distributed, for example counts or percentages. Non-normality tends to be associated with a lack of homoscedasticity.

The two common methods to deal with a lack of homoscedasticity and non-normality are to transform the data, or to use a generalized linear model (GLM).

A transformation and a link function in a GLM look very similar, but switching between them has important consequences, particularly for the interpretation of the analysis results.

**Data Transformation**

When data is transformed, every data point is changed before the analysis is carried out. The explanatory variables are then directly related to the transformed data. For example, with a square-root transform, the mean MT obtained from the analysis is:

MT = mean(√(data)) = {explanatory variables}

And, with a log transform, the mean MT is:

MT = mean(log(data)) = {explanatory variables}

For the log transform, a bit of algebra shows that MT back-transformed (i.e., exp(MT)) is the ‘geometric mean’. For n data points, the geometric mean is the nth root (i.e., to the power of 1/n) of the product of the data points. The geometric mean can be quite different from the usual (arithmetic) mean, which affects how the analysis results are interpreted.
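The algebra behind this statement is short. Writing the n data points as $y_1, \dots, y_n$ (notation introduced here purely for illustration), and using natural logs:

$$
\exp(\mathrm{MT})
= \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\log y_i\right)
= \exp\!\left(\log\Big(\prod_{i=1}^{n} y_i\Big)^{1/n}\right)
= \Big(\prod_{i=1}^{n} y_i\Big)^{1/n}
$$

For positive data the geometric mean is never larger than the arithmetic mean, and the two are equal only when all the data points are identical.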

**Examples:**
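As a small numerical illustration, the sketch below compares the arithmetic mean with the back-transformed mean of the logs. The three data values are made up for the example, and Python is used only as a convenient calculator.

```python
import math
import statistics

# Hypothetical data for a single treatment (illustrative values only).
data = [1.0, 10.0, 100.0]

# The usual (arithmetic) mean of the raw data.
arithmetic_mean = statistics.mean(data)

# MT: the mean of the log-transformed data.
mt = statistics.mean([math.log(y) for y in data])

# Back-transforming MT gives the geometric mean ...
back_transformed = math.exp(mt)

# ... which is the nth root of the product of the data points.
nth_root_of_product = math.prod(data) ** (1 / len(data))

print("arithmetic mean:     ", arithmetic_mean)
print("exp(mean(log(data))):", round(back_transformed, 6))
print("nth root of product: ", round(nth_root_of_product, 6))
```

Here the arithmetic mean is 37, while the geometric mean is 10: the single large value pulls the arithmetic mean up much more than the geometric mean.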

**Links in a GLM**

A key part of a generalized linear model is the link function. This relates the explanatory variables to the mean of the data M:

log(M) = log(mean(data)) = {explanatory variables}

This equation looks very similar to the one above for MT, but the swapping around of the mean and transformation functions leads to different results: ‘The log of the mean is not the same as the mean of the log’. Back-transforming log(M) gives M, which is the standard (arithmetic) mean.

**What is the consequence of transforming data?**

Transforming data does not only result in a different type of average, it also changes four other things. The five changes are illustrated for a log-transformation:

*1. Estimates of the treatment ‘mean’:*

- ANOVA of the log(data) gives the means of the *log* of the data, which when back-transformed, as described above, are geometric means.

*2. Distribution of the analysed data:*

- If the data for each treatment are log-normally distributed, then the distribution of the logged data is normally distributed.

*3. Mean: Variance relationship* (the most important effect):

- For log-normally distributed data, the variance increases with increasing treatment means. The variance of the log(data) is often constant across treatments.
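The mean-variance relationship can be seen in a small simulation. This is a sketch under stated assumptions: two hypothetical treatments whose data are log-normal with the same standard deviation on the log scale (0.5) but different log-scale means; the treatment names, parameter values and sample size are all chosen only for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

SIGMA = 0.5                              # assumed common SD on the log scale
LOG_SCALE_MEANS = {"A": 1.0, "B": 3.0}   # hypothetical treatments

summary = {}
for treatment, mu in LOG_SCALE_MEANS.items():
    # Simulate log-normal data, then log-transform it.
    raw = [random.lognormvariate(mu, SIGMA) for _ in range(5000)]
    logged = [math.log(y) for y in raw]
    summary[treatment] = {
        "raw_mean": statistics.mean(raw),
        "raw_variance": statistics.variance(raw),
        "log_variance": statistics.variance(logged),
    }

for treatment, stats in summary.items():
    print(treatment, {name: round(value, 3) for name, value in stats.items()})
```

On the raw scale the treatment with the larger mean has a much larger variance, whereas the variances of the logged data are both close to 0.25 (that is, 0.5 squared).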

*4. Estimates of precision of the ‘means’:*

- For the log transform, confidence limits or a Least Significant Difference (LSD\*) can be meaningfully back-transformed. A back-transformed LSD is a ‘Least Significant Ratio’ (the smallest ratio of the larger mean to the smaller mean for the two means to differ significantly). However, standard errors cannot meaningfully be back-transformed. (For more information see ‘Back-transforming standard errors using the Delta method’.)

*\*LSD: the smallest difference between two means such that they are significantly different at the required significance level.*
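A short sketch of what back-transforming these quantities looks like in practice. The log-scale mean, confidence limits and LSD below are hypothetical numbers (natural logs), not output from any real analysis.

```python
import math

# Hypothetical results on the natural-log scale (illustrative values only).
mean_log = 2.0        # treatment mean of log(data)
ci_log = (1.6, 2.4)   # confidence limits on the log scale
lsd_log = 0.5         # LSD on the log scale

# Back-transformed mean: a geometric mean.
geometric_mean = math.exp(mean_log)

# Back-transformed confidence limits: limits for the geometric mean.
ci_back = tuple(math.exp(limit) for limit in ci_log)

# Back-transformed LSD: a Least Significant Ratio.
least_significant_ratio = math.exp(lsd_log)

print("geometric mean:         ", round(geometric_mean, 3))
print("confidence limits:      ", tuple(round(limit, 3) for limit in ci_back))
print("least significant ratio:", round(least_significant_ratio, 3))
```

The back-transformed limits are not symmetric about the geometric mean (the upper limit is further away than the lower one), which is one reason a single back-transformed standard error has no useful meaning. With these numbers, two geometric means would differ significantly when the larger is at least about 1.65 times the smaller.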

*5. The type of comparison between treatments:*

- ANOVA of the log(data) compares the differences between the means of the log(data). Back-transforming this difference gives the *ratio* between the *geometric means*. This does not directly say anything about differences between the raw means:

- Untransformed data: MeanA−MeanB
- Logged data: log(GeoMeanA)−log(GeoMeanB)
- which, when back-transformed, is GeoMeanA/GeoMeanB.
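The comparison above can be checked numerically. In this sketch the two treatments and their data values are hypothetical, chosen so that the geometric means come out as round numbers.

```python
import math
import statistics

# Hypothetical data for two treatments (illustrative values only).
treatment_a = [2.0, 8.0, 32.0]
treatment_b = [1.0, 2.0, 4.0]

# Means of the logged data, as compared by ANOVA of log(data).
mean_log_a = statistics.mean([math.log(y) for y in treatment_a])
mean_log_b = statistics.mean([math.log(y) for y in treatment_b])

# Difference on the log scale, and its back-transform.
difference_of_log_means = mean_log_a - mean_log_b
ratio_of_geometric_means = math.exp(difference_of_log_means)

# The corresponding comparisons of the raw (arithmetic) means.
raw_mean_a = statistics.mean(treatment_a)
raw_mean_b = statistics.mean(treatment_b)
difference_of_raw_means = raw_mean_a - raw_mean_b
ratio_of_raw_means = raw_mean_a / raw_mean_b

print("ratio of geometric means: ", round(ratio_of_geometric_means, 3))
print("difference of raw means:  ", round(difference_of_raw_means, 3))
print("ratio of raw means:       ", round(ratio_of_raw_means, 3))
```

The geometric means are 8 and 2, so the back-transformed difference is a ratio of 4. The raw means are 14 and about 2.33, giving a difference of about 11.67 and a ratio of 6, neither of which can be recovered from the log-scale analysis.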

For many transformations, the means and precision cannot be back-transformed to give meaningful or interpretable results. Historically, some practitioners addressed the problem of the back-transformed ‘mean’ from the analysis not equalling the standard arithmetic mean by using ‘bias adjusted back-transforms’ (for an example, see Box & Cox 1964). This approach is largely unnecessary now that GLMs are available.

In contrast to using a transformation, a generalized linear model allows the above five changes to be at least partially addressed:

- Each distribution is associated with a particular mean-variance relationship, so selecting a distribution is effectively selecting a mean-variance relationship. The variance in a GLM is fV(m), where m is the mean and f is the ‘dispersion’. V(m) is a function relating the variance to the mean, called the ‘variance function’. The most commonly used distributions and their associated variance functions are given in Table 1.
- The link is selected independently of the distribution: this is a major advantage of a GLM over data transformation. The choice of link determines the relationship between the estimates (means…) and the explanatory variables. As with a log data transformation, choosing a log link results in a multiplicative relationship between the explanatory variables and the mean. However, with a log link the mean is the standard (arithmetic) mean, whereas a log data transform gives the geometric mean.

**Table 1: The most commonly used distributions, their variance function, dispersion and canonical links.**

All distributions have a ‘canonical’ link, which, when used, has special properties. In many cases, the canonical link makes the most sense to use. However, some distributions are frequently used with an alternative link, because the alternative link makes more sense scientifically and can be easier to interpret.
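To make the variance function and canonical link concrete, the sketch below encodes the standard textbook forms for four widely used distributions and evaluates fV(m) for a given mean. The dictionary layout and the helper name `glm_variance` are illustrative choices, not part of any statistical package; the binomial entry assumes m is a proportion.

```python
import math

# Standard textbook variance functions V(m) and canonical links.
# The dispersion f is fixed at 1 for the Poisson and binomial
# distributions and is usually estimated for the normal and gamma.
FAMILIES = {
    "normal": {
        "variance": lambda m: 1.0,
        "canonical_link": ("identity", lambda m: m),
    },
    "poisson": {
        "variance": lambda m: m,
        "canonical_link": ("log", math.log),
    },
    "binomial": {  # m is a proportion between 0 and 1
        "variance": lambda m: m * (1 - m),
        "canonical_link": ("logit", lambda m: math.log(m / (1 - m))),
    },
    "gamma": {
        "variance": lambda m: m ** 2,
        "canonical_link": ("reciprocal", lambda m: 1 / m),
    },
}


def glm_variance(family, mean, dispersion=1.0):
    """Return f * V(m) for the named family (hypothetical helper)."""
    return dispersion * FAMILIES[family]["variance"](mean)


for name, family in FAMILIES.items():
    link_name, link = family["canonical_link"]
    mean = 0.2  # an arbitrary mean that is valid for all four families
    print(name, "variance:", round(glm_variance(name, mean), 4),
          "| link:", link_name, "=", round(link(mean), 4))
```

For example, a Poisson mean of 4 implies a variance of 4, while a gamma mean of 4 with dispersion 0.5 implies a variance of 8.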

**So, how should I choose between transformation and a GLM?**

A core consideration is how interpretable the analysis results are, especially when comparing means.

Some traditional transforms can make interpretation quite hard, with no meaningful interpretation for back-transformed means or measures of precision. For example:

- Arc-sine for percentages calculated from counts (Warton & Hui 2011 give extensive reasons not to use this transform)
  - Instead, use a binomial GLM with the logit, probit or complementary log-log link
- Square-root for count data
  - Instead, use a Poisson GLM with a log link
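The difference in interpretability for counts can be illustrated with a minimal sketch. It assumes hypothetical counts for two treatments and a model with one parameter per treatment, and it fits the Poisson GLM with a log link using a hand-written iteratively reweighted least squares (IRLS) loop. The function name `fit_poisson_log_mean` is made up for this example, the loop handles only this simple one-way layout, and it computes no standard errors; a real analysis would use statistical software.

```python
import math
import statistics

# Hypothetical counts for two treatments (illustrative values only).
counts = {"A": [0, 1, 4, 9, 6], "B": [4, 9, 16, 25, 6]}


def fit_poisson_log_mean(y, tol=1e-10, max_iter=50):
    """Fitted mean for one treatment from a Poisson GLM with log link.

    With one parameter per treatment, the weighted least-squares step
    separates by treatment, so each treatment can be fitted on its own.
    """
    mu = [value + 0.1 for value in y]  # starting values, kept above zero
    previous = None
    for _ in range(max_iter):
        eta = [math.log(m) for m in mu]
        # Working response and weights for the Poisson/log-link case.
        z = [e + (obs - m) / m for e, obs, m in zip(eta, y, mu)]
        w = mu
        # Weighted mean of z: the least-squares step for a single parameter.
        estimate = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
        mu = [math.exp(estimate)] * len(y)
        if previous is not None and abs(estimate - previous) < tol:
            break
        previous = estimate
    return math.exp(estimate)  # back-transform the linear predictor


arithmetic_means = {t: statistics.mean(y) for t, y in counts.items()}
glm_means = {t: fit_poisson_log_mean(y) for t, y in counts.items()}
sqrt_means = {
    t: statistics.mean([math.sqrt(v) for v in y]) ** 2
    for t, y in counts.items()
}

for t in counts:
    print(t, "arithmetic:", arithmetic_means[t],
          "| GLM (log link):", round(glm_means[t], 4),
          "| back-transformed sqrt:", round(sqrt_means[t], 4))
```

For these data the back-transformed GLM means agree with the arithmetic means (4 and 12), whereas squaring the means of the square-rooted counts gives smaller values that are neither arithmetic nor geometric means.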

There are some cases where a data transformation makes very good sense. For example, if you have areas, a square-root transform essentially converts the data to lengths. If you have volumes or weights, a cube-root transform may be appropriate (see Welham et al. 2015, chapter 6). Data such as weights may also be close to log-normally distributed, so a log transformation may be suitable, bearing in mind that this gives geometric rather than arithmetic means.

Data-transformation and GLM can both give valid analyses for the same set of data. Thus, the choice between these is really about how easily the analysis can be carried out, and how easily the results can be interpreted. With complex trials that have many strata or random effects, it may be easier to use ANOVA on transformed data because extensions of GLM that include random effects (GLMM, HGLM) can be more challenging to fit.

**References:**

Bias adjusted back-transforms: Box, G.E.P. & Cox, D.R. 1964. An Analysis of Transformations. *Journal of the Royal Statistical Society: Series B (Methodological)* **26**(2), 211-252.

Welham, S.J., Gezan, S.A., Clark, S.J. & Mead, A. 2015. *Statistical methods in biology: design and analysis of experiments and regression*. CRC Press, Boca Raton, Florida, Pp 563+xiii.

Warton, D.I. & Hui, F.K.C. 2011. The arcsine is asinine: the analysis of proportions in ecology. *Ecology* **92**(1), 3-10.

**Related blogs**

ANOVA, LM, LMM, GLM, GLMM, HGLM? Which statistical method should I use?

Extending linear models to accommodate non-Normal data and random effects: LM → GLM → GLMM → HGLM

Back-transforming standard errors using the Delta method

Dr. Ruth Butler has worked as a biometrician/statistical consultant for more than 35 years, initially in the UK, then from the mid-1990s in New Zealand. She has primarily worked with bio-protection scientists (plant pathology, entomology), but also has significant experience working with other non-medical biological scientists including in soils/agronomy, food research and plant breeding. Ruth has been a Genstat user throughout her career, contributing around 10 Genstat procedures, and has been a beta tester of Genstat for 30 years. Ruth has also been a CycDesigN user since the very first version was released in 1997. Her interests are in good data management practices, well-designed experiments, and in improving communication between statisticians and scientists.
