Extending linear models to accommodate non-Normal data and random effects: LM → GLM → GLMM → HGLM

Dr. Vanessa Cave

03 May 2023

In an earlier blog (ANOVA, LM, LMM, GLM, GLMM, HGLM? Which statistical method should I use?) a simple diagram was presented with the aim of helping you decide which statistical model is appropriate for your data. In this follow-up blog, we’ll delve a little deeper and explore the relationships between the models: linear model (LM), generalized linear model (GLM), generalized linear mixed model (GLMM) and hierarchical generalized linear model (HGLM).

Linear model

LMs can be used to model Normal data with a single source of random variation 

The linear model is:

$$y = \mu + \varepsilon$$

where:

  • $y$ is the vector containing the observed response values $y_i$, assumed to be Normally distributed with mean $\mu_i$ and variance $\sigma^2$
  • $\mu$ is the vector of mean responses predicted by the model (i.e., $\mu_i$ is the expected value of observation $y_i$)
  • $\varepsilon$ is the vector of residuals (i.e., the random error), assumed to have a Normal distribution with mean 0 and variance $\sigma^2$

and

  • the mean, $\mu$, is modelled by a linear combination of explanatory variables, i.e.,

$$\mu = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

where $\beta_1$, $\beta_2$, …, $\beta_p$ are the regression coefficients (i.e., parameters) associated with the explanatory variables $x_1$, $x_2$, …, $x_p$, respectively. In matrix form, this mean model can be written more succinctly as:

$$\mu = X\beta$$

where $X$ is the model matrix for the explanatory variables, and $\beta$ is a vector containing their regression coefficients.

Simple example of a linear model

Modelling diastolic blood pressure using age as a predictor (i.e., explanatory variable).
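
As a rough illustration of how such a model might be fitted, the sketch below estimates an intercept and a slope by ordinary least squares in plain Python. The ages and blood pressures are synthetic values invented for this example (they are not the data behind the figure), and the function name `fit_simple_lm` is hypothetical.

```python
def fit_simple_lm(x, y):
    """Least-squares estimates for the mean model mu_i = intercept + slope * x_i."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Synthetic values invented for illustration only (not the blog's data).
age = [25.0, 35.0, 45.0, 55.0, 65.0]
pressure = [68.0, 74.0, 77.0, 83.0, 88.0]

intercept, slope = fit_simple_lm(age, pressure)
fitted = [intercept + slope * a for a in age]            # the vector of means, mu
residuals = [p - f for p, f in zip(pressure, fitted)]    # the vector of residuals

print(f"intercept = {intercept:.2f}, slope = {slope:.3f}")
```

The fitted values play the role of the mean vector and the residuals are the observed values minus those means, so the two add back up to the observed responses.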

Generalized linear model

GLMs extend linear models to accommodate data from non-Normal distributions 

In a generalized linear model, the expected value of $y$ is still $\mu$ but…

1. $y$ can now come from any distribution from the exponential family. In addition to the Normal distribution, this includes (amongst others) the binomial, Poisson, gamma, inverse-Normal, multinomial, negative-binomial, geometric, exponential and Bernoulli distributions.

and, importantly,

2. the underlying linear model now defines a linear predictor, $\eta$, i.e.,

$$\eta = X\beta$$

which is related to the mean response, $\mu$, via a link function, $g()$:

$$\eta = g(\mu)$$

Notice that the link function defines the transformation required to make the model linear. 

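Due to its special properties, the canonical link function for the distribution of $y$ is often used. However, sometimes there are good reasons to use a different link. For example, for binomial data, the canonical link function is the logit; however, for scientific reasons, the probit link or complementary-log-log link might be more appropriate. The canonical link functions for some common distributions are:

  • Normal: identity, $g(\mu) = \mu$
  • Poisson: natural logarithm, $g(\mu) = \log(\mu)$
  • binomial: logit, $g(\mu) = \log(\mu / (1 - \mu))$, with $\mu$ the expected proportion
  • gamma: reciprocal, $g(\mu) = 1/\mu$
  • inverse-Normal: $g(\mu) = 1/\mu^2$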

Simple example of a generalized linear model

Modelling the number of students awake at the end of a lecture (i.e., binomial data) using the duration of the lecture (in minutes) as a predictor, and a logit link function.

Predicted proportion of students awake at the end of the lecture on the scale of the linear predictor (i.e., logit scale) = 3 - 0.07 × Duration
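
To see what this equation implies on the original scale, the short sketch below applies the inverse of the logit link to the fitted line. The coefficients 3 and -0.07 come from the equation above; the durations and the function names are chosen here purely for illustration.

```python
import math


def linear_predictor(duration):
    """Fitted line on the logit scale, using the coefficients quoted above."""
    return 3.0 - 0.07 * duration


def inverse_logit(eta):
    """Map a value on the logit scale back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


# Lecture durations (minutes) chosen arbitrarily for illustration.
for duration in (10, 30, 50, 70):
    eta = linear_predictor(duration)
    print(f"{duration:3d} min: logit = {eta:5.2f}, proportion = {inverse_logit(eta):.3f}")
```

On this fitted line the predicted proportion drops to one half where the linear predictor equals zero, i.e., at a duration of 3 / 0.07 ≈ 43 minutes.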

[Figure: the fitted relationship plotted on the original scale (i.e., as proportions) against lecture duration]
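
The post does not say how the coefficients in this example were estimated. One standard way to fit a GLM is iteratively reweighted least squares (IRLS), and the sketch below applies it to a binomial model with a logit link and a single predictor. Everything here is an assumption made for illustration: the function name `fit_binomial_logit`, the class size of 50, and the "data", which are expected counts generated from the fitted line 3 - 0.07 × Duration rather than real observations.

```python
import math


def expit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_binomial_logit(x, successes, trials, n_iter=25):
    """IRLS for logit(mu_i) = b0 + b1 * x_i, where successes_i has mean trials_i * mu_i."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        sw = swx = swxx = swz = swxz = 0.0
        for xi, yi, ni in zip(x, successes, trials):
            eta = b0 + b1 * xi
            mu = expit(eta)
            var = mu * (1.0 - mu)
            w = ni * var                      # working weight
            z = eta + (yi / ni - mu) / var    # working response
            sw += w
            swx += w * xi
            swxx += w * xi * xi
            swz += w * z
            swxz += w * xi * z
        # Weighted least squares of z on (1, x): solve the 2 x 2 normal equations.
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1


# Synthetic input: expected counts implied by the fitted line (not real data).
durations = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
class_size = 50   # assumed number of students per lecture
awake = [class_size * expit(3.0 - 0.07 * d) for d in durations]

b0, b1 = fit_binomial_logit(durations, awake, [class_size] * len(durations))
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

Because the synthetic counts lie exactly on the fitted curve, the estimates should reproduce the coefficients used to generate them; with real counts the estimates would carry sampling error.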

Generalized linear mixed model

GLMMs extend generalized linear models to allow for more than one source of random variation (i.e., random effects)

Once again, the expected value of $y$ is $\mu$ and, as for generalized linear models, the underlying linear model defines the linear predictor but… the linear predictor is extended to include one or more random terms. That is, the linear predictor is:

$$\eta = X\beta + \sum_i Z_i u_i$$

where $Z_i$ is the model matrix for the $i$th random term, and $u_i$ corresponds to its vector of random effects. By allowing for random terms, data with additional sources of random variation, such as block effects, can be modelled.

In a generalized linear mixed model, the random effects corresponding to the random term $i$ are assumed to come from a Normal distribution with mean 0 and variance $\sigma_i^2$.

Simple example of a generalized linear mixed model

Modelling the number of nematodes in a plot, after treatment with one of four different fumigants, from a trial with a randomized complete block design.

Response variable, y: Nematodes

Assumed distribution of y: Poisson

Link function, g(): Natural logarithm

Explanatory (fixed) terms: factor Fumigant with 4 levels (i.e., generating an X matrix with 4 columns)

Random terms: factor Block with 10 levels (i.e., the linear predictor contains a single Z matrix with 10 columns)
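
The specification above can be sketched in code by building the two model matrices and evaluating the linear predictor. This is only an illustration under stated assumptions: one plot per fumigant in each block, indicator coding with one column per factor level, and made-up values for the fumigant effects and the block standard deviation (none of these numbers come from the trial).

```python
import math
import random

n_blocks, n_fumigants = 10, 4

# One plot per fumigant in every block (randomized complete block design).
plots = [(block, fum) for block in range(n_blocks) for fum in range(n_fumigants)]

# Indicator (dummy) coding: one column per factor level.
X = [[1.0 if fum == j else 0.0 for j in range(n_fumigants)] for _, fum in plots]
Z = [[1.0 if block == k else 0.0 for k in range(n_blocks)] for block, _ in plots]


def matvec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Made-up parameter values, for illustration only.
beta = [math.log(m) for m in (40.0, 25.0, 20.0, 10.0)]   # fumigant effects on the log scale
random.seed(1)
u = [random.gauss(0.0, 0.3) for _ in range(n_blocks)]    # Normal block effects (sd assumed)

eta = [a + b for a, b in zip(matvec(X, beta), matvec(Z, u))]   # linear predictor
mu = [math.exp(e) for e in eta]                                # inverse of the log link

print(len(X), len(X[0]), len(Z[0]))
```

Each row of the fixed and random model matrices picks out exactly one fumigant and one block, so the expected count for a plot is the exponential of its fumigant effect plus its block effect.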

Hierarchical generalized linear model

HGLMs extend generalized linear mixed models to allow for the random effects to follow a non-Normal distribution.

Just as for the three earlier modelling frameworks, the expected value of $y$ is $\mu$. And, as for generalized linear mixed models, the linear predictor, $\eta$, can include random terms but… these additional random terms aren’t constrained to follow just a Normal distribution nor to have an identity link. That is, the linear predictor is:

$$\eta = X\beta + \sum_i Z_i v_i$$

where the random terms now have their own link function:

$$v_i = h(u_i)$$

and the vectors of random effects $u_i$ can follow a non-Normal distribution (e.g., beta, gamma, inverse gamma).

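As it’s algorithmically and intuitively appealing, often the conjugate distribution to the distribution of the response variable, $y$, is used for the random effects, e.g.:

  • Normal response: Normal random effects
  • Poisson response: gamma random effects
  • binomial response: beta random effects
  • gamma response: inverse gamma random effects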

Simple example of a hierarchical generalized linear model

As above, modelling the number of nematodes in a plot, after treatment with one of four different fumigants, from a trial with a randomized complete block design but, this time, with a gamma distribution for the random block effects and a natural logarithm link, h().
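
A minimal sketch of what this change means for the expected counts is given below. The gamma shape parameter, the fumigant means and the variable names are all assumptions made for illustration; the gamma distribution is parameterised to have mean 1 so that a block effect of 1 leaves the fumigant mean unchanged.

```python
import math
import random

random.seed(2)
n_blocks = 10
shape = 4.0   # assumed value; gives block effects u with mean 1 and variance 1/shape

# Gamma-distributed block effects and their transform through the link h() = log.
u = [random.gammavariate(shape, 1.0 / shape) for _ in range(n_blocks)]
v = [math.log(uk) for uk in u]

# Made-up mean counts for the four fumigants in a block whose effect u equals 1.
fumigant_means = [40.0, 25.0, 20.0, 10.0]
beta = [math.log(m) for m in fumigant_means]

# mu[k][j]: expected count in block k under fumigant j, via the log link g().
mu = [[math.exp(beta[j] + v[k]) for j in range(4)] for k in range(n_blocks)]

print([round(m, 1) for m in mu[0]])
```

With a log link for both the response and the random term, exp(beta_j + log(u_k)) equals exp(beta_j) × u_k, so each gamma block effect simply scales the fumigant means up or down within its block.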

Want to fit a LM, GLM, GLMM or HGLM? Genstat offers comprehensive and user-friendly menus for fitting these models and outputting results.

Summary

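  • LM: Normal data with a single source of random variation
  • GLM: extends the LM to data from non-Normal distributions in the exponential family, with the mean related to the linear predictor via a link function
  • GLMM: extends the GLM by adding Normally distributed random effects to the linear predictor
  • HGLM: extends the GLMM by allowing the random effects to follow a non-Normal distribution and to have their own link function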

About the author

Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems.  Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.

Vanessa is a past President of both the Australasian Region of the International Biometric Society and the New Zealand Statistical Association, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.