Dr. Vanessa Cave, 03 May 2023
In an earlier blog (ANOVA, LM, LMM, GLM, GLMM, HGLM? Which statistical method should I use?) a simple diagram was presented with the aim of helping you decide which statistical model is appropriate for your data. In this follow-up blog, we’ll delve a little deeper and explore the relationships between the models: linear model (LM), generalized linear model (GLM), generalized linear mixed model (GLMM) and hierarchical generalized linear model (HGLM).
|LMs can be used to model Normal data with a single source of random variation
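The linear model is:
$$y = \mu + \varepsilon, \qquad \mu = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$
where $y$ is the response, $\mu$ is its expected value, $\varepsilon$ is the Normally distributed random error, and $\beta_1$, $\beta_2$, …, $\beta_p$ are the regression coefficients (i.e., parameters) associated with the explanatory variables $x_1$, $x_2$, …, $x_p$, respectively. In matrix form, this mean model can be written more succinctly as:
$$\mu = X\beta$$
where $X$ is the model matrix for the explanatory variables, and $\beta$ is a vector containing their regression coefficients.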
Simple example of a linear model
Modelling diastolic blood pressure using age as a predictor (i.e., explanatory variable).
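As a minimal sketch of what fitting such a model involves, the Python snippet below estimates the intercept and slope of a straight-line mean model by ordinary least squares (the intercept corresponds to a constant column of ones in the model matrix). The ages and blood pressures are invented numbers chosen to lie exactly on a line, not data from this blog, and `fit_simple_lm` is a hypothetical helper name.

```python
def fit_simple_lm(x, y):
    """Least-squares estimates (intercept, slope) for the mean model mu = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = Sxy / Sxx; the intercept makes the line pass through (x_bar, y_bar).
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Illustrative (invented) data: age in years, diastolic blood pressure in mmHg.
age = [30, 40, 50, 60, 70]
dbp = [72, 76, 80, 84, 88]

b0, b1 = fit_simple_lm(age, dbp)
print(f"Fitted mean model: DBP = {b0:.1f} + {b1:.2f} * Age")
```

Because the invented data have no scatter, the fitted line reproduces them exactly; with real data the residuals would carry the single source of random variation.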
|GLMs extend linear models to accommodate data from non-Normal distributions
In a generalized linear model, the expected value of $y$ is still $\mu$ but…
1. $y$ can now come from any distribution from the exponential family. In addition to the Normal distribution, this includes (amongst others) the binomial, Poisson, gamma, inverse-Normal, multinomial, negative-binomial, geometric, exponential and Bernoulli distributions.
2. the underlying linear model now defines a linear predictor, $\eta$, i.e.,
$$\eta = X\beta$$
which is related to the mean response, $\mu$, via a link function, $g()$:
$$\eta = g(\mu)$$
Notice that the link function defines the transformation required to make the model linear.
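Due to its special properties, the canonical link function for the distribution of $y$ is often used. However, sometimes there are good reasons to use a different link. For example, for binomial data, the canonical link function is the logit; however, for scientific reasons, the probit link or complementary-log-log link might be more appropriate. The canonical link functions are:
Normal: identity, $g(\mu) = \mu$
Binomial: logit, $g(\mu) = \log\left(\mu / (1 - \mu)\right)$
Poisson: natural logarithm, $g(\mu) = \log(\mu)$
Gamma: reciprocal, $g(\mu) = 1/\mu$
Inverse-Normal: $g(\mu) = 1/\mu^2$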
Simple example of a generalized linear model
Modelling the number of students awake at the end of a lecture (i.e., binomial data) using the duration of the lecture (in minutes) as a predictor, and a logit link function.
Predicted proportion of students awake at the end of the lecture on the scale of the linear predictor (i.e., logit scale) = 3 - 0.07 × Duration
Plotted on the original scale (i.e., as proportions):
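The back-transformation from the logit scale to proportions can be sketched in a few lines of Python. The coefficients 3 and -0.07 come from the example above; the function names and the lecture durations are assumptions made purely for illustration.

```python
import math

def inverse_logit(eta):
    """Map a value on the logit scale back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def predicted_proportion_awake(duration_minutes):
    # Linear predictor from the example: eta = 3 - 0.07 * Duration.
    eta = 3.0 - 0.07 * duration_minutes
    return inverse_logit(eta)

# Durations chosen purely for illustration.
for duration in (0, 30, 60, 90):
    p = predicted_proportion_awake(duration)
    print(f"Duration {duration:>2} min: predicted proportion awake = {p:.3f}")
```

On the logit scale the relationship is a straight line; on the proportion scale it becomes an S-shaped curve that passes through 0.5 where the linear predictor equals zero (at roughly 43 minutes here).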
|GLMMs extend generalized linear models to allow for more than one source of random variation (i.e., random effects)
Once again, the expected value of $y$ is $\mu$ and, as for generalized linear models, the underlying linear model defines the linear predictor $\eta$ but… the linear predictor is extended to include one or more random terms. That is, the linear predictor is:
$$\eta = X\beta + \sum_i Z_i u_i$$
where $Z_i$ is the model matrix for the $i$th random term, and $u_i$ corresponds to its vector of random effects. By allowing for random terms, data with additional sources of random variation, such as block effects, can be modelled.
In a generalized linear mixed model, the random effects $u_i$ corresponding to the $i$th random term are assumed to come from a Normal distribution with mean 0 and variance $\sigma_i^2$.
Simple example of a generalized linear mixed model
Modelling the number of nematodes in a plot, after treatment with one of four different fumigants, from a trial with a randomized complete block design.
Response variable, y: Nematodes
Assumed distribution of y: Poisson
Link function, g(): Natural logarithm
Explanatory (fixed) terms: factor Fumigant with 4 levels (i.e., generating an X matrix with 4 columns)
Random terms: factor Block with 10 levels (i.e., the linear predictor contains a single Z matrix with 10 columns)
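To make the X and Z matrices concrete, the Python sketch below builds indicator (0/1) matrices for this design. It assumes one plot per fumigant in each block, giving 40 plots, and codes X with one column per fumigant level and no separate intercept column; statistical software may parameterize X differently. All names are hypothetical.

```python
N_FUMIGANTS = 4
N_BLOCKS = 10

# One plot per fumigant within each block (a randomized complete block design);
# the order of plots within a block does not affect the matrices' structure.
plots = [(block, fumigant)
         for block in range(N_BLOCKS)
         for fumigant in range(N_FUMIGANTS)]

def indicator_row(level, n_levels):
    """Row of 0s with a single 1 in the column for `level`."""
    return [1 if j == level else 0 for j in range(n_levels)]

X = [indicator_row(fumigant, N_FUMIGANTS) for _, fumigant in plots]  # fixed term
Z = [indicator_row(block, N_BLOCKS) for block, _ in plots]           # random term

print(f"X: {len(X)} rows x {len(X[0])} columns")
print(f"Z: {len(Z)} rows x {len(Z[0])} columns")
```

Each row of X picks out the fumigant applied to a plot, and each row of Z picks out the block the plot belongs to, so the linear predictor for a plot is the sum of its fumigant effect and its block effect.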
|HGLMs extend generalized linear mixed models to allow for the random effects to follow a non-Normal distribution.
Just as for the three earlier modelling frameworks, the expected value of $y$ is $\mu$. And, as for generalized linear mixed models, the linear predictor, $\eta$, can include random terms but… these additional random terms aren’t constrained to follow just a Normal distribution nor to have an identity link. That is, the linear predictor is:
$$\eta = X\beta + \sum_i Z_i v_i$$
where the random terms now have their own link function:
$$v_i = h(u_i)$$
and the vectors of random effects $u_i$ can follow a non-Normal distribution (e.g., beta, gamma, inverse gamma).
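As it’s algorithmically and intuitively appealing, the conjugate distribution to the distribution of the response variable, $y$, is often used for the random effects, e.g.:
Normal response: Normal random effects
Poisson response: gamma random effects
Binomial response: beta random effects
Gamma response: inverse gamma random effects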
Simple example of a hierarchical generalized linear model
As above, modelling the number of nematodes in a plot, after treatment with one of four different fumigants, from a trial with a randomized complete block design but, this time, with a gamma distribution for the random block effects and a natural logarithm link, h().
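The structure of this Poisson-gamma model can be sketched in Python as follows. This only illustrates how gamma block effects enter the linear predictor through a log link; it does not fit the model. The fumigant effects, the gamma shape parameter and the seed are all hypothetical values, not estimates from the trial.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

N_BLOCKS = 10
SHAPE = 5.0  # hypothetical gamma shape; scale 1/SHAPE gives block effects with mean 1

# Hypothetical fixed effects on the log scale, one per fumigant.
fumigant_effect = [3.0, 2.5, 2.8, 2.2]

# u: gamma-distributed block effects; v = h(u) = log(u) enters the linear predictor.
u = [random.gammavariate(SHAPE, 1.0 / SHAPE) for _ in range(N_BLOCKS)]
v = [math.log(u_b) for u_b in u]

# eta = fumigant effect + v for each (block, fumigant) plot; mu = exp(eta) under the log link g().
mu = [[math.exp(beta_f + v_b) for beta_f in fumigant_effect] for v_b in v]

# With log links for both g() and h(), the block effect is multiplicative: mu = exp(beta) * u.
print(f"Block 1: u = {u[0]:.3f}, expected counts = {[round(m, 1) for m in mu[0]]}")
```

Under these assumptions each block scales the expected nematode count of every plot it contains by the same positive factor, which is why a gamma distribution (always positive) pairs naturally with the Poisson response.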
Want to fit an LM, GLM, GLMM or HGLM? Genstat offers comprehensive and user-friendly menus for fitting these models and outputting results.
Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems. Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
Vanessa is a past President of both the Australasian Region of the International Biometric Society and the New Zealand Statistical Association, is on the Editorial Board of The New Zealand Veterinary Journal and is an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.