# Extending linear models to accommodate non-Normal data and random effects: LM → GLM → GLMM → HGLM

Dr. Vanessa Cave

03 May 2023

In an earlier blog (ANOVA, LM, LMM, GLM, GLMM, HGLM? Which statistical method should I use?) a simple diagram was presented with the aim of helping you decide which statistical model is appropriate for your data. In this follow-up blog, we’ll delve a little deeper and explore the relationships between the models: linear model (LM), generalized linear model (GLM), generalized linear mixed model (GLMM) and hierarchical generalized linear model (HGLM).

#### Linear model

 LMs can be used to model Normal data with a single source of random variation

The linear model is:

$$\mathbf{y} = \boldsymbol{\mu} + \mathbf{e}$$

where:

• $\mathbf{y}$ is the vector containing the observed response values $y_i$, assumed to be Normally distributed with mean $\boldsymbol{\mu}$ and variance $\sigma^2$
• $\boldsymbol{\mu}$ is the vector of mean responses predicted by the model (i.e., $\mu_i$ is the expected value of observation $y_i$)
• $\mathbf{e}$ is the vector of residuals (i.e., the random error), assumed to have a Normal distribution with mean 0 and variance $\sigma^2$

and

• the mean, $\mu_i$, is modelled by a linear combination of explanatory variables, i.e.,

$$\mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki}$$

where $\beta_0$ is the intercept and $\beta_1$, $\beta_2$, …, $\beta_k$ are the regression coefficients (i.e., parameters) associated with the explanatory variables $x_1$, $x_2$, …, $x_k$, respectively. In matrix form, this mean model can be written more succinctly as:

$$\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}$$

where $\mathbf{X}$ is the model matrix for the explanatory variables, and $\boldsymbol{\beta}$ is a vector containing their regression coefficients.

#### Simple example of a linear model

Modelling diastolic blood pressure using age as a predictor (i.e., explanatory variable).
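A minimal sketch of this example in Python, using only numpy. The data are simulated (the "true" intercept, slope and error variance are invented purely for illustration), and the model is fitted by least squares via the model-matrix form $\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}$ described above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data (illustrative only): diastolic blood pressure vs age
age = rng.uniform(20, 70, size=100)
bp = 65 + 0.45 * age + rng.normal(0, 5, size=100)   # assumed "true" model

# Model matrix X: a column of ones (intercept) plus the age column
X = np.column_stack([np.ones_like(age), age])

# Least-squares estimate of beta in y = X @ beta + e, e ~ N(0, sigma^2)
beta_hat, *_ = np.linalg.lstsq(X, bp, rcond=None)
print(beta_hat)   # [estimated intercept, estimated slope]
```

With 100 observations, the estimated slope should land close to the simulated value of 0.45.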

#### Generalized linear model

 GLMs extend linear models to accommodate data from non-Normal distributions

In a generalized linear model, the expected value of $\mathbf{y}$ is still $\boldsymbol{\mu}$ but…

1. $\mathbf{y}$ can now come from any distribution in the exponential family. In addition to the Normal distribution, this includes (amongst others) the binomial, Poisson, gamma, inverse-Normal, multinomial, negative-binomial, geometric, exponential and Bernoulli distributions.

and, importantly,

2. the underlying linear model now defines a linear predictor, $\boldsymbol{\eta}$, i.e.,

$$\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta}$$

which is related to the mean response, $\boldsymbol{\mu}$, via a link function, $g$:

$$\boldsymbol{\eta} = g(\boldsymbol{\mu})$$

Notice that the link function defines the transformation required to make the model linear.

Due to its special properties, the canonical link function for the distribution of $\mathbf{y}$ is often used. However, sometimes there are good reasons to use a different link. For example, for binomial data the canonical link function is the logit, but for scientific reasons the probit or complementary log-log link might be more appropriate. The canonical link functions are:

| Distribution | Canonical link function |
| --- | --- |
| Normal | identity |
| Binomial | logit |
| Poisson | log |
| Gamma | reciprocal |
| Inverse-Normal | $1/\mu^2$ |
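As a quick illustration of the point that the link function is the transformation that makes the model linear, here is the binomial's canonical (logit) link and its inverse in a few lines of numpy:

```python
import numpy as np

def logit(mu):
    # canonical link for the binomial distribution: eta = log(mu / (1 - mu))
    return np.log(mu / (1 - mu))

def inv_logit(eta):
    # inverse of the logit: maps the linear predictor back to a proportion
    return 1.0 / (1.0 + np.exp(-eta))

# the link maps a proportion in (0, 1) onto the whole real line, and back
print(logit(0.8), inv_logit(logit(0.8)))
```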

#### Simple example of a generalized linear model

Modelling the number of students awake at the end of a lecture (i.e., binomial data) using the duration of the lecture (in minutes) as a predictor, and a logit link function.

Predicted proportion of students awake at the end of the lecture, on the scale of the linear predictor (i.e., the logit scale): $\mathrm{logit}(p) = 3 - 0.07 \times \text{Duration}$

Plotted on the original scale (i.e., as proportions), the fitted line becomes a curve: the back-transformed predictions are $p = \exp(\eta) / (1 + \exp(\eta))$, where $\eta$ is the linear predictor above.
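Back-transforming the fitted equation from the example, $\mathrm{logit}(p) = 3 - 0.07 \times \text{Duration}$, to the proportion scale looks like this (a sketch using only numpy; the coefficients come from the example above):

```python
import numpy as np

def inv_logit(eta):
    # back-transform from the logit scale to a proportion
    return 1.0 / (1.0 + np.exp(-eta))

# Fitted model from the example: logit(p) = 3 - 0.07 * Duration (minutes)
durations = np.array([10, 30, 60, 90])
eta = 3 - 0.07 * durations      # on the linear-predictor (logit) scale
p = inv_logit(eta)              # on the original scale, as proportions
for d, prop in zip(durations, p):
    print(f"{d:3d} min: {prop:.3f}")
```

As expected, the predicted proportion of students still awake decreases as the lecture drags on.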

#### Generalized linear mixed model

 GLMMs extend generalized linear models to allow for more than one source of random variation (i.e., random effects)

Once again, the expected value of $\mathbf{y}$ is $\boldsymbol{\mu}$ and, as for generalized linear models, the underlying linear model defines the linear predictor but… the linear predictor is extended to include one or more random terms. That is, the linear predictor is:

$$\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} + \sum_{j} \mathbf{Z}_j \mathbf{u}_j$$

where $\mathbf{Z}_j$ is the model matrix for the $j$th random term, and $\mathbf{u}_j$ corresponds to its vector of random effects. By allowing for random terms, data with additional sources of random variation, such as block effects, can be modelled.

In a generalized linear mixed model, the random effects corresponding to the $j$th random term, $\mathbf{u}_j$, are assumed to come from a Normal distribution with mean 0 and variance $\sigma_j^2$.

#### Simple example of a generalized linear mixed model

Modelling the number of nematodes in a plot, after treatment with one of four different fumigants, from a trial with a randomized complete block design.

Response variable, y: Nematodes

Assumed distribution of y: Poisson

Explanatory (fixed) terms: factor Fumigant with 4 levels (i.e., generating an X matrix with 4 columns)

Random terms: factor Block with 10 levels (i.e., the linear predictor contains a single Z matrix with 10 columns)
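The structure of this example can be sketched by simulating from the model with numpy (a hypothetical sketch: the fumigant effects and block variance below are invented for illustration, and a real analysis would fit the model with software such as Genstat):

```python
import numpy as np

rng = np.random.default_rng(1)

n_fumigants, n_blocks = 4, 10
# One plot per fumigant in each block (randomized complete block design)
fumigant = np.tile(np.arange(n_fumigants), n_blocks)
block = np.repeat(np.arange(n_blocks), n_fumigants)

X = np.eye(n_fumigants)[fumigant]   # 40 x 4 model matrix for Fumigant (fixed)
Z = np.eye(n_blocks)[block]         # 40 x 10 model matrix for Block (random)

beta = np.array([2.0, 1.6, 1.2, 2.4])   # assumed fumigant effects (log scale)
u = rng.normal(0, 0.3, size=n_blocks)   # random block effects, u ~ N(0, sigma_b^2)

eta = X @ beta + Z @ u                  # linear predictor: X*beta + Z*u
y = rng.poisson(np.exp(eta))            # Poisson counts via the log link
print(y[:8])
```

Note how `X` has 4 columns and `Z` has 10 columns, matching the factor levels listed above.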

#### Hierarchical generalized linear model

 HGLMs extend generalized linear mixed models to allow for the random effects to follow a non-Normal distribution.

Just as for the three earlier modelling frameworks, the expected value of $\mathbf{y}$ is $\boldsymbol{\mu}$. And, as for generalized linear mixed models, the linear predictor, $\boldsymbol{\eta}$, can include random terms but… these additional random terms aren’t constrained to follow just a Normal distribution nor to have an identity link. That is, the linear predictor is:

$$\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} + \sum_{j} \mathbf{Z}_j \mathbf{v}_j$$

where the random terms now have their own link function:

$$\mathbf{v}_j = h_j(\mathbf{u}_j)$$

and the vectors of random effects $\mathbf{u}_j$ can follow a non-Normal distribution (e.g., beta, gamma, inverse gamma).

As it’s algorithmically and intuitively appealing, the conjugate distribution to the distribution of the response variable, $\mathbf{y}$, is often used for the random effects, e.g.:

• binomial response: beta random effects
• Poisson response: gamma random effects
• gamma response: inverse-gamma random effects
• Normal response: Normal random effects

#### Simple example of a hierarchical generalized linear model

As above, modelling the number of nematodes in a plot, after treatment with one of four different fumigants, from a trial with a randomized complete block design but, this time, with a gamma distribution for the random block effects and a natural logarithm link, $h(\mathbf{u}) = \log(\mathbf{u})$.
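Mirroring the GLMM sketch above, the only change is that the block effects are now drawn from a gamma distribution and enter the linear predictor through the log link (again a hypothetical numpy sketch with invented effect sizes, not a fitted analysis):

```python
import numpy as np

rng = np.random.default_rng(2)

n_fumigants, n_blocks = 4, 10
fumigant = np.tile(np.arange(n_fumigants), n_blocks)
block = np.repeat(np.arange(n_blocks), n_fumigants)
X = np.eye(n_fumigants)[fumigant]        # model matrix for Fumigant (fixed)
beta = np.array([2.0, 1.6, 1.2, 2.4])    # assumed fumigant effects (log scale)

# Gamma-distributed random block effects with mean 1 (conjugate to Poisson)
u = rng.gamma(shape=5.0, scale=1.0 / 5.0, size=n_blocks)

# Log link h(u): the block effect enters the linear predictor as log(u),
# so eta = X*beta + log(u_block), i.e. mu = u_block * exp(X*beta)
eta = X @ beta + np.log(u)[block]
y = rng.poisson(np.exp(eta))
print(y[:8])
```

With this Poisson-gamma pairing, the gamma block effect acts multiplicatively on the mean count, which is why the conjugate choice is so convenient.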

Want to fit an LM, GLM, GLMM or HGLM? Genstat offers comprehensive and user-friendly menus for fitting these models and outputting results.