Chapter 7 Generalized linear models

In ordinary least squares with a single predictor, we have the relationship \(\E\left[Y_{i}\right]=\alpha+\beta x_{i}\). This model asserts that the mean response

  • has a “baseline” (intercept) of \(\alpha\)
  • will change by \(\beta\) (slope) for a one-unit increase in \(x_{i}\)

Because the mean response is linear in the regression coefficients, we refer to OLS as a linear model. OLS assumes that the response \(Y\) is continuous, e.g., height. However, analytical interest often lies in responses that are not continuous, such as binary outcomes or counts, and OLS does not model these discrete responses well. In such cases, we can extend the model by assuming other distributions for \(Y\).

Following Casella and Berger (2002), a generalized linear model “describes a relationship between the mean of a response variable \(Y\) and an independent variable \(x\).” A GLM consists of three components.

  1. The random component or distributional assumption consists of the response variables \(Y_{1},\ldots,Y_{n}\), which are assumed to be independent random variables from the same exponential family (though they are not assumed to be identically distributed). Fitzmaurice, Laird, and Ware (2012) describe the random component as “a probabilistic mechanism by which the responses are assumed to be generated.”

  2. The systematic component is the linear regression model, i.e., a function of the predictor variables \(X_{ik}\) that is linear in the parameters \(\beta_{k}\) and related to the mean of \(Y_{i}\).

  3. The link function \(g\left(\mu\right)\) links the random and systematic components by asserting that \(g\left(\mu_{i}\right)=\mathbf{X}_{i}^{\mathsf{T}}\boldsymbol{\beta}\), where \(\mu_{i}=\E\left[Y_{i}\right]\) and \(\mathbf{X}_{i}^{\mathsf{T}}\boldsymbol{\beta}=\sum_{k=1}^{p}\beta_{k}X_{ik}\) is the systematic component.
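
To make the three components concrete, note that OLS itself is a GLM: the random component is Gaussian, the systematic component is \(\mathbf{X}_{i}^{\mathsf{T}}\boldsymbol{\beta}\), and the link is the identity. A minimal sketch in R (using the built-in mtcars data, chosen here purely for illustration):

ols_fit <- lm(mpg ~ wt, data = mtcars)
glm_gaussian_fit <- glm(mpg ~ wt, family = gaussian(link = "identity"), data = mtcars)

# The coefficient estimates agree: for the Gaussian family with the identity
# link, maximum likelihood coincides with least squares.
cbind(lm = coef(ols_fit), glm = coef(glm_gaussian_fit))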

Example 7.1 (Logistic regression) For \(Y_{1},Y_{2},\ldots,Y_{n}\), let \(Y_{i}\sim\text{Bernoulli}\left(p_{i}\right)\), i.e.,

\[ Y_{i}=\begin{cases} 0, & \text{if no event}\\ 1, & \text{if event} \end{cases}. \]

The \(Y_{i}\) are the random component. Suppose that we believe that variables \(X_{1},\ldots,X_{p}\) are related to the response \(Y\), so that the systematic component is \(\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\). We now consider the link function. The expected value of a Bernoulli random variable is its parameter \(p\), hence \(\mu_{i}=\E\left[Y_{i}\right]=p_{i}=P\left(Y_{i}=1\right)\).

If we use the identity link \(g\left(\mu\right)=\mu\), then our model is \(\mu=\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\). Depending on our predictors, we may obtain values for \(\mu\) that lie outside \(\left[0,1\right]\). Because \(p\in\left[0,1\right]\), it is not clear how to interpret such values. Accordingly, we would like to transform \(\mu\) such that it always lies in \(\left[0,1\right]\).
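
To see the problem concretely, here is a small simulated sketch (the data are illustrative only) in which fitting a binary response with the identity link produces fitted means outside \(\left[0,1\right]\):

set.seed(1)
x <- seq(-5, 5, length.out = 200)
p_true <- plogis(x)  # true success probability (the logistic function, introduced next)
y <- rbinom(200, size = 1, prob = p_true)

# Gaussian family with the identity link, i.e., a linear probability model
identity_fit <- glm(y ~ x, family = gaussian)
range(fitted(identity_fit))  # the fitted values stray outside [0, 1]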

The standard logistic function is

\[ \sigma\left(t\right)=\frac{1}{1+\mathrm{e}^{-t}},\quad t\in\mathbb{R}. \]

Observe that

\[ \lim_{t\rightarrow\infty} \sigma\left(t\right)=\frac{1}{\lim_{t\rightarrow\infty}\left(1+\mathrm{e}^{-t}\right)} = \frac{1}{1+\lim_{t\rightarrow\infty}\mathrm{e}^{-t}}=\frac{1}{1+0}=1 \]

and

\[ \lim_{t\rightarrow -\infty}\sigma\left(t\right)=\frac{1}{1+\lim_{t\rightarrow -\infty}\mathrm{e}^{-t}}=0, \]

since \(\mathrm{e}^{-t}\rightarrow\infty\) as \(t\rightarrow -\infty\), so the denominator grows without bound.

For any \(t\in\left(-\infty,\infty\right)\), we have \(\mathrm{e}^{-t}>0\), so that \(1<1+\mathrm{e}^{-t}\), hence \(\sigma\left(t\right)<1\); and because the numerator and denominator are both positive, \(\sigma\left(t\right)>0\). Thus \(\sigma\) is bounded strictly between 0 and 1, which makes the standard logistic function a good candidate for the inverse of our link. If we let \(t=\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\), then \(\sigma\) will map the model to \(\left(0,1\right)\), i.e.,

\[ p\left(\mathbf{X}\right)=\frac{1}{1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}}, \]

where we have used the notation \(p\) to reflect that we are modeling the probability of a success (the event of interest occurs). We have mapped the model to an interval appropriate for the mean response, but have some work left to do to put it into the correct form for a GLM. The inverse of the logistic function is the logit function, given by

\[ \text{logit}\left(t\right)=\log\frac{t}{1-t}. \]

Observe that

\[ \begin{align*} \text{logit}\left(p\left(\mathbf{X}\right)\right) & = \log\frac{p\left(\mathbf{X}\right)}{1-p\left(\mathbf{X}\right)} \\ & = \log p\left(\mathbf{X}\right)-\log\left(1-p\left(\mathbf{X}\right)\right) \\ & = \log\frac{1}{1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}}-\log\left(1-\frac{1}{1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}}\right) \\ & = \log 1-\log\left(1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}\right)-\log\left(\frac{1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}}{1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}}-\frac{1}{1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}}\right) \\ & = -\log\left(1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}\right)-\log\left(\frac{\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}}{1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}}\right) \\ & = -\log\left(1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}\right)-\left[\log\left(\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}\right)-\log\left(1+\exp\left\{-\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\}\right)\right] \\ & = \mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}. \end{align*} \]

Thus, applying the logit function to the transformed mean response \(p\left(\mathbf{X}\right)\) results in an appropriate form for the systematic component of the model. Accordingly, we will take the logit as the link function, i.e., \(g\left(\mu\right)=\text{logit}\left(\mu\right)\), so that the logistic regression model is

\[ \text{logit}\left(\mu\right) = \text{logit}\left(p\right) = \log\left(\frac{p}{1-p}\right)=\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}. \]

We refer to \(p/\left(1-p\right)\) as the odds of the event (how likely versus not). We see that logistic regression is linear in the log-odds of \(Y\). When we exponentiate both sides of the above equation, we can interpret the coefficient \(\beta_{i}\) as the multiplicative change in the odds of success associated with a one-unit change in \(X_{i}\).
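
As a quick numerical check of the algebra above, the following sketch verifies that the logistic function and the logit are inverses (base R provides the logistic function as plogis()):

sigma <- function(t) 1 / (1 + exp(-t))
logit <- function(t) log(t / (1 - t))

t <- seq(-5, 5, by = 0.25)
all.equal(logit(sigma(t)), t)    # TRUE: the logit undoes the logistic function
all.equal(sigma(t), plogis(t))   # TRUE: matches base R's plogis()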

Finally, recall from Section 5.6 that \(\log\left(p/\left(1-p\right)\right)\) is the natural parameter of the binomial exponential family. When the natural parameter is used as the link function in a GLM, it is called the canonical link.

We now consider a GLM suitable for count data.

Example 7.2 (Poisson regression) For \(Y_{1},Y_{2},\ldots,Y_{n}\), let \(Y_{i}\sim\text{Poisson}\left(\lambda_{i}\right)\), where \(\lambda_{i}=\E\left[Y_{i}\right]>0\).

In a Poisson regression, we use the log link, which maps the positive mean onto the whole real line (and is the canonical link for the Poisson family), so that \(\log\left(\lambda_{i}\right)=\beta_{0}+\beta_{1}X_{1}+\cdots+\beta_{k}X_{k}\).
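
As a minimal sketch, we can simulate count data with a log-linear mean and recover the coefficients with glm() (the data and parameter values here are illustrative only):

set.seed(1)
x <- runif(500)
lambda <- exp(0.5 + 1.2 * x)  # log(lambda) is linear in x
y <- rpois(500, lambda)

# The log link is the default (and canonical) link for family = poisson
pois_fit <- glm(y ~ x, family = poisson)
coef(pois_fit)  # estimates should be near the true values (0.5, 1.2)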

7.1 Interaction terms

The relationship among the predictors in the models specified above is additive: the effect on the response variable of two predictors \(X_{i}\) and \(X_{j}\) for \(i\neq j\) is their sum, weighted by their respective coefficients, i.e., \(\beta_{i}X_{i}+\beta_{j}X_{j}\). Suppose that we believe that the effect on the response of \(X_{j}\) is not independent of the value of \(X_{i}\), i.e., that the relationship between \(X_{i}\) and \(X_{j}\) is not additive. We can test this hypothesis by introducing an interaction term.

We will begin by considering interactions in the context of the German Credit data set from Dheeru and Karra Taniskidou (2017), which is bundled in the caret package. This data set gives the creditworthiness (Good or Bad) of 1000 customers, along with related attributes. We will model the credit class as a function of the customer’s age, whether the customer is a foreign worker, and whether the customer has a telephone number registered under his or her name. (While the data set contains many other predictors, we will use a simple model that allows us to examine a two-way interaction of binary variables.) The response variable (Class) has two levels, so (binary) logistic regression is suitable. We begin by fitting the (additive) model.

library(caret)  # provides the GermanCredit data set
data(GermanCredit)

glm_fit <- glm(
  Class ~ Age + ForeignWorker + Telephone, 
  family = binomial,
  data = GermanCredit
)

summary(glm_fit)
## 
## Call:
## glm(formula = Class ~ Age + ForeignWorker + Telephone, family = binomial, 
##     data = GermanCredit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2389  -1.4298   0.8033   0.8835   0.9851  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    1.629875   0.594810   2.740  0.00614 **
## Age            0.017587   0.006495   2.708  0.00677 **
## ForeignWorker -1.347619   0.536163  -2.513  0.01196 * 
## Telephone     -0.145706   0.144578  -1.008  0.31355   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1221.7  on 999  degrees of freedom
## Residual deviance: 1203.9  on 996  degrees of freedom
## AIC: 1211.9
## 
## Number of Fisher Scoring iterations: 4

We can exponentiate the coefficients to obtain odds ratios. Recall that the odds of a success are given by

\[ \frac{p}{1-p}=\exp\left\{\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\} \]

for \(\mathbf{X}\in\mathbb{R}^{p+1}\) (the \(p\) predictors plus a leading 1 for the intercept). Let \(\text{odds}\left(x_{i}\right)\) be the odds of a success when \(X_{i}=x_{i}\) and all other predictors are held fixed. Then, the odds ratio associated with a one-unit increase in \(X_{i}\) is

\[ \begin{align*} \frac{\text{odds}\left(x_{i}+1\right)}{\text{odds}\left(x_{i}\right)} & = \frac{\exp\left\{\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{i}\left(x_{i}+1\right)+\cdots+\beta_{p}x_{p}\right\}}{\exp\left\{\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{i}x_{i}+\cdots+\beta_{p}x_{p}\right\}} \\ & = \exp\left\{\beta_{i}\left(x_{i}+1\right)\right\}\exp\left\{-\beta_{i}x_{i}\right\} \\ & = \mathrm{e}^{\beta_{i}}. \end{align*} \]

Observe that for any \(k\in\mathbb{N}^{+}\), we have

\[ \begin{align*} \text{odds}\left(x_{i}+k\right) & = \exp\left\{\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{i}\left(x_{i}+k\right)+\cdots+\beta_{p}x_{p}\right\} \\ & = \exp\left\{\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{i}x_{i}+\cdots+\beta_{p}x_{p}\right\}\exp\left\{k\beta_{i}\right\} \\ & = \mathrm{e}^{k\beta_{i}}\exp\left\{\mathbf{X}^{\mathsf{T}}\boldsymbol{\beta}\right\} \\ & = \left(\mathrm{e}^{\beta_{i}}\right)^{k}\text{odds}\left(x_{i}\right). \end{align*} \]

We thus see that a \(k\)-unit increase in \(X_{i}\) multiplies the odds of a success by \(\mathrm{e}^{\beta_{i}}\) a total of \(k\) times (again, when all other predictors are held fixed); dividing this expression by \(\text{odds}\left(x_{i}\right)\) gives the odds ratio \(\mathrm{e}^{k\beta_{i}}\) associated with a \(k\)-unit increase in \(X_{i}\). We now consider the odds ratios for our model.

cbind(
  coef = coef(glm_fit),
  coef_exp = exp(coef(glm_fit))
)
##                      coef  coef_exp
## (Intercept)    1.62987542 5.1032389
## Age            0.01758655 1.0177421
## ForeignWorker -1.34761869 0.2598583
## Telephone     -0.14570579 0.8644120

We see that the customer’s Age and status as a ForeignWorker are statistically significant predictors of credit class at the customary \(\alpha=0.05\). A one-unit increase in Age multiplies the odds of Good credit by about 1.02, i.e., for each additional year of age, the odds of Good credit increase by roughly \(2\%\). Similarly, a one-unit change in ForeignWorker multiplies the odds by about 0.26, i.e., if the customer is a foreign worker (versus not), the odds of Good credit decrease by about \(74\%\).
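
Using the \(k\)-unit rule derived above, we can also compute, say, the odds ratio associated with a 10-year increase in Age:

exp(10 * coef(glm_fit)["Age"])  # about 1.19: a decade of age multiplies the odds by ~1.19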

We observe that whether the customer has a registered telephone number is not a significant predictor of Good credit. Now, it is plausible that having a registered telephone number is a proxy for how long one has been a resident of the country, hence that workers from abroad might tend to have telephone numbers at a rate different from that of domestic workers. We can test whether this interaction is significant by including the appropriate model term.

glm_interact_fit <- glm(
  Class ~ Age + ForeignWorker * Telephone, 
  family = binomial,
  data = GermanCredit
)

summary(glm_interact_fit)
## 
## Call:
## glm(formula = Class ~ Age + ForeignWorker * Telephone, family = binomial, 
##     data = GermanCredit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3440  -1.4251   0.8021   0.8835   1.0674  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             -0.187238   0.940721  -0.199  0.84223   
## Age                      0.017371   0.006503   2.671  0.00755 **
## ForeignWorker            0.496322   0.921310   0.539  0.59008   
## Telephone                2.294990   1.170523   1.961  0.04992 * 
## ForeignWorker:Telephone -2.472636   1.179491  -2.096  0.03605 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1221.7  on 999  degrees of freedom
## Residual deviance: 1199.8  on 995  degrees of freedom
## AIC: 1209.8
## 
## Number of Fisher Scoring iterations: 5

We see that the interaction term ForeignWorker:Telephone is significant, so we conclude that the effect of having a Telephone number is not independent of whether the customer is a ForeignWorker. The interpretation of the main effects for ForeignWorker and Telephone is very different in the presence of the interaction term. One way to understand the effects is to consider all possible combinations.

newdata <- with(
  GermanCredit,
  expand.grid(
    ForeignWorker = unique(ForeignWorker),
    Telephone = unique(Telephone),
    Age = mean(Age)
  )
)

# predict() returns the linear predictor, i.e., the log-odds, by default
pred <- predict(glm_interact_fit, newdata = newdata)

pred_odds <- cbind(
  newdata,
  log_odds = pred,
  odds = exp(pred)
)

pred_odds
##   ForeignWorker Telephone    Age  log_odds      odds
## 1             1         0 35.546 0.9265514  2.525784
## 2             0         0 35.546 0.4302298  1.537611
## 3             1         1 35.546 0.7489063  2.114686
## 4             0         1 35.546 2.7252202 15.259774

These results give the odds of having Good versus Bad credit for each combination of ForeignWorker and Telephone, holding Age constant (at its mean). Observe that the log-odds is a linear combination of the coefficients for each set of predictors. For example, the first row represents a customer who is a foreign worker, is of mean age, and does not have a telephone. Telephone is 0, so that the interaction \(\text{ForeignWorker}\times\text{Telephone}\) is 0, hence the log-odds are the sum of the coefficients for the intercept, ForeignWorker, and Age (multiplied by the mean age), i.e.,

sum(
  coef(glm_interact_fit)[c("(Intercept)", "ForeignWorker")],
  coef(glm_interact_fit)["Age"] * pred_odds[1, "Age"]
)
## [1] 0.9265514

Geometrically, including an interaction term is equivalent to testing whether the slope (the coefficient) of the Telephone predictor is different if the customer is a ForeignWorker; otherwise, we assume the slope is the same regardless of ForeignWorker status. Also notice that we can no longer interpret the coefficient for Telephone as the unique effect of having a telephone on credit class. The interaction is significant, i.e., the effect of having a telephone depends on whether the customer is a foreign worker, so we must consider the interaction term in addition to the main effect.

It also follows that we cannot simply exponentiate the main effect for Telephone to obtain the change in odds of Good credit associated with having a telephone. If we wish to consider an odds ratio in this case, we must first fix the value of ForeignWorker. Then, we can compute the odds in both cases (telephone versus no telephone) and divide to obtain an odds ratio. If we set ForeignWorker to 0, then having a telephone is associated with an increase in the odds of Good credit by a factor of nearly 10.

pred_odds[4, 5] / pred_odds[2, 5]
## [1] 9.924342

If we set ForeignWorker to 1, however, having a telephone is associated with a slight decrease in the odds of Good credit.

pred_odds[3, 5] / pred_odds[1, 5]
## [1] 0.8372396

These very different odds ratios are consistent with our earlier finding that the interaction is significant.
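
Equivalently, because Age is held fixed, its term cancels in each ratio, and we can read these odds ratios directly off the fitted coefficients:

b <- coef(glm_interact_fit)
exp(b["Telephone"])                                 # ForeignWorker = 0: matches 9.924 above
exp(b["Telephone"] + b["ForeignWorker:Telephone"])  # ForeignWorker = 1: matches 0.837 above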

References

Casella, G., and R. L. Berger. 2002. Statistical Inference. Duxbury Advanced Series in Statistics and Decision Sciences. Thomson Learning. https://books.google.com/books?id=0x_vAAAAMAAJ.

Dheeru, Dua, and Efi Karra Taniskidou. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.

Fitzmaurice, Garrett M, Nan M Laird, and James H Ware. 2012. Applied Longitudinal Analysis. Vol. 998. John Wiley & Sons.