Make sure the following package is installed. You can use the following code to perform the installation:
install.packages("lme4")
Now load the packages:
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("ggplot2")
library("Sleuth3")
library("lme4")
## Loading required package: Matrix
Generalized linear models are a class of models that include linear regression, logistic regression, and Poisson regression as specific examples. The glm() function allows us to fit all of these models, although we usually use the lm() function for the special (and important) case of linear regression. The linear regression model can be fit using either the glm() or lm() function, but running summary() on the lm() fit is more informative.
m_lm <- lm(Velocity ~ Distance, data = case0701)
m_glm <- glm(Velocity ~ Distance, data = case0701)
where the default family in the glm() function is gaussian (i.e. the normal distribution).
summary(m_lm)
##
## Call:
## lm(formula = Velocity ~ Distance, data = case0701)
##
## Residuals:
## Min 1Q Median 3Q Max
## -398.02 -157.03 -12.88 148.14 506.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.25 83.52 -0.482 0.635
## Distance 453.63 75.30 6.024 4.61e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 233.2 on 22 degrees of freedom
## Multiple R-squared: 0.6226, Adjusted R-squared: 0.6054
## F-statistic: 36.29 on 1 and 22 DF, p-value: 4.608e-06
summary(m_glm)
##
## Call:
## glm(formula = Velocity ~ Distance, data = case0701)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.25 83.52 -0.482 0.635
## Distance 453.63 75.30 6.024 4.61e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 54385.57)
##
## Null deviance: 3170091 on 23 degrees of freedom
## Residual deviance: 1196482 on 22 degrees of freedom
## AIC: 333.71
##
## Number of Fisher Scoring iterations: 2
The dispersion parameter is the estimate of the error variance; its square root (about 233.2) is the residual standard error reported by lm().
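We can confirm this from the fitted object; a quick sketch:
# The square root of the GLM dispersion estimate matches the
# residual standard error from lm()
sqrt(summary(m_glm)$dispersion)  # approximately 233.2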
Logistic regression can be accomplished using the glm() function and setting the family argument to binomial. There are two different ways to set up the logistic regression model fit depending on whether the data are grouped or not.
For ungrouped data, the response can be a factor where the first level is failure and all other levels are success.
levels(case2002$LC)
## [1] "LungCancer" "NoCancer"
m <- glm(LC ~ CD,
data = case2002,
family = binomial)
m
##
## Call: glm(formula = LC ~ CD, family = binomial, data = case2002)
##
## Coefficients:
## (Intercept) CD
## 1.53541 -0.05113
##
## Degrees of Freedom: 146 Total (i.e. Null); 145 Residual
## Null Deviance: 187.1
## Residual Deviance: 179.6 AIC: 183.6
Thus, this logistic regression analysis has "success" being "NoCancer".
It is better practice to be explicit, for example by using an equality comparison so the response is treated as TRUE/FALSE.
m <- glm(LC == "LungCancer" ~ CD,
data = case2002,
family = binomial)
m
##
## Call: glm(formula = LC == "LungCancer" ~ CD, family = binomial, data = case2002)
##
## Coefficients:
## (Intercept) CD
## -1.53541 0.05113
##
## Degrees of Freedom: 146 Total (i.e. Null); 145 Residual
## Null Deviance: 187.1
## Residual Deviance: 179.6 AIC: 183.6
Care needs to be taken with this approach because any value other than "LungCancer" (e.g. a typo) will be treated as a failure.
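A quick sanity check before fitting is to tabulate the response so stray values are easy to spot:
# Verify the response contains only the expected values
table(case2002$LC)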
Finally, we can use a 0-1 coding where 1 is a success and 0 is a failure.
d <- case2002 |>
  mutate(lung_cancer = 1 * (LC == "LungCancer"))
m <- glm(lung_cancer ~ CD,
         data = d,
         family = binomial)
m
##
## Call: glm(formula = lung_cancer ~ CD, family = binomial, data = d)
##
## Coefficients:
## (Intercept) CD
## -1.53541 0.05113
##
## Degrees of Freedom: 146 Total (i.e. Null); 145 Residual
## Null Deviance: 187.1
## Residual Deviance: 179.6 AIC: 183.6
To obtain a more informative summary, use summary().
summary(m)
##
## Call:
## glm(formula = lung_cancer ~ CD, family = binomial, data = d)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.53541 0.37707 -4.072 4.66e-05 ***
## CD 0.05113 0.01939 2.637 0.00836 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 187.14 on 146 degrees of freedom
## Residual deviance: 179.62 on 145 degrees of freedom
## AIC: 183.62
##
## Number of Fisher Scoring iterations: 4
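To turn the fitted model into estimated probabilities, use predict() with type = "response"; a minimal sketch, where nd is a hypothetical name for the new data:
# Estimated lung cancer probabilities at 0, 10, and 20 cigarettes per day
nd <- data.frame(CD = c(0, 10, 20))
predict(m, newdata = nd, type = "response")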
Fit a logistic regression model to determine the relationship between survival in the Donner Party (case2001) and the age and sex of the individual.
m <- glm(Status == "Survived" ~ Age + Sex,
data = case2001,
family = binomial)
summary(m)
##
## Call:
## glm(formula = Status == "Survived" ~ Age + Sex, family = binomial,
## data = case2001)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.23041 1.38686 2.329 0.0198 *
## Age -0.07820 0.03728 -2.097 0.0359 *
## SexMale -1.59729 0.75547 -2.114 0.0345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.827 on 44 degrees of freedom
## Residual deviance: 51.256 on 42 degrees of freedom
## AIC: 57.256
##
## Number of Fisher Scoring iterations: 4
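Logistic regression coefficients are log odds ratios, so exponentiating them gives odds ratios:
# Odds ratios: each additional year of age multiplies the odds of
# survival by exp(-0.0782), i.e. about 0.92
exp(coef(m))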
Oftentimes, data are grouped by explanatory variable values. For example, we can group the lung cancer data set by the number of cigarettes per day. The most typical way these data are stored is to have the number of observations and the number of those that were "successful".
# Create grouped data
lung_grouped <- case2002 |>
group_by(CD) |>
summarize(n = n(),
y = sum(LC == "LungCancer"))
lung_grouped
## # A tibble: 19 × 3
## CD n y
## <int> <int> <int>
## 1 0 17 1
## 2 1 2 0
## 3 2 2 0
## 4 3 1 1
## 5 4 2 0
## 6 5 2 0
## 7 6 3 1
## 8 8 2 0
## 9 10 15 4
## 10 12 5 1
## 11 15 27 12
## 12 16 1 1
## 13 18 2 0
## 14 20 38 16
## 15 25 15 7
## 16 30 7 3
## 17 37 1 0
## 18 40 4 2
## 19 45 1 0
You can still fit the regression model using the glm() function, but you need to use the cbind() function as the response, where the first term is the number of successes and the second term is the number of failures. The logistic regression model fit looks like the following. (Be careful not to use the c() function or you will get a "variable lengths differ" error.)
# Use cbind(successes, failures)
m <- glm(cbind(y, n-y) ~ CD,
data = lung_grouped,
family = binomial)
summary(m)
##
## Call:
## glm(formula = cbind(y, n - y) ~ CD, family = binomial, data = lung_grouped)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.53541 0.37707 -4.072 4.66e-05 ***
## CD 0.05113 0.01939 2.637 0.00836 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.651 on 18 degrees of freedom
## Residual deviance: 21.141 on 17 degrees of freedom
## AIC: 48.879
##
## Number of Fisher Scoring iterations: 4
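Note that the coefficient estimates, standard errors, and p-values match the ungrouped analysis exactly; only the deviances and AIC change, because the likelihood is now computed from the grouped binomial counts rather than the individual Bernoulli responses.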
Group the Donner Party data (case2001) by age and sex. Then re-run the logistic regression analysis from the previous activity on these grouped data.
donner_grouped <- case2001 |>
group_by(Age, Sex) |>
summarize(
y = sum(Status == "Survived"),
n = n())
## `summarise()` has grouped output by 'Age'. You can override using the `.groups`
## argument.
m <- glm(cbind(y, n-y) ~ Age + Sex,
data = donner_grouped,
family = binomial)
summary(m)
##
## Call:
## glm(formula = cbind(y, n - y) ~ Age + Sex, family = binomial,
## data = donner_grouped)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.23041 1.38697 2.329 0.0199 *
## Age -0.07820 0.03729 -2.097 0.0360 *
## SexMale -1.59729 0.75550 -2.114 0.0345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37.012 on 27 degrees of freedom
## Residual deviance: 26.441 on 25 degrees of freedom
## AIC: 42.347
##
## Number of Fisher Scoring iterations: 5
Recall that the Poisson distribution is a distribution for a count with no clear maximum. For Poisson regression, our response is a count with no clear maximum and we can model the mean/expectation of that count as a function of explanatory variables.
To run a Poisson regression in R, use the glm() function with the family argument set to poisson.
# Poisson regression
m <- glm(Matings ~ Age,
data = case2201,
family = poisson)
summary(m)
##
## Call:
## glm(formula = Matings ~ Age, family = poisson, data = case2201)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## Age 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
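Poisson regression coefficients are on the log scale, so exponentiating gives multiplicative effects on the expected count:
# Each additional year of age multiplies the expected number of
# matings by exp(0.06869), i.e. about 1.07
exp(coef(m))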
Investigate the relationship between the number of salamanders and forest age using a Poisson regression analysis. Data are available in case2202.
m <- glm(Salamanders ~ ForestAge,
data = case2202,
family = poisson)
summary(m)
##
## Call:
## glm(formula = Salamanders ~ ForestAge, family = poisson, data = case2202)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5040207 0.1401385 3.597 0.000322 ***
## ForestAge 0.0019151 0.0004155 4.609 4.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 190.22 on 46 degrees of freedom
## Residual deviance: 170.65 on 45 degrees of freedom
## AIC: 259.7
##
## Number of Fisher Scoring iterations: 6
To compare nested linear models, we used the F-test through the R function anova(). For GLMs, we will use a likelihood ratio test through the same function. Asymptotically (i.e. as the sample size grows without bound), the statistic for this test has a chi-squared distribution. Thus, we specify this test by setting the test argument of anova() to "Chi".
Here is a test to determine whether we should have included an interaction between birdkeeping and the number of cigarettes per day.
mA <- glm(LC ~ BK + CD, data = case2002, family = binomial)
mI <- glm(LC ~ BK * CD, data = case2002, family = binomial)
anova(mA, mI, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: LC ~ BK + CD
## Model 2: LC ~ BK * CD
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 144 164.36
## 2 143 164.35 1 0.0068661 0.934
Note that the title of this table is "Analysis of Deviance", which indicates you are working with GLMs rather than linear models.
Use a likelihood ratio test to determine whether an interaction between Age and Sex should be included in the logistic regression model of survival probability in the Donner party.
mA <- glm(Status ~ Age + Sex, data = case2001, family = binomial)
mI <- glm(Status ~ Age * Sex, data = case2001, family = binomial)
anova(mA,mI, test="Chi")
## Analysis of Deviance Table
##
## Model 1: Status ~ Age + Sex
## Model 2: Status ~ Age * Sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 42 51.256
## 2 41 47.346 1 3.9099 0.048 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To fit a mixed effect model in R, use the lme4 package. For a linear regression model with random effects, use the lmer() function.
Here is a random intercept model.
library("lme4")
m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 | Subject)
## Data: sleepstudy
##
## REML criterion at convergence: 1786.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2257 -0.5529 0.0109 0.5188 4.2506
##
## Random effects:
## Groups Name Variance Std.Dev.
## Subject (Intercept) 1378.2 37.12
## Residual 960.5 30.99
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.4051 9.7467 25.79
## Days 10.4673 0.8042 13.02
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.371
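To inspect the estimated subject-level effects, lme4 provides ranef(); a brief sketch:
# Estimated deviation of each subject's intercept from the
# overall (fixed) intercept
head(ranef(m)$Subject)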
Here is a random slope model; the (Days | Subject) term gives each subject its own intercept and slope.
m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
## Data: sleepstudy
##
## REML criterion at convergence: 1743.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9536 -0.4634 0.0231 0.4634 5.1793
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Subject (Intercept) 612.10 24.741
## Days 35.07 5.922 0.07
## Residual 654.94 25.592
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.405 6.825 36.838
## Days 10.467 1.546 6.771
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.138
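The two mixed models can be compared with anova(), which refits them using maximum likelihood and performs a likelihood ratio test; a sketch assuming the fits are stored under the hypothetical names m1 and m2:
m1 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
m2 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
anova(m1, m2)  # likelihood ratio test for the random slope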
For any generalized linear model, e.g. logistic and Poisson regression, you can add random effects using the glmer() function.
Here is a random intercept model.
m <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp,
family = binomial)
summary(m)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(incidence, size - incidence) ~ period + (1 | herd)
## Data: cbpp
##
## AIC BIC logLik deviance df.resid
## 194.1 204.2 -92.0 184.1 51
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3816 -0.7889 -0.2026 0.5142 2.8791
##
## Random effects:
## Groups Name Variance Std.Dev.
## herd (Intercept) 0.4123 0.6421
## Number of obs: 56, groups: herd, 15
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3983 0.2312 -6.048 1.47e-09 ***
## period2 -0.9919 0.3032 -3.272 0.001068 **
## period3 -1.1282 0.3228 -3.495 0.000474 ***
## period4 -1.5797 0.4220 -3.743 0.000182 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) perid2 perid3
## period2 -0.363
## period3 -0.340 0.280
## period4 -0.260 0.213 0.198
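A Poisson random effects model follows the same pattern with family set to poisson. Here is a minimal sketch using the grouseticks data that ships with lme4; the particular model is illustrative, not definitive:
# Tick counts on grouse chicks, with a random intercept for brood
m <- glmer(TICKS ~ YEAR + cHEIGHT + (1 | BROOD),
           data = grouseticks,
           family = poisson)
summary(m)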