POLS 7050
Spring 2009
April 23, 2009
Pulling it All Together: Generalized Linear Models
Throughout the course, we’ve treated each of the models we’ve discussed as a discrete entity. In
fact, however, most of those models can be considered under a unified
framework, as examples of a class of models known as generalized linear models (usually
abbreviated GLMs).
To understand GLMs, we first have to get a handle on the exponential family of probability
distributions. Once we have that, we can show how a bunch of the models we’ve been
learning about are special cases of the more general class of GLMs.
The Exponential Family
We’ll start with a random variable Z, which has realizations called z. We are typically
interested in the conditional density (PDF) of Z, where the conditioning is on some parameter
or parameters ψ:

f(z|ψ) = Pr(Z = z|ψ)
A PDF is a member of the exponential family of probability functions¹ if it can be written in
the form:

f(z|ψ) = r(z)s(ψ) exp[q(z)h(ψ)] (1)
where r(·) and q(·) are functions of z that do not depend on ψ, and (conversely) s(·) and h(·)
are functions of ψ that do not depend on z. For reasons that will be apparent in a minute,
it is also necessary that r(z) > 0 and s(ψ) > 0.
With a little bit of algebra, we can rewrite (1) as:
f(z|ψ) = exp[ln r(z) + ln s(ψ) + q(z)h(ψ)]. (2)
Gill dubs the first two components the “additive” part, and the third the “interactive”
bit. Here, z will be the “response” variable, and we’ll use ψ to introduce covariates. It’s
important to emphasize that any PDF that can be written in this way is a member of the
exponential family; as we’ll see, this encompasses a really broad range of interesting densities.
¹ This presentation owes a good deal to Jeff Gill’s exposition of GLMs at the Boston area JASA meeting, Feb. 28, 1998.
“Canonical” Forms
The “canonical” form of the distribution results when q(z) = z. This means that we can
always “get to” the canonical form by transforming Z into Y according to y = q(z). In
canonical form, then, Y = Z is our “response” variable; in such circumstances, h(ψ) is
referred to as the natural parameter of the distribution. We can write

θ = h(ψ)

and recognize that, in some instances, h(ψ) = ψ (so that θ = ψ); the function h(·) is known as the
“link” function. Writing things this way allows us to collect terms from (2) into a simpler formulation:

f(y|θ) = exp[yθ − b(θ) + c(y)]. (3)
Here,

• b(θ) – sometimes called a “normalizing constant” – is a function solely of the parameter(s) θ,

• c(y) is a function solely of y, the (potentially transformed) response variable, and

• yθ is a multiplicative term between the response and the parameter(s).
Some Familiar Exponential Family Examples
If all of this seems a bit abstract, it shouldn’t. The key point to remember is that any
probability distribution that can be written in the forms in (1) - (3) is a member of the
exponential family. Consider three quick examples.
First, we know that if a variable Y follows a Poisson distribution, we can write its density
as:

f(y|λ) = exp(−λ)λ^y / y! . (4)

Rearranging a bit, we can rewrite (4) as:

f(y|λ) = exp{ln[exp(−λ)λ^y / y!]}
       = exp[y ln(λ) − λ − ln(y!)] (5)

Here, we have θ = ln(λ) and (equivalently) λ = b(θ) = exp(θ), while c(y) = −ln(y!). The
Poisson is thus an example of a one-parameter exponential family distribution.
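As a quick numerical check of (4) and (5) – a minimal sketch, where the particular values of λ and y are arbitrary choices – Stata’s built-in Poisson probability function and the exponential-family form should agree:

* Poisson probability via the built-in function
scalar lambda = 2.5
scalar y = 3
display poissonp(lambda, y)

* the same probability via the exponential-family form in (5)
display exp(y*ln(lambda) - lambda - lnfactorial(y))

Both display commands should print the same probability.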
Distributional parameters that are outside of the formulation in (1) - (3) are usually called
nuisance parameters. They are typically not of central interest to researchers, though we
still must treat them in the exponential family formulation. A common kind of nuisance
parameter is a “scale parameter” – a parameter that defines the scale of the response variable
(that is, it “stretches” or “shrinks” the axis of that variable). We can incorporate such a
parameter (call it φ) through a slight generalization of (3):
f(y|θ, φ) = exp{[yθ − b(θ)] / a(φ) + c(y, φ)} (6)
When no scale parameter is present, a(φ) = 1 and (6) reverts to (3). We can illustrate such
a distribution by considering our old friend, the normal distribution:
f(y|µ, σ²) = (1 / √(2πσ²)) exp[−(y − µ)² / (2σ²)] (7)
In the normal, σ² can be thought of as a scale parameter – it defines the “scale” (variance)
of Y. We can rewrite (7) in exponential form as:
f(y|µ, σ²) = exp[−(1/2)ln(2πσ²) − (1/(2σ²))(y² − 2yµ + µ²)]

           = exp[−(1/2)ln(2πσ²) − (1/(2σ²))y² + (1/(2σ²))2yµ − (1/(2σ²))µ²]

           = exp[yµ/σ² − µ²/(2σ²) − y²/(2σ²) − (1/2)ln(2πσ²)]

           = exp{(yµ − µ²/2)/σ² − (1/2)[y²/σ² + ln(2πσ²)]} (8)
Here, θ = µ, which means that:

• yθ = yµ,

• b(θ) = µ²/2,

• a(φ) = σ² is the scale (nuisance) parameter, and

• c(y, φ) = −(1/2)[y²/σ² + ln(2πσ²)].
The Normal distribution is thus a member of the exponential family, and is in canonical form
as well. Other distributions that are canonical members of the family include the binomial
(with the logit of p as its canonical parameter), the gamma, and the inverse Gaussian. Non-canonical
family members include the lognormal and Weibull distributions.
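The same kind of numerical check works for the normal decomposition in (8) – again a minimal sketch, with arbitrary values for y, µ, and σ²:

* normal density via the built-in function (which takes the standard deviation)
scalar mu = 1
scalar s2 = 4
scalar y = 0.5
display normalden(y, mu, sqrt(s2))

* the same density via the exponential-family form in (8)
display exp((y*mu - mu^2/2)/s2 - 0.5*(y^2/s2 + ln(2*_pi*s2)))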
Little Red Likelihood
To estimate and make inferences about the parameters of interest, we can derive a general
likelihood for the class of models expressible as (6). That (log-)likelihood is:
ln L(θ, φ|y) = ln f(y|θ, φ)

             = ln(exp{[yθ − b(θ)]/a(φ) + c(y, φ)})

             = [yθ − b(θ)]/a(φ) + c(y, φ). (9)
This expresses the log-likelihood of the parameters in terms of the data. To obtain estimates,
we maximize this function in the usual way: by taking its first derivative with respect to the
parameter(s) of interest θ, setting it equal to zero, and solving. That derivative is:

∂ln L(θ, φ|y)/∂θ ≡ S = ∂/∂θ {[yθ − b(θ)]/a(φ) + c(y, φ)}

                     = [y − ∂b(θ)/∂θ] / a(φ). (10)
In GLM-speak, the quantity in (10) – the partial derivative of the log-likelihood with respect
to the parameter(s) of interest – is typically called the “score function.” It has a number of
important properties:

• S is a sufficient statistic for θ.

• E(S) = 0.

• Var(S) ≡ I(θ) = E[(S)²|θ], because E(S) = 0. This is known as the Fisher information
(hence, I(θ)).
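As a worked instance of these properties, consider the Poisson results from above, where a(φ) = 1 and b(θ) = exp(θ). The score is

S = y − ∂exp(θ)/∂θ = y − exp(θ) = y − λ,

and taking expectations gives E(S) = E(Y) − λ = 0, just as the second property requires.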
Finally, assuming general regularity conditions (which all exponential family distributions
meet), it is the case that the conditional expectation of Y is just:

E(Y) = ∂b(θ)/∂θ (11)

and

Var(Y) = a(φ) × ∂²b(θ)/∂θ² (12)
(see, e.g., McCullagh and Nelder 1989, pp. 26-29 for a derivation). This means that for
exponential family distributions we can get the mean of Y directly from b(θ), and the
variance of Y from b(θ) and a(φ). So, for our Poisson example, (11) and (12) mean that:
E(Y) = ∂exp(θ)/∂θ
     = exp(θ)|θ=ln(λ)
     = λ

and

Var(Y) = 1 × ∂²exp(θ)/∂θ² |θ=ln(λ)
       = exp[ln(λ)]
       = λ
which is exactly what we’d expect for the Poisson. Likewise, from our normal example above,
(11) and (12) imply that:
E(Y) = ∂(θ²/2)/∂θ
     = θ|θ=µ
     = µ

and

Var(Y) = σ² × ∂²(θ²/2)/∂θ²
       = σ² × ∂θ/∂θ
       = σ²
There are lots more interesting things about exponential family distributions, but we’ll skip
them in the interest of saving time.
Generalized Linear Models
What does all this have to do with regression-like models? To answer that, let’s return to
our old friend, the continuous linear-normal regression model:
Y_i = X_iβ + u_i (13)

where Y_i is the N × 1 vector for the response/“dependent” variable, X_i is the N × k matrix
of covariates, β is a k × 1 vector of parameters, and u_i is an N × 1 vector of i.i.d. Normal
errors with mean zero and constant variance σ². In this standard formulation, we usually
think of X_iβ as the “systematic component” of Y_i, such that

E(Y_i) ≡ µ_i = X_iβ (14)
In this setup, the variable Y is distributed Normally, with mean µ_i and variance σ², and
we can estimate the parameters β (and σ²) in the usual way.
The “Generalized” Part
We can generalize (14) by allowing the expected value of Y to vary with Xβ according to a
function g(·):
g(µ_i) = X_iβ. (15)

g(·) is generally referred to as a “link function,” because its purpose is to “link” the expected
value of Y_i to X_iβ in a general, flexible way. We require g(·) to be continuously
twice-differentiable (that is, “smooth”) and monotonic in µ.
Importantly, note that we are not transforming the value of Y itself, but rather its expected
value. In fact, in this formulation, X_iβ informs a function of E(Y_i), not E(Y_i) itself. In
this way, g[E(Y_i)] is required to meet all the usual Gauss-Markov assumptions (linearity,
homoscedasticity, normality, etc.) but E(Y_i) itself need not do so.
It is useful to write

η_i = X_iβ

where η_i is often known as the “linear predictor” (or “index”); this means that we can rewrite
(15) as:

η_i = g(µ_i) (16)

Of course, this also means that we can invert the function g(·) to express µ_i as a function
of X_iβ:

µ_i = g⁻¹(η_i)
    = g⁻¹(X_iβ) (17)
Pulling all this together, we can think of a GLM as having three key components:
1. The random component is the stochastic part of Y; it is distributed according to some
exponential family distribution with mean E(Y_i) = µ_i.

2. The systematic component is the covariates and their associated parameters: η_i = X_iβ.

3. The link between the systematic and random parts of the model: g(µ_i) = η_i and,
equivalently, g⁻¹(η_i) = µ_i.
The “generalized” part of GLM thus comes from two things. First, unlike the classical linear
regression model, the distribution of Y need not be Normal, and in fact can come from any
member of the exponential family we discussed above. Second, the link function in (15) need
not be an identity function, but instead can be any monotone, differentiable function.
Bringing In the Exponential Family
As we outlined above, we can think of distributions from the exponential family as having
a canonical parameter θ. In a GLM, we link the systematic part of the model (that is, the
covariates) to that parameter through the link function. Generically:
θ_i = g(µ_i)
    = η_i
    = X_iβ

which defines the expected value of Y. This means that, by setting θ_i = η_i, we are allowing
the transformed mean of the distribution to vary linearly in the covariates. In this light, one
can imagine extending the same idea to any of the distributions in the exponential family,
through the link function g⁻¹(·):

g⁻¹(θ_i) = µ_i (18)
The link function that makes the linear predictor the same as the parameter θ is known as
the canonical link function. A useful aspect of the canonical link function is that it ensures
that there is a sufficient statistic of rank k (that is, the number of parameters in β).
Some Brief Examples
Linear-Normal
As a first example, consider data where Y is a continuous, unbounded, and normally-distributed
variable. In such an instance, the natural conditional distribution to choose
is the Normal distribution with mean µ and variance σ²:

f(y|µ, σ²) = N(µ, σ²)

Similarly, because there is no need to restrict the effect of X on Y, the simplest (and most
natural) link function is the identity function:

µ_i = η_i.

This combination of link function and distributional family leads to a model with:

µ_i ≡ θ_i = η_i

Y_i ∼ N(µ_i, σ²)
that is, the standard linear-normal regression model.
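We can verify this equivalence in Stata – a minimal sketch using the bundled auto dataset, where the particular variables are arbitrary choices:

* OLS and the Gaussian/identity GLM produce the same estimates
sysuse auto, clear
regress mpg weight
glm mpg weight, family(gaussian) link(identity)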
Binary
Now suppose instead that our response variable Y is binary. The natural distributional
family for a binary variable is the Bernoulli (that is, a binomial with n = 1):

f(y|π) = π^y (1 − π)^(1−y).

The canonical link function for the Bernoulli (and, in fact, for the binomial in general) is the
logit:

θ_i = ln[µ_i / (1 − µ_i)]
This restricts the possible outcomes to fall in the [0, 1] interval, and thus restricts the impact
of X as well. It yields a model with:
µ_i = g⁻¹(θ_i)
    = exp(η_i) / [1 + exp(η_i)]

Y_i ∼ Bernoulli(µ_i)
that is, a standard logit model.
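The same check works for the logit – again a minimal sketch on the bundled auto data:

* logit and the binomial/logit GLM produce the same estimates
sysuse auto, clear
logit foreign mpg
glm foreign mpg, family(binomial) link(logit)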
Count/Poisson
Finally, consider the case where Y_i is an event count. As we discussed some time ago, the
natural distribution for a count of conditionally independent events is a Poisson:

f(y|λ) = exp(−λ)λ^y / y!

The canonical parameter of the Poisson is θ = ln(λ); because we require the mean count λ to be strictly
positive, the natural (and, as it happens, canonical) link function is the one which preserves
the positivity of λ, the log:

ln(λ_i) = η_i
which in turn yields a model with:

µ_i = g⁻¹(θ_i)
    = exp(η_i)

Y_i ∼ Poisson(λ_i)
that is, a Poisson model.
Table 1 lists several commonly-used GLMs, including their canonical link functions.
The biggest point here is that a large number of models – including the classical linear
regression model, logit/probit/cloglog models for binary responses, and Poisson and negative
binomial models for event counts – all fall under the GLM rubric. In addition, GLMs
also encompass a number of models (such as the gamma and the inverse Gaussian, both
for positive-continuous outcomes) that we have not discussed in class. Finally, the GLM
framework is a very general one, in that one can mix-and-match marginal distributions and
link functions to fit specific data problems; we’ll work through an example of this below.
Table 1: Common Flavors of GLMs

Distribution   Range of Y            Link(s)                                 Inverse Link
Normal         (−∞, ∞)               Identity: θ = µ (Canonical)             θ
Binomial       [0, 1]                Logit: θ = ln[µ/(1 − µ)] (Canonical)    exp(θ)/[1 + exp(θ)]
                                     Probit: θ = Φ⁻¹(µ)                      Φ(θ)
                                     C-Log-Log: θ = ln[−ln(1 − µ)]           1 − exp[−exp(θ)]
Poisson        [0, ∞) (integers)     Log: θ = ln(µ) (Canonical)              exp(θ)
Gamma          (0, ∞)                Reciprocal: θ = 1/µ (Canonical)         1/θ
Some Examples
As a practical matter, then, GLMs involve three steps:

1. Selecting a distribution f(Y) for the stochastic component (from the family of exponential
distributions) that best reflects the underlying data-generating process of Y.

2. Selecting a link function g(·) appropriate for the nature of the response Y.

3. Specifying the model – that is, choosing variables for inclusion in X.
I won’t go into detail about parameter estimation in the context of GLMs; suffice it to say
that the more-or-less standard methods we’ve been using all semester can be used here as
well. In addition, it is also possible to estimate GLMs using iteratively reweighted least
squares (IRLS); see McCullagh and Nelder (1989, pp. 40-42). This has the advantage of
being equivalent to MLE while often being computationally easier.
For now, we’ll illustrate the equivalence of GLMs to many of the models we’ve become
familiar with, using the data from last week’s homework exercise:
. su
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
amici | 7161 .8419215 2.189712 0 39
us | 0
year | 7157 71.93014 9.20792 53 86
fedparty | 7161 .3742494 .4839623 0 1
multlaw | 7156 .1489659 .3560797 0 1
-------------+--------------------------------------------------------
conflict | 7161 .0646558 .2459346 0 1
constit | 7161 .2535959 .4350993 0 1
Replicating OLS, Logit, Probit, and Poisson
The handout shows how one can use the glm command in Stata to obtain results exactly
like those we get using regress, logit, probit, and poisson, respectively; see pp. 2-5
of the handout. I won’t go into a lot of detail on these, since there
isn’t much to say, except to note that:
1. Technically, glm yields exactly the same results as regress, logit, etc. only in those
instances where the link function is the canonical one. In other instances – using a
probit or c-log-log link with a binary response, for example – the standard errors are
only asymptotically equivalent. As a practical matter, this is usually not a big issue;
it certainly isn’t in the example data, where the N is large. (A quick simulated
illustration of the canonical-link case appears after this list.)
2. There are also some special considerations to think about when using glm to estimate
negative binomial models (not shown). Read more on this in the Stata manuals, or in
one or more of the references below, if it is something you care about.
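Here is a minimal sketch of the canonical-link equivalence on simulated data (the data-generating values are arbitrary, and this is not the homework dataset):

* simulate a Poisson response with a known linear predictor
clear
set seed 7050
set obs 5000
generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.25*x))

* the dedicated ML command and the canonical-link GLM match exactly
poisson y x
glm y x, family(poisson) link(log)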
An Extension: Models of Supreme Court Coalition Size
GLMs can also do things that none of the models heretofore discussed can do. To take
an example, consider the question of the size of a majority opinion coalition on the U.S.
Supreme Court. We know that:
• To constitute a majority, the opinion coalition must be made up of a number of justices
equal to or greater than half of the justices who heard the case. However,

• Not every case is heard and voted on by every justice – sometimes there are vacancies,
sometimes justices recuse themselves for various reasons, and sometimes both occur.

As a practical matter, then, majority coalition sizes range from three (in 3-2 decisions, say,
where four justices are absent or recused) to nine (in 9-0 unanimous cases). But the size of the
majority coalition is always at least partially determined by the number of justices voting in
the case.
Suppose we define N_i as the number of justices voting in case i, and denote the size of the
majority coalition in that case as M_i. Then it makes some sense² to think of M_i as a binomial
function:

f(M_i | N_i, p_i) ∝ p_i^(M_i) (1 − p_i)^(N_i − M_i)

² To be brutally honest, this is not completely right. What we’d really want is a binomial with a lower limit at the integer value of 0.5N_i, since we’re only looking at majority coalitions. But this will do for pedagogical purposes.
We can estimate a model of a variable like M_i using a GLM framework. In particular, we can
estimate a model of M_i where N_i is allowed to vary across i, and where p_i – the probability
of a marginal justice joining the majority coalition – is allowed to vary systematically with
covariates X_i. By choosing the binomial distribution from the exponential family, and a
logit link (the canonical link for the binomial distribution, and a natural choice, given that
p_i is a probability), we get a model which is:

p_i = exp(X_iβ) / [1 + exp(X_iβ)]

M_i ∼ Binomial(N_i, p_i)
We’ll estimate this GLM on data on the U.S. Supreme Court between 1953 and 2005, focusing
on three covariates:

• term, the term in which the decision was handed down,

• alt_prec, a dummy variable indicating whether (=1) or not (=0) the decision in
question altered existing Supreme Court precedent, and

• unconst, an indicator of whether (=1) or not (=0) the decision formally declared a
federal, state, or local statute to be unconstitutional.
Summary statistics on the data look like this:
. su nmajority nvoting term alt_prec unconst
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
nmajority | 12491 7.045633 1.474143 3 9
nvoting | 12580 8.585453 .754199 4 9
term | 12583 1977.332 13.03965 1953 2005
alt_prec | 12583 .0168481 .1287074 0 1
unconst | 12583 .0665978 .249334 0 1
The results of the glm estimation are:
. glm nmajority term alt_prec unconst, family(binomial nvoting) link(logit) irls
Generalized linear models No. of obs = 12491
Optimization : MQL Fisher scoring Residual df = 12487
(IRLS EIM) Scale parameter = 1
Deviance = 25766.28517 (1/df) Deviance = 2.063449
Pearson = 20745.56508 (1/df) Pearson = 1.661373
Variance function: V(u) = u*(1-u/nvoting) [Binomial]
Link function : g(u) = ln(u/(nvoting-u)) [Logit]
BIC = -92020.63
------------------------------------------------------------------------------
| EIM
nmajority | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
term | -.0014731 .0006069 -2.43 0.015 -.0026626 -.0002837
alt_prec | -.2685583 .0560735 -4.79 0.000 -.3784603 -.1586563
unconst | -.2798177 .0294079 -9.52 0.000 -.3374563 -.2221792
_cons | 4.457874 1.200241 3.71 0.000 2.105444 6.810304
------------------------------------------------------------------------------
Note that we specify the number of “trials” in the binomial in the family() option to the
glm command; here, that is equal to the variable nvoting. The results suggest that:

• The sizes of majority coalitions declined (on average) over the past 50 years, and

• Both precedent-altering decisions and those declaring statutes unconstitutional garnered
smaller average majority coalitions than decisions that did not do so. This is
unsurprising, given that such decisions are more likely to be controversial, and therefore
less likely to command large majorities or a unanimous Court.
Of course, the biggest appeal of this model is that the underlying statistical structure is an
especially good “fit” to the data-generating process.
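Post-estimation tools work here as well. For instance, a minimal sketch (run immediately after the glm command above; mhat is a hypothetical variable name) recovers fitted coalition sizes:

* fitted values on the response scale: E(M_i) = nvoting x invlogit(X_i b)
predict mhat, mu
summarize nmajority mhat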
Software, Further Reading, etc.
Software
As the examples illustrate, one can do GLMs all day long using Stata. Its glm command is
very flexible; the main options are:

• family(), which includes:
  – gaussian (that is, normal),
  – igaussian (inverse Gaussian),
  – binomial (Bernoulli/binomial),
  – poisson (Poisson, duh...),
  – nbinomial (negative binomial), and
  – gamma (gamma).

• link(), which includes:
  – identity (µ = η),
  – log (ln(µ) = η),
  – logit (ln[µ/(1 − µ)] = η),
  – probit (Φ⁻¹(µ) = η),
  – cloglog (ln[−ln(1 − µ)] = η),
  – power (µ^ℓ = η, with ℓ ∈ {..., −2, −1, 0, 1, 2, ...}),
  – opower (“odds power,” {[µ/(1 − µ)]^ℓ − 1}/ℓ = η, with ℓ as in power),
  – nbinomial (negative binomial, ln[µ/(µ + k)] = η, with k = 1 or user-defined),
  – loglog (−ln[−ln(µ)] = η), and
  – logc (“log-complement,” ln(1 − µ) = η).
One can also specify an exposure() or offset() variable, where appropriate, “robust”
variance-covariance estimates, the method of maximization (IRLS or MLE), and other options
specific to the distributions in question. Finally, specifying the eform option tells
Stata to report exp(β̂) rather than β̂; this is handy when the exponentiated coefficients take
on useful forms (as is the case when the model is a logit, or a Poisson or negative binomial).
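For example – a minimal sketch on the bundled auto data, with an arbitrary covariate – adding eform to a binomial/logit GLM reports odds ratios rather than log-odds coefficients:

* same binomial/logit GLM as in the earlier sketch, but the reported
* quantities are now odds ratios, exp(b)
sysuse auto, clear
glm foreign mpg, family(binomial) link(logit) eform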
The Stata help for glm helpfully notes the following:
If you specify both family() and link(), note that not all combinations make
sense. You may choose from the following combinations:
| id log logit probit clog pow opower nbinomial loglog logc
----------+-------------------------------------------------------------------
Gaussian | x x x
inv. Gau. | x x x
binomial | x x x x x x x x x
Poisson | x x x
neg. bin. | x x x x
gamma | x x x
Post-estimation, one can do all the “usual” things, including estat, predict, mfx, test,
testnl, etc. Many of these take on different forms as a function of the distributional family
chosen, so consult your manuals.
Beyond Stata, there are a host of other programs out there that will estimate GLMs. The
major ones are:

• R (with the built-in glm function in the stats package),

• SAS (with proc genmod), and

• SPSS (with the GENLIN procedure).

GLIM, Genstat, LispStat, and many others are free-standing packages that will estimate
GLMs.
References
One could easily teach an entire course on GLMs – and, depending on one’s perspective,
that’s exactly what I’ve been doing. There are, as a result, a large number of books on the
subject. Here are a few, with my own thoughts on them:
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models, 2nd Ed. London: Chap-
man & Hall.
This is the “bible” of GLMs. It’s expensive, a bit dated, and the examples aren’t exactly
compelling to most social scientists (e.g., the length of time blood takes to clot, or growth
rates of Neurospora crassa fungi), but it’s still a great reference.
Hardin, James W., and Joseph W. Hilbe. 2007. Generalized Linear Models and Extensions,
2nd Ed. College Station, TX: Stata Press.
If you’re going to use Stata to do GLMs, this might be the one to have. It’s recent, written
in a relatively clear notation, and has profuse examples. The down-side is its cost, and its
weddedness to Stata for examples, etc.
Dobson, Annette J. 2001. An Introduction to Generalized Linear Models, 2nd Ed. London:
Chapman & Hall.
A relatively concise, not-too-technical introduction to GLMs. It’s also more-or-less current,
includes chapters on a few related methods (like HLMs), and – best of all – can be had in
paperback for relatively little money.
Gill, Jeff. 2005. Generalized Linear Models: A Unified Approach. Thousand Oaks, CA: Sage.
This is the “little green book” on GLMs. For a Sage monograph, it’s nicely done, with good
coverage and lots of nice examples (all of which are available on the author’s website). And,
as always, the price is right.