POLS 7050
Spring 2009
April 23, 2009
Pulling it All Together: Generalized Linear Models
Throughout the course, we’ve treated each of the models we’ve discussed as a discrete entity. In
fact, however, most of those models can be considered under a unified
framework, as examples of a class of models known as generalized linear models (usually
abbreviated GLMs).
To understand GLMs, we first have to get a handle on the exponential family of probability
distributions. Once we have that, we can show how a bunch of the models we’ve been
learning about are special cases of the more general class of GLMs.
The Exponential Family
We’ll start with a random variable Z, which has realizations called z. We are typically
interested in the conditional density (PDF) of Z, where the conditioning is on some parameter
or parameters ψ:

f(z|ψ) = Pr(Z = z|ψ)
A PDF is a member of the exponential family of probability functions¹ if it can be written in
the form:

f(z|ψ) = r(z)s(ψ) exp[q(z)h(ψ)] (1)
where r(·) and q(·) are functions of z that do not depend on ψ, and (conversely) s(·) and h(·)
are functions of ψ that do not depend on z. For reasons that will be apparent in a minute,
it is also necessary that r(z) > 0 and s(ψ) > 0.
With a little bit of algebra, we can rewrite (1) as:
f(z|ψ) = exp[ln r(z) + ln s(ψ) + q(z)h(ψ)]. (2)
Gill dubs the first two components the “additive” part, and the third the “interactive”
bit. Here, z will be the “response” variable, and we’ll use ψ to introduce covariates. It’s
important to emphasize that any PDF that can be written in this way is a member of the
exponential family; as we’ll see, this encompasses a really broad range of interesting densities.
¹ This presentation owes a good deal to Jeff Gill’s exposition of GLMs at the Boston area JASA meeting, Feb. 28, 1998.
“Canonical” Forms
The “canonical” form of the distribution results when q(z) = z. This means that we can
always “get to” the canonical form by transforming Z into Y according to y = q(z). In
canonical form, then, Y = Z is our “response” variable; in such circumstances, h(ψ) is
referred to as the natural parameter of the distribution. We can write

θ = h(ψ)

and recognize that, in some instances, h(ψ) = ψ (so that θ = ψ); the function h(·) is known as the
“link” function. Writing things this way allows us to collect terms from (2) into a simpler formulation:

f(y|θ) = exp[yθ − b(θ) + c(y)]. (3)
Here,

• b(θ) – sometimes called a “normalizing constant” – is a function solely of the parameter(s) θ,

• c(y) is a function solely of y, the (potentially transformed) response variable, and

• yθ is a multiplicative term between the response and the parameter(s).
Some Familiar Exponential Family Examples
If all of this seems a bit abstract, it shouldn’t. The key point to remember is that any
probability distribution that can be written in the forms in (1) - (3) is a member of the
exponential family. Consider three quick examples.
First, we know that if a variable Y follows a Poisson distribution, we can write its density
as:

f(y|λ) = exp(−λ)λ^y / y! . (4)

Rearranging a bit, we can rewrite (4) as:

f(y|λ) = exp{ln[exp(−λ)λ^y / y!]}
       = exp[y ln(λ) − λ − ln(y!)] (5)

Here, we have θ = ln(λ) and (equivalently) λ = b(θ) = exp(θ), while c(y) = −ln(y!). The
Poisson is thus an example of a one-parameter exponential family distribution.
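As a quick numerical check of (4) and (5) – a minimal sketch, where the particular values of λ and y are arbitrary choices – Stata’s built-in Poisson probability function and the exponential-family form should agree:

* Poisson probability via the built-in function
scalar lambda = 2.5
scalar y = 3
display poissonp(lambda, y)

* the same probability via the exponential-family form in (5)
display exp(y*ln(lambda) - lambda - lnfactorial(y))

Both display commands should print the same probability.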
Distributional parameters that are outside of the formulation in (1) - (3) are usually called
nuisance parameters. They are typically not of central interest to researchers, though we
still must treat them in the exponential family formulation. A common kind of nuisance
parameter is a “scale parameter” – a parameter that defines the scale of the response variable
(that is, it “stretches” or “shrinks” the axis of that variable). We can incorporate such a
parameter (call it φ) through a slight generalization of (3):
f(y|θ, φ) = exp{[yθ − b(θ)] / a(φ) + c(y, φ)} (6)
When no scale parameter is present, a(φ) = 1 and (6) reverts to (3). We can illustrate such
a distribution by considering our old friend, the normal distribution:
f(y|µ, σ²) = (1 / √(2πσ²)) exp[−(y − µ)² / (2σ²)] (7)
In the normal, σ² can be thought of as a scale parameter – it defines the “scale” (variance)
of Y. We can rewrite (7) in exponential form as:
f(y|µ, σ²) = exp[−(1/2)ln(2πσ²) − (1/(2σ²))(y² − 2yµ + µ²)]

           = exp[−(1/2)ln(2πσ²) − (1/(2σ²))y² + (1/(2σ²))2yµ − (1/(2σ²))µ²]

           = exp[yµ/σ² − µ²/(2σ²) − y²/(2σ²) − (1/2)ln(2πσ²)]

           = exp{(yµ − µ²/2)/σ² − (1/2)[y²/σ² + ln(2πσ²)]} (8)
Here, θ = µ, which means that:

• yθ = yµ,

• b(θ) = µ²/2,

• a(φ) = σ² is the scale (nuisance) parameter, and

• c(y, φ) = −(1/2)[y²/σ² + ln(2πσ²)].
The Normal distribution is thus a member of the exponential family, and is in canonical form
as well. Other distributions that are canonical members of the family include the binomial
(with the logit of p as its canonical parameter), the gamma, and the inverse Gaussian. Non-canonical
family members include the lognormal and Weibull distributions.
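The same kind of numerical check works for the normal decomposition in (8) – again a minimal sketch, with arbitrary values for y, µ, and σ²:

* normal density via the built-in function (which takes the standard deviation)
scalar mu = 1
scalar s2 = 4
scalar y = 0.5
display normalden(y, mu, sqrt(s2))

* the same density via the exponential-family form in (8)
display exp((y*mu - mu^2/2)/s2 - 0.5*(y^2/s2 + ln(2*_pi*s2)))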
Little Red Likelihood
To estimate and make inferences about the parameters of interest, we can derive a general
likelihood for the class of models expressible as (6). That (log-)likelihood is:
ln L(θ, φ|y) = ln f(y|θ, φ)

             = ln(exp{[yθ − b(θ)]/a(φ) + c(y, φ)})

             = [yθ − b(θ)]/a(φ) + c(y, φ). (9)
This expresses the log-likelihood of the parameters in terms of the data. To obtain estimates,
we maximize this function in the usual way: by taking its first derivative with respect to the
parameter(s) of interest θ, setting it equal to zero, and solving. That derivative is:

∂ln L(θ, φ|y)/∂θ ≡ S = ∂/∂θ {[yθ − b(θ)]/a(φ) + c(y, φ)}

                     = [y − ∂b(θ)/∂θ] / a(φ). (10)
In GLM-speak, the quantity in (10) – the partial derivative of the log-likelihood with respect
to the parameter(s) of interest – is typically called the “score function.” It has a number of
important properties:

• S is a sufficient statistic for θ.

• E(S) = 0.

• Var(S) ≡ I(θ) = E[(S)²|θ], because E(S) = 0. This is known as the Fisher information
(hence, I(θ)).
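As a worked instance of these properties, consider the Poisson results from above, where a(φ) = 1 and b(θ) = exp(θ). The score is

S = y − ∂exp(θ)/∂θ = y − exp(θ) = y − λ,

and taking expectations gives E(S) = E(Y) − λ = 0, just as the second property requires.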
Finally, assuming general regularity conditions (which all exponential family distributions
meet), it is the case that the conditional expectation of Y is just:

E(Y) = ∂b(θ)/∂θ (11)

and

Var(Y) = a(φ) × ∂²b(θ)/∂θ² (12)
(see, e.g., McCullagh and Nelder 1989, pp. 26-29 for a derivation). This means that for
exponential family distributions we can get the mean of Y directly from b(θ), and the
variance of Y from b(θ) and a(φ). So, for our Poisson example, (11) and (12) mean that:
E(Y) = ∂exp(θ)/∂θ
     = exp(θ)|θ=ln(λ)
     = λ

and

Var(Y) = 1 × ∂²exp(θ)/∂θ² |θ=ln(λ)
       = exp[ln(λ)]
       = λ
which is exactly what we’d expect for the Poisson. Likewise, from our normal example above,
(11) and (12) imply that:
E(Y) = ∂(θ²/2)/∂θ
     = θ|θ=µ
     = µ

and

Var(Y) = σ² × ∂²(θ²/2)/∂θ²
       = σ² × ∂θ/∂θ
       = σ²
There are lots more interesting things about exponential family distributions, but we’ll skip
them in the interest of saving time.
Generalized Linear Models
What does all this have to do with regression-like models? To answer that, let’s return to
our old friend, the continuous linear-normal regression model:
Y_i = X_iβ + u_i (13)

where Y_i is the N × 1 vector for the response/“dependent” variable, X_i is the N × k matrix
of covariates, β is a k × 1 vector of parameters, and u_i is an N × 1 vector of i.i.d. Normal
errors with mean zero and constant variance σ². In this standard formulation, we usually
think of X_iβ as the “systematic component” of Y_i, such that

E(Y_i) ≡ µ_i = X_iβ (14)
In this setup, the variable Y is distributed Normally, with mean µ_i and variance σ², and
we can estimate the parameters β (and σ²) in the usual way.
The “Generalized” Part
We can generalize (14) by allowing the expected value of Y to vary with Xβ according to a
function g(·):
g(µ_i) = X_iβ. (15)

g(·) is generally referred to as a “link function,” because its purpose is to “link” the expected
value of Y_i to X_iβ in a general, flexible way. We require g(·) to be continuously
twice-differentiable (that is, “smooth”) and monotonic in µ.
Importantly, note that we are not transforming the value of Y itself, but rather its expected
value. In fact, in this formulation, X_iβ informs a function of E(Y_i), not E(Y_i) itself. In
this way, g[E(Y_i)] is required to meet all the usual Gauss-Markov assumptions (linearity,
homoscedasticity, normality, etc.) but E(Y_i) itself need not do so.
It is useful to write

η_i = X_iβ

where η_i is often known as the “linear predictor” (or “index”); this means that we can rewrite
(15) as:

η_i = g(µ_i) (16)

Of course, this also means that we can invert the function g(·) to express µ_i as a function
of X_iβ:

µ_i = g⁻¹(η_i)
    = g⁻¹(X_iβ) (17)
Pulling all this together, we can think of a GLM as having three key components:
1. The random component is the stochastic part of Y; it is distributed according to some
exponential family distribution with mean E(Y_i) = µ_i.

2. The systematic component is the covariates and their associated parameters: η_i = X_iβ.

3. The link between the systematic and random parts of the model: g(µ_i) = η_i and,
equivalently, g⁻¹(η_i) = µ_i.
The “generalized” part of GLM thus comes from two things. First, unlike the classical linear
regression model, the distribution of Y need not be Normal, and in fact can come from any
member of the exponential family we discussed above. Second, the link function in (15) need
not be an identity function, but instead can be any monotone, differentiable function.
Bringing In the Exponential Family
As we outlined above, we can think of distributions from the exponential family as having
a canonical parameter θ. In a GLM, we link the systematic part of the model (that is, the
covariates) to that parameter through the link function. Generically:
θ_i = g(µ_i)
    = η_i
    = X_iβ

which defines the expected value of Y. This means that, by setting θ_i = η_i, we are allowing
the transformed mean of the distribution to vary linearly in the covariates. In this light, one
can imagine extending the same idea to any of the distributions in the exponential family,
through the link function g⁻¹(·):

g⁻¹(θ_i) = µ_i (18)
The link function that makes the linear predictor the same as the parameter θ is known as
the canonical link function. A useful aspect of the canonical link function is that it ensures
that there is a sufficient statistic of rank k (that is, the number of parameters in β).
Some Brief Examples
Linear-Normal
As a first example, consider data where Y is a continuous, unbounded, and normally-distributed
variable. In such an instance, the natural conditional distribution to choose
is the Normal distribution with mean µ and variance σ²:

f(y|µ, σ²) = N(µ, σ²)

Similarly, because there is no need to restrict the effect of X on Y, the simplest (and most
natural) link function is the identity function:

µ_i = η_i.

This combination of link function and distributional family leads to a model with:

µ_i ≡ θ_i = η_i

Y_i ∼ N(µ_i, σ²)
that is, the standard linear-normal regression model.
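We can verify this equivalence in Stata – a minimal sketch using the bundled auto dataset, where the particular variables are arbitrary choices:

* OLS and the Gaussian/identity GLM produce the same estimates
sysuse auto, clear
regress mpg weight
glm mpg weight, family(gaussian) link(identity)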
Binary
Now suppose instead that our response variable Y is binary. The natural distributional
family for a binary variable is the Bernoulli (that is, a binomial with n = 1):

f(y|π) = π^y (1 − π)^(1−y).

The canonical link function for the Bernoulli (and, in fact, for the binomial in general) is the
logit:

θ_i = ln[µ_i / (1 − µ_i)]
This restricts the possible outcomes to fall in the [0, 1] interval, and thus restricts the impact
of X as well. It yields a model with:
µ_i = g⁻¹(θ_i)
    = exp(η_i) / [1 + exp(η_i)]

Y_i ∼ Bernoulli(µ_i)
that is, a standard logit model.
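The same check works for the logit – again a minimal sketch on the bundled auto data:

* logit and the binomial/logit GLM produce the same estimates
sysuse auto, clear
logit foreign mpg
glm foreign mpg, family(binomial) link(logit)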
Count/Poisson
Finally, consider the case where Y_i is an event count. As we discussed some time ago, the
natural distribution for a count of conditionally independent events is a Poisson:

f(y|λ) = exp(−λ)λ^y / y!

The canonical parameter of the Poisson is θ = ln(λ); because we require the mean count λ to be strictly
positive, the natural (and, as it happens, canonical) link function is the one which preserves
the positivity of λ, the log:

ln(λ_i) = η_i
which in turn yields a model with:

µ_i = g⁻¹(θ_i)
    = exp(η_i)

Y_i ∼ Poisson(λ_i)
that is, a Poisson model.
Table 1 lists several commonly-used GLMs, including their canonical link functions.
The biggest point here is that a large number of models – including the classical linear
regression model, logit/probit/cloglog models for binary responses, and Poisson and negative
binomial models for event counts – all fall under the GLM rubric. In addition, GLMs
also encompass a number of models (such as the gamma and the inverse Gaussian, both
for positive-continuous outcomes) that we have not discussed in class. Finally, the GLM
framework is a very general one, in that one can mix-and-match marginal distributions and
link functions to fit specific data problems; we’ll work through an example of this below.
Table 1: Common Flavors of GLMs

Distribution   Range of Y            Link(s)                                 Inverse Link
Normal         (−∞, ∞)               Identity: θ = µ (Canonical)             θ
Binomial       [0, 1]                Logit: θ = ln[µ/(1 − µ)] (Canonical)    exp(θ)/[1 + exp(θ)]
                                     Probit: θ = Φ⁻¹(µ)                      Φ(θ)
                                     C-Log-Log: θ = ln[−ln(1 − µ)]           1 − exp[−exp(θ)]
Poisson        [0, ∞) (integers)     Log: θ = ln(µ) (Canonical)              exp(θ)
Gamma          (0, ∞)                Reciprocal: θ = 1/µ (Canonical)         1/θ
Some Examples
As a practical matter, then, GLMs involve three steps:

1. Selecting a distribution f(Y) for the stochastic component (from the family of exponential
distributions) that best reflects the underlying data-generating process of Y.

2. Selecting a link function g(·) appropriate for the nature of the response Y.

3. Specifying the model – that is, choosing variables for inclusion in X.
I won’t go into detail about parameter estimation in the context of GLMs; suffice it to say
that the more-or-less standard methods we’ve been using all semester can be used here as
well. In addition, it is also possible to estimate GLMs using iteratively reweighted least
squares (IRLS); see McCullagh and Nelder (1989, pp. 40-42). This has the advantage of
being equivalent to MLE while often being computationally easier.
For now, we’ll illustrate the equivalence of GLMs to many of the models we’ve become
familiar with, using the data from last week’s homework exercise:
. su
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
amici | 7161 .8419215 2.189712 0 39
us | 0
year | 7157 71.93014 9.20792 53 86
fedparty | 7161 .3742494 .4839623 0 1
multlaw | 7156 .1489659 .3560797 0 1
-------------+--------------------------------------------------------
conflict | 7161 .0646558 .2459346 0 1
constit | 7161 .2535959 .4350993 0 1
Replicating OLS, Logit, Probit, and Poisson
The handout shows how one can use the glm command in Stata to obtain results exactly
like those we get using regress, logit, probit, and poisson, respectively; see pp. 2-5
of the handout. I won’t go into a lot of detail on these, since there
isn’t much to say, except to note that:
1. Technically, glm yields exactly the same results as regress, logit, etc. only in those
instances where the link function is the canonical one. In other instances – using a
probit or c-log-log link with a binary response, for example – the standard errors are
only asymptotically equivalent. As a practical matter, this is usually not a big issue;
it certainly isn’t in the example data, where the N is large. (A quick simulated
illustration of the canonical-link case appears after this list.)
2. There are also some special considerations to think about when using glm to estimate
negative binomial models (not shown). Read more on this in the Stata manuals, or in
one or more of the references below, if it is something you care about.
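Here is a minimal sketch of the canonical-link equivalence on simulated data (the data-generating values are arbitrary, and this is not the homework dataset):

* simulate a Poisson response with a known linear predictor
clear
set seed 7050
set obs 5000
generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.25*x))

* the dedicated ML command and the canonical-link GLM match exactly
poisson y x
glm y x, family(poisson) link(log)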
An Extension: Models of Supreme Court Coalition Size
GLMs can also do things that none of the models heretofore discussed can do. To take
an example, consider the question of the size of a majority opinion coalition on the U.S.
Supreme Court. We know that:
• To constitute a majority, the opinion coalition must be made up of a number of justices
equal to or greater than half of the justices who heard the case. However,

• Not every case is heard and voted on by every justice – sometimes there are vacancies,
sometimes justices recuse themselves for various reasons, and sometimes both occur.

As a practical matter, then, majority coalition sizes range from three (in 3-2 decisions, say,
where four justices are absent or recused) to nine (in 9-0 unanimous cases). But the size of the
majority coalition is always at least partially determined by the number of justices voting in
the case.
Suppose we define N_i as the number of justices voting in case i, and denote the size of the
majority coalition in that case as M_i. Then it makes some sense² to think of M_i as a binomial
function:

f(M_i | N_i, p_i) ∝ p_i^(M_i) (1 − p_i)^(N_i − M_i)

² To be brutally honest, this is not completely right. What we’d really want is a binomial with a lower limit at the integer value of 0.5N_i, since we’re only looking at majority coalitions. But this will do for pedagogical purposes.
We can estimate a model of a variable like M_i using a GLM framework. In particular, we can
estimate a model of M_i where N_i is allowed to vary across i, and where p_i – the probability
of a marginal justice joining the majority coalition – is allowed to vary systematically with
covariates X_i. By choosing the binomial distribution from the exponential family, and a
logit link (the canonical link for the binomial distribution, and a natural choice, given that
p_i is a probability), we get a model which is:

p_i = exp(X_iβ) / [1 + exp(X_iβ)]

M_i ∼ Binomial(N_i, p_i)
We’ll estimate this GLM on data on the U.S. Supreme Court between 1953 and 2005, focusing
on three covariates:

• term, the term in which the decision was handed down,

• alt_prec, a dummy variable indicating whether (=1) or not (=0) the decision in
question altered existing Supreme Court precedent, and

• unconst, an indicator of whether (=1) or not (=0) the decision formally declared a
federal, state, or local statute to be unconstitutional.
Summary statistics on the data look like this:
. su nmajority nvoting term alt_prec unconst
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
nmajority | 12491 7.045633 1.474143 3 9
nvoting | 12580 8.585453 .754199 4 9
term | 12583 1977.332 13.03965 1953 2005
alt_prec | 12583 .0168481 .1287074 0 1
unconst | 12583 .0665978 .249334 0 1
The results of the glm estimation are:
. glm nmajority term alt_prec unconst, family(binomial nvoting) link(logit) irls
Generalized linear models No. of obs = 12491
Optimization : MQL Fisher scoring Residual df = 12487
(IRLS EIM) Scale parameter = 1
Deviance = 25766.28517 (1/df) Deviance = 2.063449
Pearson = 20745.56508 (1/df) Pearson = 1.661373
Variance function: V(u) = u*(1-u/nvoting) [Binomial]
Link function : g(u) = ln(u/(nvoting-u)) [Logit]
BIC = -92020.63
------------------------------------------------------------------------------
| EIM
nmajority | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
term | -.0014731 .0006069 -2.43 0.015 -.0026626 -.0002837
alt_prec | -.2685583 .0560735 -4.79 0.000 -.3784603 -.1586563
unconst | -.2798177 .0294079 -9.52 0.000 -.3374563 -.2221792
_cons | 4.457874 1.200241 3.71 0.000 2.105444 6.810304
------------------------------------------------------------------------------
Note that we specify the number of “trials” in the binomial in the family() option to the
glm command; here, that is equal to the variable nvoting. The results suggest that:

• The sizes of majority coalitions declined (on average) over the past 50 years, and

• Both precedent-altering decisions and those declaring statutes unconstitutional garnered
smaller average majority coalitions than decisions that did not do so. This is
unsurprising, given that such decisions are more likely to be controversial, and therefore
less likely to command large majorities or a unanimous Court.
Of course, the biggest appeal of this model is that the underlying statistical structure is an
especially good “fit” to the data-generating process.
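Post-estimation tools work here as well. For instance, a minimal sketch (run immediately after the glm command above; mhat is a hypothetical variable name) recovers fitted coalition sizes:

* fitted values on the response scale: E(M_i) = nvoting x invlogit(X_i b)
predict mhat, mu
summarize nmajority mhat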
Software, Further Reading, etc.
Software
As the examples illustrate, one can do GLMs all day long using Stata. Its glm command is
very flexible; the main options are:

• family(), which includes:
  – gaussian (that is, normal),
  – igaussian (inverse Gaussian),
  – binomial (Bernoulli/binomial),
  – poisson (Poisson, duh...),
  – nbinomial (negative binomial), and
  – gamma (gamma).

• link(), which includes:
  – identity (µ = η),
  – log (ln(µ) = η),
  – logit (ln[µ/(1 − µ)] = η),
  – probit (Φ⁻¹(µ) = η),
  – cloglog (ln[−ln(1 − µ)] = η),
  – power (µ^ℓ = η, with ℓ ∈ {..., −2, −1, 0, 1, 2, ...}),
  – opower (“odds power,” {[µ/(1 − µ)]^ℓ − 1}/ℓ = η, with ℓ as in power),
  – nbinomial (negative binomial, ln[µ/(µ + k)] = η, with k = 1 or user-defined),
  – loglog (−ln[−ln(µ)] = η), and
  – logc (“log-complement,” ln(1 − µ) = η).
One can also specify an exposure() or offset() variable, where appropriate, “robust”
variance-covariance estimates, the method of maximization (IRLS or MLE), and other options
specific to the distributions in question. Finally, specifying the eform option tells
Stata to report exp(β̂) rather than β̂; this is handy when the exponentiated coefficients take
on useful forms (as is the case when the model is a logit, or a Poisson or negative binomial).
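For example – a minimal sketch on the bundled auto data, with an arbitrary covariate – adding eform to a binomial/logit GLM reports odds ratios rather than log-odds coefficients:

* same binomial/logit GLM as in the earlier sketch, but the reported
* quantities are now odds ratios, exp(b)
sysuse auto, clear
glm foreign mpg, family(binomial) link(logit) eform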
The Stata help for glm helpfully notes the following:
If you specify both family() and link(), note that not all combinations make
sense. You may choose from the following combinations:
| id log logit probit clog pow opower nbinomial loglog logc
----------+-------------------------------------------------------------------
Gaussian | x x x
inv. Gau. | x x x
binomial | x x x x x x x x x
Poisson | x x x
neg. bin. | x x x x
gamma | x x x
Post-estimation, one can do all the “usual” things, including estat, predict, mfx, test,
testnl, etc. Many of these take on different forms as a function of the distributional family
chosen, so consult your manuals.
Beyond Stata, there are a host of other programs out there that will estimate GLMs. The
major ones are:

• R (with the built-in glm function in the stats package),

• SAS (with proc genmod), and

• SPSS (with the GENLIN procedure).

GLIM, Genstat, LispStat, and many others are free-standing packages that will estimate
GLMs.
References
One could easily teach an entire course on GLMs – and, depending on one’s perspective,
that’s exactly what I’ve been doing. There are, as a result, a large number of books on the
subject. Here are a few, with my own thoughts on them:
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models, 2nd Ed. London: Chap-
man & Hall.
This is the “bible” of GLMs. It’s expensive, a bit dated, and the examples aren’t exactly
compelling to most social scientists (e.g., the length of time blood takes to clot, or growth
rates of Neurospora crassa fungi), but it’s still a great reference.
Hardin, James W., and Joseph W. Hilbe. 2007. Generalized Linear Models and Extensions,
2nd Ed. College Station, TX: Stata Press.
If you’re going to use Stata to do GLMs, this might be the one to have. It’s recent, written
in a relatively clear notation, and has profuse examples. The down-side is its cost, and its
weddedness to Stata for examples, etc.
Dobson, Annette J. 2001. An Introduction to Generalized Linear Models, 2nd Ed. London:
Chapman & Hall.
A relatively concise, not-too-technical introduction to GLMs. It’s also more-or-less current,
includes chapters on a few related methods (like HLMs), and – best of all – can be had in
paperback for relatively little money.
Gill, Jeff. 2005. Generalized Linear Models: A Unified Approach. Thousand Oaks, CA: Sage.
This is the “little green book” on GLMs. For a Sage monograph, it’s nicely done, with good
coverage and lots of nice examples (all of which are available on the author’s website). And,
as always, the price is right.