Math 541: Statistical Theory II

Sufficient Statistics and Exponential Family

Lecturer: Songfeng Zheng

1 Statistics and Sufficient Statistics

Suppose we have a random sample $X_1, \cdots, X_n$ taken from a distribution $f(x|\theta)$ which depends on an unknown parameter $\theta$ in a parameter space $\Theta$. The purpose of parameter estimation is to estimate the parameter $\theta$ from the random sample.

We have already studied three parameter estimation methods: the method of moments, maximum likelihood, and Bayes estimation. We can see from the previous examples that the estimators can be expressed as functions of the random sample $X_1, \cdots, X_n$. Such a function is called a statistic.

Formally, any real-valued function $T = r(X_1, \cdots, X_n)$ of the observations in the sample is called a statistic. This function must not involve any unknown parameter.

For example, suppose we have a random sample $X_1, \cdots, X_n$; then $\bar{X}$, $\max(X_1, \cdots, X_n)$, $\mathrm{median}(X_1, \cdots, X_n)$, and $r(X_1, \cdots, X_n) = 4$ are statistics; however, $X_1 + \mu$ is not a statistic if $\mu$ is unknown.

For the parameter estimation problem, we know nothing about the parameter except the observations from the distribution. Therefore, the observations $X_1, \cdots, X_n$ are our first-hand source of information about the parameter; that is to say, all the available information about the parameter is contained in the observations. However, the estimators we obtain are always functions of the observations, i.e., the estimators are statistics, e.g. the sample mean, the sample standard deviation, etc. In some sense, this process can be thought of as “compressing” the original observation data: initially we have $n$ numbers, but after this “compression” we have only one number. This “compression” can lose information about the parameter, and it can never give us more information. The best case is that the “compression” result contains the same amount of information as the $n$ observations. We call such a statistic a sufficient statistic. From the above intuitive analysis, we can see that a sufficient statistic “absorbs” all the available information about $\theta$ contained in the sample. This concept was introduced by R. A. Fisher in 1922.

If $T(X_1, \cdots, X_n)$ is a statistic and $t$ is a particular value of $T$, then the conditional joint distribution of $X_1, \cdots, X_n$ given that $T = t$ can be calculated. In general, this joint conditional distribution will depend on the value of $\theta$. Therefore, for each value of $t$, there will be a family of possible conditional distributions corresponding to the different possible values of $\theta \in \Theta$. However, it may happen that for each possible value of $t$, the conditional joint distribution of $X_1, \cdots, X_n$ given that $T = t$ is the same for all values of $\theta \in \Theta$ and therefore does not actually depend on the value of $\theta$. In this case, we say that $T$ is a sufficient statistic for the parameter $\theta$.

Formally, a statistic $T(X_1, \cdots, X_n)$ is said to be sufficient for $\theta$ if the conditional distribution of $X_1, \cdots, X_n$, given $T = t$, does not depend on $\theta$ for any value of $t$.

In other words, given the value of $T$, we can gain no more knowledge about $\theta$ from knowing more about the probability distribution of $X_1, \cdots, X_n$. We could envision keeping only $T$ and throwing away all the $X_i$ without losing any information!

The concept of sufficiency arises as an attempt to answer the following question: Is there a statistic, i.e. a function $T(X_1, \cdots, X_n)$, that contains all the information in the sample about $\theta$? If so, a reduction or compression of the original data to this statistic without loss of information is possible. For example, consider a sequence of independent Bernoulli trials with unknown probability of success $\theta$. We may have the intuitive feeling that the total number of successes contains all the information about $\theta$ that is in the sample, and that the order in which the successes occurred, for example, does not give any additional information about $\theta$.

Example 1: Let $X_1, \cdots, X_n$ be a sequence of independent Bernoulli trials with $P(X_i = 1) = \theta$. We will verify that $T = \sum_{i=1}^n X_i$ is sufficient for $\theta$.

Proof: For any $x_1, \cdots, x_n$ with $\sum_{i=1}^n x_i = t$, we have

$$P(X_1 = x_1, \cdots, X_n = x_n \mid T = t) = \frac{P(X_1 = x_1, \cdots, X_n = x_n)}{P(T = t)}$$

Bearing in mind that the $X_i$ can take only the values 0 or 1, the probability in the numerator is the probability that some particular set of $t$ of the $X_i$ are equal to 1 and the other $n - t$ are equal to 0. Since the $X_i$ are independent, the probability of this is $\theta^t (1 - \theta)^{n-t}$. To find the denominator, note that the distribution of $T$, the total number of ones, is binomial with $n$ trials and probability of success $\theta$. Therefore the ratio in the above equation is

$$\frac{\theta^t (1 - \theta)^{n-t}}{\binom{n}{t} \theta^t (1 - \theta)^{n-t}} = \frac{1}{\binom{n}{t}}$$

The conditional distribution thus does not involve θ at all. Given the total number of ones,

the probability that they occur on any particular set of t trials is the same for any value of

θ so that set of trials contains no additional information about θ.
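The calculation in Example 1 can be checked by brute force. The following is a minimal sketch, not part of the original notes: the function name `conditional_prob`, the sample `x`, and the chosen values of θ are illustrative assumptions. It enumerates all 0/1 sequences of length n to obtain P(T = t) and confirms that the conditional probability is the same for every θ tried.

```python
from itertools import product
from math import comb, isclose

def conditional_prob(x, theta):
    """P(X = x | T = sum(x)) for independent Bernoulli(theta) trials, by enumeration."""
    n, t = len(x), sum(x)

    def joint(y):
        # Probability of one particular 0/1 sequence y.
        s = sum(y)
        return theta**s * (1 - theta)**(len(y) - s)

    # P(T = t): add up the joint probability of every sequence with exactly t ones.
    p_t = sum(joint(y) for y in product((0, 1), repeat=n) if sum(y) == t)
    return joint(x) / p_t

x = (1, 0, 1, 1, 0)  # hypothetical sample with n = 5 and t = 3
probs = [conditional_prob(x, theta) for theta in (0.2, 0.5, 0.9)]
print(probs)              # the same value for every theta
print(1 / comb(5, 3))     # the value 1 / C(n, t) derived above
```

Enumeration is exponential in n, so this is only meant as a sanity check for small samples.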

2 Factorization Theorem

The preceding definition of sufficiency is hard to work with, because it does not indicate how to go about finding a sufficient statistic, and given a candidate statistic $T$, it would typically be very hard to conclude whether it is a sufficient statistic because of the difficulty in evaluating the conditional distribution.

We shall now present a simple method for finding a sufficient statistic which can be applied in many problems. This method is based on the following result, which was developed with increasing generality by R. A. Fisher in 1922, J. Neyman in 1935, and P. R. Halmos and L. J. Savage in 1949. This result is known as the Factorization Theorem.

Factorization Theorem: Let $X_1, \cdots, X_n$ form a random sample from either a continuous distribution or a discrete distribution for which the pdf or the probability mass function is $f(x|\theta)$, where the value of $\theta$ is unknown and belongs to a given parameter space $\Theta$. A statistic $T(X_1, \cdots, X_n)$ is a sufficient statistic for $\theta$ if and only if the joint pdf or the joint probability mass function $f_n(x|\theta)$ of $X_1, \cdots, X_n$ can be factorized as follows for all values of $x = (x_1, \cdots, x_n) \in \mathbb{R}^n$ and all values of $\theta \in \Theta$:

$$f_n(x|\theta) = u(x)\, v[T(x), \theta].$$

Here, the functions $u$ and $v$ are nonnegative, the function $u$ may depend on $x$ but does not depend on $\theta$, and the function $v$ depends on $\theta$ but will depend on the observed value $x$ only through the value of the statistic $T(x)$.

Note: In this expression, we can see that the statistic $T(X_1, \cdots, X_n)$ is like an “interface” between the random sample $X_1, \cdots, X_n$ and the function $v$.

Proof: We give a proof for the discrete case. First, suppose the frequency function can be factored in the given form. We let $X = (X_1, \cdots, X_n)$ and $x = (x_1, \cdots, x_n)$; then

$$P(T = t) = \sum_{x: T(x) = t} P(X = x) = \sum_{x: T(x) = t} u(x)\, v(T(x), \theta) = v(t, \theta) \sum_{x: T(x) = t} u(x)$$

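Then, for any $x$ with $T(x) = t$, we have

$$P(X = x \mid T = t) = \frac{P(X = x, T = t)}{P(T = t)} = \frac{u(x)\, v(t, \theta)}{v(t, \theta) \sum_{x: T(x) = t} u(x)} = \frac{u(x)}{\sum_{x: T(x) = t} u(x)}$$

which does not depend on $\theta$, and therefore $T$ is a sufficient statistic.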

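Conversely, suppose the conditional distribution of $X$ given $T$ does not depend on $\theta$, that is, $T$ is a sufficient statistic. Then we can let $u(x) = P(X = x \mid T = t, \theta)$, which is free of $\theta$ by sufficiency, and let $v(t, \theta) = P(T = t \mid \theta)$. It follows that

$$\begin{aligned}
P(X = x \mid \theta) &= P(X = x, T = t \mid \theta) \qquad \text{(true because } T(x) = t \text{, so the event } T = t \text{ is redundant)} \\
&= P(X = x \mid T = t, \theta)\, P(T = t \mid \theta) = u(x)\, v(T(x_1, \cdots, x_n), \theta)
\end{aligned}$$

which is of the desired form.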

Example 2: Suppose that $X_1, \cdots, X_n$ form a random sample from a Poisson distribution for which the value of the mean $\theta$ is unknown ($\theta > 0$). Show that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\theta$.

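Proof: For every set of nonnegative integers $x_1, \cdots, x_n$, the joint probability mass function $f_n(x|\theta)$ of $X_1, \cdots, X_n$ is as follows:

$$f_n(x|\theta) = \prod_{i=1}^n \frac{e^{-\theta} \theta^{x_i}}{x_i!} = \left( \prod_{i=1}^n \frac{1}{x_i!} \right) e^{-n\theta}\, \theta^{\sum_{i=1}^n x_i}$$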

It can be seen that $f_n(x|\theta)$ has been expressed as the product of a function that does not depend on $\theta$ and a function that depends on $\theta$ but depends on the observed vector $x$ only through the value of $\sum_{i=1}^n x_i$. By the factorization theorem, it follows that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\theta$.
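The factorization in Example 2 can also be checked numerically. The sketch below is an illustration only: the helper names `joint_pmf`, `u`, `v` and the two samples are assumptions made for this example. It verifies that the joint pmf equals u(x)·v(Σxᵢ, θ), and that for two samples with the same sum the ratio of their joint pmfs is free of θ.

```python
from math import exp, factorial, isclose, prod

def joint_pmf(x, theta):
    """Joint Poisson pmf: product of e^{-theta} * theta^{x_i} / x_i!."""
    return prod(exp(-theta) * theta**xi / factorial(xi) for xi in x)

def u(x):
    """Factor that does not involve theta."""
    return 1 / prod(factorial(xi) for xi in x)

def v(t, n, theta):
    """Factor that sees the data only through t = sum(x); n is the sample size."""
    return exp(-n * theta) * theta**t

x = [2, 0, 3, 1]   # hypothetical sample, sum = 6
y = [1, 1, 2, 2]   # a different sample with the same sum
thetas = (0.5, 1.7, 4.0)

for theta in thetas:
    assert isclose(joint_pmf(x, theta), u(x) * v(sum(x), len(x), theta))

# With equal sums, the v factors cancel and only u(x) / u(y) remains.
ratios = [joint_pmf(x, th) / joint_pmf(y, th) for th in thetas]
print(ratios)
```

Because the ratio reduces to u(x)/u(y), observing x rather than y says nothing further about θ once the sum is known.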

Example 3: Applying the Factorization Theorem to a Continuous Distribution.

Suppose that $X_1, \cdots, X_n$ form a random sample from a continuous distribution with the following p.d.f.:

$$f(x|\theta) = \begin{cases} \theta x^{\theta - 1}, & \text{for } 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$$

It is assumed that the value of the parameter $\theta$ is unknown ($\theta > 0$). We shall show that $T = \prod_{i=1}^n X_i$ is a sufficient statistic for $\theta$.

Proof: For $0 < x_i < 1$ ($i = 1, \cdots, n$), the joint p.d.f. $f_n(x|\theta)$ of $X_1, \cdots, X_n$ is as follows:

$$f_n(x|\theta) = \theta^n \left( \prod_{i=1}^n x_i \right)^{\theta - 1}$$

Furthermore, if at least one value of $x_i$ is outside the interval $0 < x_i < 1$, then $f_n(x|\theta) = 0$ for every value of $\theta \in \Theta$. The right side of the above equation depends on $x$ only through the value of the product $\prod_{i=1}^n x_i$. Therefore, if we let $u(x) = 1$ and $r(x) = \prod_{i=1}^n x_i$, then $f_n(x|\theta)$ can be considered to be factored in the form specified by the factorization theorem. It follows from the factorization theorem that the statistic $T = \prod_{i=1}^n X_i$ is a sufficient statistic for $\theta$.

Example 4: Suppose that $X_1, \cdots, X_n$ form a random sample from a normal distribution for which the mean $\mu$ is unknown but the variance $\sigma^2$ is known. Find a sufficient statistic for $\mu$.

Solution: For $x = (x_1, \cdots, x_n)$, the joint pdf of $X_1, \cdots, X_n$ is

$$f_n(x|\mu) = \prod_{i=1}^n \frac{1}{(2\pi)^{1/2} \sigma} \exp\left[ -\frac{(x_i - \mu)^2}{2\sigma^2} \right]$$

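This could be rewritten as

$$f_n(x|\mu) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left( -\frac{\sum_{i=1}^n x_i^2}{2\sigma^2} \right) \exp\left( \frac{\mu}{\sigma^2} \sum_{i=1}^n x_i - \frac{n\mu^2}{2\sigma^2} \right)$$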

It can be seen that $f_n(x|\mu)$ has now been expressed as the product of a function that does not depend on $\mu$ and a function that depends on $x$ only through the value of $\sum_{i=1}^n x_i$. It follows from the factorization theorem that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\mu$.

Since $\sum_{i=1}^n x_i = n\bar{x}$, we can state equivalently that the final expression depends on $x$ only through the value of $\bar{x}$; therefore $\bar{X}$ is also a sufficient statistic for $\mu$. More generally, every one-to-one function of $\bar{X}$ will be a sufficient statistic for $\mu$.
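The rewriting step in Example 4 rests on expanding the square in the exponent. Written out as a short derivation (a worked step added for clarity, using the same symbols as the example):

```latex
\begin{align*}
\sum_{i=1}^n (x_i - \mu)^2
  &= \sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2, \\
-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2
  &= -\frac{1}{2\sigma^2} \sum_{i=1}^n x_i^2
     + \frac{\mu}{\sigma^2} \sum_{i=1}^n x_i
     - \frac{n\mu^2}{2\sigma^2}.
\end{align*}
```

The first term on the right is free of $\mu$ and goes into $u(x)$; the remaining two terms involve the data only through $\sum_{i=1}^n x_i$ and go into $v$.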

Property of sufficient statistic: Suppose that $X_1, \cdots, X_n$ form a random sample from a distribution for which the p.d.f. is $f(x|\theta)$, where the value of the parameter $\theta$ belongs to a given parameter space $\Theta$. Suppose that $T(X_1, \cdots, X_n)$ and $T'(X_1, \cdots, X_n)$ are two statistics, and there is a one-to-one map between $T$ and $T'$; that is, the value of $T'$ can be determined from the value of $T$ without knowing the values of $X_1, \cdots, X_n$, and the value of $T$ can be determined from the value of $T'$ without knowing the values of $X_1, \cdots, X_n$. Then $T'$ is a sufficient statistic for $\theta$ if and only if $T$ is a sufficient statistic for $\theta$.

Proof: Suppose the one-to-one mapping between $T$ and $T'$ is $g$, i.e. $T' = g(T)$ and $T = g^{-1}(T')$, and $g^{-1}$ is also one-to-one. By the factorization theorem, $T$ is a sufficient statistic if and only if the joint pdf $f_n(x|\theta)$ can be factorized as

$$f_n(x|\theta) = u(x)\, v[T(x), \theta]$$

and this can be written as

$$u(x)\, v[T(x), \theta] = u(x)\, v[g^{-1}(T'(x)), \theta] = u(x)\, v'[T'(x), \theta]$$

where $v'[t', \theta] = v[g^{-1}(t'), \theta]$. Therefore the joint pdf can be factorized as $u(x)\, v'[T'(x), \theta]$, and by the factorization theorem, $T'$ is a sufficient statistic.

For instance, in Example 4, we showed that for the normal distribution, both $T_1 = \sum_{i=1}^n X_i$ and $T_2 = \bar{X}$ are sufficient statistics, and there is a one-to-one mapping between $T_1$ and $T_2$: $T_2 = T_1 / n$. Other statistics like $\left( \sum_{i=1}^n X_i \right)^3$ and $\exp\left( \sum_{i=1}^n X_i \right)$ are also sufficient statistics.

Example 5: Suppose that $X_1, \cdots, X_n$ form a random sample from a beta distribution with parameters $\alpha$ and $\beta$, where the value of $\alpha$ is known and the value of $\beta$ is unknown ($\beta > 0$). Show that the following statistic $T$ is a sufficient statistic for the parameter $\beta$:

$$T = \sum_{i=1}^n \log \frac{1}{1 - X_i}$$

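Proof: The p.d.f. $f(x|\beta)$ of each individual observation $X_i$ is

$$f(x|\beta) = \begin{cases} \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, & \text{for } 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$$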

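Therefore, the joint p.d.f. $f_n(x|\beta)$ of $X_1, \cdots, X_n$ is

$$f_n(x|\beta) = \prod_{i=1}^n \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x_i^{\alpha - 1} (1 - x_i)^{\beta - 1} = \Gamma(\alpha)^{-n} \left( \prod_{i=1}^n x_i \right)^{\alpha - 1} \times \left[ \frac{\Gamma(\alpha + \beta)^n}{\Gamma(\beta)^n} \left( \prod_{i=1}^n (1 - x_i) \right)^{\beta - 1} \right]$$
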
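We define $T'(X_1, \cdots, X_n) = \prod_{i=1}^n (1 - X_i)$, and because $\alpha$ is known, we can define

$$u(x) = \Gamma(\alpha)^{-n} \left( \prod_{i=1}^n x_i \right)^{\alpha - 1}$$

$$v(T', \beta) = \frac{\Gamma(\alpha + \beta)^n}{\Gamma(\beta)^n}\, T'(x_1, \cdots, x_n)^{\beta - 1}$$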

We can see that the function $v$ depends on $x$ only through $T'$; therefore $T'$ is a sufficient statistic.

It is easy to see that, with $T' = \prod_{i=1}^n (1 - X_i)$,

$$T = g(T') = -\log(T')$$

and the function $g$ is a one-to-one mapping. Therefore $T$ is a sufficient statistic.

Example 6: Sampling from a Uniform Distribution. Suppose that $X_1, \cdots, X_n$ form a random sample from a uniform distribution on the interval $[0, \theta]$, where the value of the parameter $\theta$ is unknown ($\theta > 0$). We shall show that $T = \max(X_1, \cdots, X_n)$ is a sufficient statistic for $\theta$.

Proof: The p.d.f. $f(x|\theta)$ of each individual observation $X_i$ is

$$f(x|\theta) = \begin{cases} \dfrac{1}{\theta}, & \text{for } 0 \le x \le \theta \\ 0, & \text{otherwise} \end{cases}$$

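Therefore, the joint p.d.f. $f_n(x|\theta)$ of $X_1, \cdots, X_n$ is

$$f_n(x|\theta) = \begin{cases} \dfrac{1}{\theta^n}, & \text{for } 0 \le x_i \le \theta \ (i = 1, \cdots, n) \\ 0, & \text{otherwise} \end{cases}$$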

It can be seen that if $x_i < 0$ for at least one value of $i$ ($i = 1, \cdots, n$), then $f_n(x|\theta) = 0$ for every value of $\theta > 0$. Therefore it is only necessary to consider the factorization of $f_n(x|\theta)$ for values of $x_i \ge 0$ ($i = 1, \cdots, n$).

Define $h[\max(x_1, \cdots, x_n), \theta]$ as

$$h[\max(x_1, \cdots, x_n), \theta] = \begin{cases} 1, & \text{if } \max(x_1, \cdots, x_n) \le \theta \\ 0, & \text{if } \max(x_1, \cdots, x_n) > \theta \end{cases}$$

Also, $x_i \le \theta$ for $i = 1, \cdots, n$ if and only if $\max(x_1, \cdots, x_n) \le \theta$. Therefore, for $x_i \ge 0$ ($i = 1, \cdots, n$), we can rewrite $f_n(x|\theta)$ as follows:

$$f_n(x|\theta) = \frac{1}{\theta^n}\, h[\max(x_1, \cdots, x_n), \theta].$$

Since the right side depends on $x$ only through the value of $\max(x_1, \cdots, x_n)$, it follows that $T = \max(X_1, \cdots, X_n)$ is a sufficient statistic for $\theta$. According to the property of sufficient statistics, any one-to-one function of $T$ is a sufficient statistic as well.
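The uniform example can be illustrated with a small numerical check. This is a sketch under stated assumptions: the function names `joint_pdf` and `factored` and the two samples are hypothetical and chosen only for illustration. Two samples with nonnegative entries and the same maximum give the same joint density at every θ, and that density matches the factored form θ⁻ⁿ·h[max(x), θ].

```python
def joint_pdf(x, theta):
    """Joint pdf of a sample from Uniform[0, theta]: theta^(-n) on [0, theta]^n, else 0."""
    if all(0 <= xi <= theta for xi in x):
        return theta ** (-len(x))
    return 0.0

def factored(x, theta):
    """theta^(-n) * h[max(x), theta]; valid when every x_i >= 0."""
    h = 1.0 if max(x) <= theta else 0.0
    return theta ** (-len(x)) * h

a = [0.3, 2.5, 1.1]   # hypothetical sample
b = [2.5, 0.1, 2.4]   # different values, same maximum
thetas = (1.0, 2.5, 3.0, 10.0)

same = all(joint_pdf(a, th) == joint_pdf(b, th) == factored(a, th) for th in thetas)
print(same)
```

Values of θ below the sample maximum give density zero, which is how the data rule out small values of θ.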

Example 7: Suppose that $X_1, X_2, \cdots, X_n$ are i.i.d. random variables on the interval $[0, 1]$ with the density function

$$f(x|\alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} [x(1 - x)]^{\alpha - 1}$$

where $\alpha > 0$ is a parameter to be estimated from the sample. Find a sufficient statistic for $\alpha$.

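Solution: The joint density function of $x_1, \cdots, x_n$ is

$$f(x_1, \cdots, x_n|\alpha) = \prod_{i=1}^n \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} [x_i(1 - x_i)]^{\alpha - 1} = \frac{\Gamma(2\alpha)^n}{\Gamma(\alpha)^{2n}} \left[ \prod_{i=1}^n x_i(1 - x_i) \right]^{\alpha - 1}$$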

Comparing with the form in the factorization theorem,

$$f(x_1, \cdots, x_n|\theta) = u(x_1, \cdots, x_n)\, v[T(x_1, \cdots, x_n), \theta]$$

we see that, with $\theta = \alpha$, $T = \prod_{i=1}^n X_i(1 - X_i)$, $v(t, \theta) = \frac{\Gamma(2\alpha)^n}{\Gamma(\alpha)^{2n}} t^{\alpha - 1}$, and $u(x_1, \cdots, x_n) = 1$, i.e. $v$ depends on $x_1, \cdots, x_n$ only through $t$. By the factorization theorem, $T = \prod_{i=1}^n X_i(1 - X_i)$ is a sufficient statistic. According to the property of sufficient statistics, any one-to-one function of $T$ is a sufficient statistic as well.

3 Sufficient Statistics and Estimators

We know that estimators are statistics. In particular, we want the estimator we obtain to be a sufficient statistic, since we want the estimator to absorb all the available information contained in the sample.

Suppose that $X_1, \cdots, X_n$ form a random sample from a distribution for which the pdf or probability mass function is $f(x|\theta)$, where the value of the parameter $\theta$ is unknown. We assume there is a sufficient statistic for $\theta$, which is $T(X_1, \cdots, X_n)$. We will show that the MLE of $\theta$, $\hat{\theta}$, depends on $X_1, \cdots, X_n$ only through the statistic $T$.

It follows from the factorization theorem that the likelihood function $f_n(x|\theta)$ can be written as

$$f_n(x|\theta) = u(x)\, v[T(x), \theta].$$

We know that the MLE $\hat{\theta}$ is the value of $\theta$ for which $f_n(x|\theta)$ is maximized. Since $u(x)$ does not depend on $\theta$ and is positive for the observed sample, it follows that $\hat{\theta}$ will be the value of $\theta$ for which $v[T(x), \theta]$ is maximized. Since $v[T(x), \theta]$ depends on $x$ only through the function $T(x)$, it follows that $\hat{\theta}$ will depend on $x$ only through the function $T(x)$. Thus the MLE is a function of the sufficient statistic $T(X_1, \cdots, X_n)$.

In many problems, the MLE $\hat{\theta}$ is actually a sufficient statistic. For instance, Example 1 shows that a sufficient statistic for the success probability in Bernoulli trials is $\sum_{i=1}^n X_i$, and we know the MLE for $\theta$ is $\bar{X}$; Example 4 shows that a sufficient statistic for $\mu$ in the normal distribution is $\bar{X}$, and this is the MLE for $\mu$; Example 6 shows that a sufficient statistic for $\theta$ in the uniform distribution on $(0, \theta)$ is $\max(X_1, \cdots, X_n)$, and this is the MLE for $\theta$.
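As an illustration of the MLE depending on the data only through a sufficient statistic, consider the density θx^(θ−1) on (0, 1) from Example 3, where T is the product of the observations. The sketch below is not from the original notes: the closed form −n / log t is obtained here by setting the derivative of log v to zero, and the function names and samples are hypothetical. Two different samples with the same product yield the same MLE.

```python
from math import isclose, log, prod

def log_v(t, n, theta):
    """log of v[t, theta] = theta^n * t^(theta - 1), where t = prod(x)."""
    return n * log(theta) + (theta - 1) * log(t)

def mle_from_t(t, n):
    """Maximizer of log_v over theta > 0: solves n / theta + log(t) = 0."""
    return -n / log(t)

a = [0.2, 0.6, 0.75]   # hypothetical sample, product 0.09
b = [0.3, 0.5, 0.6]    # different values, same product 0.09
theta_a = mle_from_t(prod(a), len(a))
theta_b = mle_from_t(prod(b), len(b))
print(theta_a, theta_b)   # equal up to floating-point rounding
```

The function `mle_from_t` never sees the individual observations, only n and the value of T, which is exactly the point of the argument above.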

The above discussion for the MLE also holds for the Bayes estimator. Let $\theta$ be a parameter with parameter space $\Theta$ equal to an interval of real numbers (possibly unbounded), and assume that the prior distribution for $\theta$ is $p(\theta)$. Let $X$ have p.d.f. $f(x|\theta)$ conditional on $\theta$. Suppose we have a random sample $X_1, \cdots, X_n$ from $f(x|\theta)$. Let $T(X_1, \cdots, X_n)$ be a sufficient statistic.

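We first show that the posterior p.d.f. of $\theta$ given $X = x$ depends on $x$ only through $T(x)$. The likelihood term is $f_n(x|\theta)$; according to Bayes' formula, the posterior distribution for $\theta$ is

$$f(\theta|x) = \frac{f_n(x|\theta)\, p(\theta)}{\int_\Theta f_n(x|\theta)\, p(\theta)\, d\theta} = \frac{u(x)\, v(T(x), \theta)\, p(\theta)}{\int_\Theta u(x)\, v(T(x), \theta)\, p(\theta)\, d\theta} = \frac{v(T(x), \theta)\, p(\theta)}{\int_\Theta v(T(x), \theta)\, p(\theta)\, d\theta}$$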

where the second step uses the factorization theorem. We can see that the posterior p.d.f.

of θ given X = x depends on x only through T (x).

Since the Bayes estimator of $\theta$ with respect to a specified loss function is calculated from this posterior p.d.f., the estimator also will depend on the observed vector $x$ only through the value of $T(x)$. In other words, the Bayes estimator is a function of the sufficient statistic $T(X_1, \cdots, X_n)$.

Summarizing our discussion above, both the MLE and the Bayes estimator are functions of the sufficient statistic, and therefore they absorb all the available information contained in the sample at hand.
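The Bayes part of this argument can be sketched numerically for Bernoulli data. Everything specific here is an assumption made for illustration: the uniform prior on (0, 1), the midpoint grid, the function name `posterior_mean`, and the two samples. The posterior mean (the Bayes estimator under squared-error loss) is approximated on a grid, and two samples with the same number of successes give the same answer.

```python
from math import isclose

def posterior_mean(x, grid_size=2000):
    """Posterior mean of theta for Bernoulli data under a uniform prior on (0, 1),
    approximated on a midpoint grid."""
    n, t = len(x), sum(x)
    thetas = [(k + 0.5) / grid_size for k in range(grid_size)]
    # Likelihood times flat prior; u(x) = 1 here, so this is v[T(x), theta] * p(theta).
    weights = [th**t * (1 - th)**(n - t) for th in thetas]
    return sum(th * w for th, w in zip(thetas, weights)) / sum(weights)

a = [1, 1, 0, 1, 0]   # hypothetical sample with 3 successes
b = [0, 1, 1, 0, 1]   # same number of successes in a different order
print(posterior_mean(a), posterior_mean(b))
```

With a uniform prior the exact posterior is Beta(t + 1, n − t + 1), whose mean is (t + 1)/(n + 2), so the grid values can be compared against 4/7 for these samples.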

4 Exponential Family of Probability Distributions

A study of the properties of probability distributions that have sufficient statistics of the same dimension as the parameter space regardless of the sample size led to the development of what is called the exponential family of probability distributions. Many common distributions, including the normal, the binomial, the Poisson, and the gamma, are members of this family.

One-parameter members of the exponential family have density or mass function of the form

f(x|θ) = exp[c(θ)T (x) + d(θ) + S(x)]

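Suppose that $X_1, \cdots, X_n$ are i.i.d. samples from a member of the exponential family; then the joint probability function is

$$\begin{aligned}
f(x|\theta) &= \prod_{i=1}^n \exp[c(\theta) T(x_i) + d(\theta) + S(x_i)] \\
&= \exp\left[ c(\theta) \sum_{i=1}^n T(x_i) + n\, d(\theta) \right] \exp\left[ \sum_{i=1}^n S(x_i) \right]
\end{aligned}$$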

From this result, it is apparent by the factorization theorem that $\sum_{i=1}^n T(X_i)$ is a sufficient statistic.

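Example 8: The frequency function of the Bernoulli distribution is

$$\begin{aligned}
P(X = x) &= \theta^x (1 - \theta)^{1 - x}, \qquad x = 0 \text{ or } x = 1 \\
&= \exp\left[ x \log\left( \frac{\theta}{1 - \theta} \right) + \log(1 - \theta) \right] \qquad (1)
\end{aligned}$$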

It can be seen that this is a member of the exponential family with $T(x) = x$, and we can also see that $\sum_{i=1}^n X_i$ is a sufficient statistic, which is the same as in Example 1.
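The exponential family form of the Bernoulli frequency function can be confirmed numerically. This is a minimal sketch; the function names and the test values of θ are illustrative assumptions. It compares θˣ(1 − θ)¹⁻ˣ with exp[c(θ)T(x) + d(θ) + S(x)], taking c(θ) = log(θ/(1 − θ)), T(x) = x, d(θ) = log(1 − θ), and S(x) = 0.

```python
from math import exp, isclose, log

def bernoulli_pmf(x, theta):
    """Direct form theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta**x * (1 - theta)**(1 - x)

def exp_family_form(x, theta):
    """exp[c(theta) * T(x) + d(theta) + S(x)] with T(x) = x and S(x) = 0."""
    c = log(theta / (1 - theta))
    d = log(1 - theta)
    return exp(c * x + d)

checks = [isclose(bernoulli_pmf(x, th), exp_family_form(x, th))
          for x in (0, 1) for th in (0.1, 0.5, 0.83)]
print(all(checks))
```

The check requires 0 < θ < 1, since c(θ) is undefined at the endpoints.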

Example 9: Suppose that $X_1, X_2, \cdots, X_n$ are i.i.d. random variables on the interval $[0, 1]$ with the density function

$$f(x|\alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} [x(1 - x)]^{\alpha - 1}$$

where $\alpha > 0$ is a parameter to be estimated from the sample. Find a sufficient statistic for $\alpha$ by verifying that this distribution belongs to the exponential family.

Solution: The density function is

$$f(x|\alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} [x(1 - x)]^{\alpha - 1} = \exp\{ \log\Gamma(2\alpha) - 2\log\Gamma(\alpha) + \alpha \log[x(1 - x)] - \log[x(1 - x)] \}$$

Comparing to the form of exponential family,

T (x) = log[x(1 − x)]; c(α) = α; S(x) = − log[x(1 − x)]; d(α) = log Γ(2α) − 2 log Γ(α)

Therefore, $f(x|\alpha)$ belongs to the exponential family. Then the sufficient statistic is

$$\sum_{i=1}^n T(X_i) = \sum_{i=1}^n \log[X_i(1 - X_i)] = \log\left[ \prod_{i=1}^n X_i(1 - X_i) \right]$$

In Example 7, we found that the sufficient statistic was $\prod_{i=1}^n X_i(1 - X_i)$, which is different from the result here. But both of them are sufficient statistics because there is a one-to-one functional relationship between them: one is the logarithm of the other.

A $k$-parameter member of the exponential family has a density or frequency function of the form

$$f(x|\theta) = \exp\left[ \sum_{i=1}^k c_i(\theta) T_i(x) + d(\theta) + S(x) \right]$$

For example, the normal distribution, the gamma distribution (see Example 10 below), and the beta distribution are of this form.

Example 10: Show that the gamma distribution belongs to the exponential family.

Proof: The gamma distribution has density function

$$f(x|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \qquad 0 \le x < \infty$$

which can be written as

f(x|α, β) = exp {−βx + (α − 1) log x + α log β − log Γ(α)}

Comparing with the form of the exponential family

$$\exp\left\{ \sum_{i=1}^k c_i(\theta) T_i(x) + d(\theta) + S(x) \right\}$$

We see that the gamma distribution has the form of the exponential family with $c_1(\alpha, \beta) = -\beta$, $c_2(\alpha, \beta) = \alpha - 1$, $T_1(x) = x$, $T_2(x) = \log x$, $d(\alpha, \beta) = \alpha \log\beta - \log\Gamma(\alpha)$, and $S(x) = 0$. Therefore, the gamma distribution belongs to the exponential family.
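The identification in Example 10 can be checked numerically. The sketch below is illustrative only: the function names and the evaluation points are assumptions. It compares the gamma density with exp[c₁T₁(x) + c₂T₂(x) + d] for a few (x, α, β) triples, using `math.lgamma` for log Γ(α).

```python
from math import exp, gamma, isclose, lgamma, log

def gamma_pdf(x, alpha, beta):
    """Gamma density beta^alpha / Gamma(alpha) * x^(alpha - 1) * e^(-beta * x), x > 0."""
    return beta**alpha / gamma(alpha) * x**(alpha - 1) * exp(-beta * x)

def exp_family_form(x, alpha, beta):
    """exp[c_1 * T_1(x) + c_2 * T_2(x) + d] with S(x) = 0."""
    c1, c2 = -beta, alpha - 1
    t1, t2 = x, log(x)
    d = alpha * log(beta) - lgamma(alpha)
    return exp(c1 * t1 + c2 * t2 + d)

# Hypothetical (x, alpha, beta) triples used only for this check.
points = [(0.3, 2.0, 1.5), (1.0, 0.7, 3.0), (4.2, 5.0, 0.8)]
checks = [isclose(gamma_pdf(x, a, b), exp_family_form(x, a, b)) for x, a, b in points]
print(all(checks))
```

The point x = 0 is left out because T₂(x) = log x is undefined there.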

5 Exercises

Instructions for Exercises 1 to 4: In each of these exercises, assume that the random variables $X_1, \cdots, X_n$ form a random sample of size $n$ from the distribution specified in that exercise, and show that the statistic $T$ specified in the exercise is a sufficient statistic for the parameter:

Exercise 1: A normal distribution for which the mean $\mu$ is known and the variance $\sigma^2$ is unknown; $T = \sum_{i=1}^n (X_i - \mu)^2$.

Exercise 2: A gamma distribution with parameters $\alpha$ and $\beta$, where the value of $\beta$ is known and the value of $\alpha$ is unknown ($\alpha > 0$); $T = \prod_{i=1}^n X_i$.

Exercise 3: A uniform distribution on the interval $[a, b]$, where the value of $a$ is known and the value of $b$ is unknown ($b > a$); $T = \max(X_1, \cdots, X_n)$.

Exercise 4: A uniform distribution on the interval $[a, b]$, where the value of $b$ is known and the value of $a$ is unknown ($b > a$); $T = \min(X_1, \cdots, X_n)$.

Exercise 5: Suppose that $X_1, \cdots, X_n$ form a random sample from a gamma distribution with parameters $\alpha > 0$ and $\beta > 0$, and the value of $\beta$ is known. Show that the statistic $T = \sum_{i=1}^n \log X_i$ is a sufficient statistic for the parameter $\alpha$.

Exercise 6: The Pareto distribution has density function:

$$f(x|x_0, \theta) = \theta x_0^\theta x^{-\theta - 1}, \qquad x \ge x_0, \quad \theta > 1$$

Assume that $x_0 > 0$ is given and that $X_1, X_2, \cdots, X_n$ is an i.i.d. sample. Find a sufficient statistic for $\theta$ by (a) using the factorization theorem, (b) using the property of the exponential family. Are they the same? If not, why are both of them sufficient?

Exercise 7: Verify that the following are members of the exponential family:

a) Geometric distribution $p(x) = p^{x-1}(1 - p)$;

b) Poisson distribution $p(x) = \frac{\lambda^x}{x!} e^{-\lambda}$;

c) Normal distribution $N(\mu, \sigma^2)$;

d) Beta distribution.