SOCI208 - Module 12 - Inference for the Population Proportion
1. Population and Sample Proportions
1. Population Proportion
A population proportion, denoted p, is the proportion of
the elements of a population that have a particular characteristic of interest.
Example:
- the proportion of the U.S. population who believe in life after death
- the proportion of the U.S. population who believe in hell
- the proportion of voters intending to vote for Elizabeth Dole for a NC seat in the U.S. Senate in the next election
2. Sample Proportion
The number of sample observations with the characteristic of interest in
a simple random sample of size n is denoted by f and is called the sample
frequency of the characteristic.
The proportion of sample observations with the characteristic of interest
is denoted by p. and is called the sample proportion of the
characteristic.
The sample proportion and sample frequency are related by
p. = f/n
Example:
- the frequency of people who believe in life after death (adding answers YES, DEFINITELY and YES, PROBABLY) in the 1998 GSS is f = 907 out of n = 1127 responses to that question; thus p. = 907/1127 = .8048 or 80.48%
- the frequency of people who believe in hell (adding answers YES, DEFINITELY and YES, PROBABLY) in the 1998 GSS is f = 838 out of n = 1126 responses to that question; thus p. = 838/1126 = .7442 or 74.42%
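The sample proportions above can be reproduced with a short computation (a minimal sketch; the counts f and n are those reported from the 1998 GSS):

```python
# Sample proportion p. = f/n for the two 1998 GSS items reported above.
def sample_proportion(f, n):
    """Return the sample proportion f/n."""
    return f / n

p_afterlife = sample_proportion(907, 1127)  # belief in life after death
p_hell = sample_proportion(838, 1126)       # belief in hell

print(round(p_afterlife, 4))  # 0.8048
print(round(p_hell, 4))       # 0.7442
```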
2. Sampling Distribution of f and p.
1. The Sampling Distribution of f and p. is a Binomial Distribution!
Consider a simple random sample of size n from an infinite population of
elements that have a certain characteristic with probability p.
Then the outcome of any one observation in the sample is a Bernoulli
RV with value 1 if the element has the characteristic of interest, and
0 otherwise, and P(B = 1) = p, where p is the population proportion.
The n observations B1, B2, ..., Bn
are independent. The sample frequency f is the sum of these n independent
Bernoulli RVs so that
f = B1 + B2 + ... + Bn
Thus the sample frequency f is a binomial RV.
Therefore, when a simple random sample of size n is drawn from an infinite
population with population proportion p, the sample frequency f has a sampling
distribution given by the binomial probability function
P(f) = (n!/(f!(n - f)!)) p^f (1 - p)^(n - f)
where P(f) denotes the probability that f elements in the sample have the
given characteristic, f = 0, 1, ..., n.
Furthermore the sample proportion p. has the same probability distribution
as the corresponding f, because p. is simply f divided by a fixed constant
n.
The identity of the sampling distributions of f and p. is shown in
the next exhibit.
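The binomial probability function above can be computed directly; a minimal sketch using only the Python standard library:

```python
from math import comb

def binom_pmf(n, f, p):
    """P(f) = (n!/(f!(n-f)!)) p^f (1-p)^(n-f): the probability that f of
    the n sampled elements have the characteristic, for population
    proportion p."""
    return comb(n, f) * p**f * (1 - p)**(n - f)

# As with any probability function, the values over f = 0, 1, ..., n sum to 1.
total = sum(binom_pmf(7, f, 0.25) for f in range(8))
print(round(total, 10))  # 1.0
```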
Exhibit - (NWW Figure 13.1 p. 366) [m12001.gif]
2. Characteristics of the Sampling Distributions of f and p.
Mean, variance, and standard deviation of f and p. are shown in the following
table.
Table 1. Mean, Variance, and Standard Deviation of Sampling Distributions of f and p.

Sampling Distribution of f    | Sampling Distribution of p.
E{f} = np                     | E{p.} = p
s^2{f} = np(1 - p)            | s^2{p.} = p(1 - p)/n
s{f} = (np(1 - p))^(1/2)      | s{p.} = (p(1 - p)/n)^(1/2)
For derivations of E{p.} = p and s2{p.}
= p(1 - p)/n see NWW p. 367 (bottom).
Example: n = 7 and p = 0.25

E{f} = 7(0.25) = 1.75               | E{p.} = 0.25
s^2{f} = 7(0.25)(0.75) = 1.3125     | s^2{p.} = (0.25)(0.75)/7 = 0.026786
s{f} = 1.3125^(1/2) = 1.145644      | s{p.} = 0.026786^(1/2) = 0.163663
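The values in the example follow directly from the formulas in Table 1; a quick check for n = 7, p = 0.25 (a sketch using only the standard library):

```python
from math import sqrt

n, p = 7, 0.25
mean_f = n * p            # E{f} = np
var_f = n * p * (1 - p)   # s^2{f} = np(1 - p)
sd_f = sqrt(var_f)        # s{f}
mean_p = p                # E{p.} = p
var_p = p * (1 - p) / n   # s^2{p.} = p(1 - p)/n
sd_p = sqrt(var_p)        # s{p.}

print(mean_f, round(var_f, 4), round(sd_f, 6))  # 1.75 1.3125 1.145644
print(mean_p, round(var_p, 6), round(sd_p, 6))  # 0.25 0.026786 0.163663
```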
The shape of the sampling distribution of f and p. is
skewed to the right if p < 0.5
skewed to the left if p > 0.5
symmetrical if p = 0.5
3. Central Limit Theorem
Note that the sample proportion p. is a special case of the sample mean
X., since p. = f/n = (B1 + ... + Bn)/n, the mean
of the Bi.
Thus all the results for X. hold, including the CLT.
When the sampling distribution is skewed because p <> 0.5, the skewness
decreases as n increases, as seen in the next exhibit.
Exhibit - (NWW Figure 13.2 p. 368) [m12002.gif]
CLT for f and p. - The sampling distributions of f and p. are
approximately normal when the size n of the simple random sample is sufficiently
large.
Rule of thumb - The normal approximation is adequate when both np >=
5 and n(1 - p) >= 5.
4. Using the Normal Approximation for f and p.
1. Finding Probability Using the Normal Approximation for p.
When the rule of thumb for the normal approximation is satisfied, f and
p. are approximated by the standardized variables
Z = (f - E{f})/s{f} = (f - np)/(np(1 - p))^(1/2)
Z = (p. - E{p.})/s{p.} = (p. - p)/(p(1 - p)/n)^(1/2)
Example: Suppose a sample of n = 200 is drawn from an infinite population with p
= 0.7. One wants to know the probability that p. is between 0.65
and 0.75, using the normal approximation. Calculate
s{p.} = ((0.7)(0.3)/200)^(1/2) = 0.0324
Then the z values are
z = (0.65 - 0.70)/0.0324 = -1.54
z = (0.75 - 0.70)/0.0324 = +1.54
So that
P{0.65 <= p. <= 0.75} = P{-1.54 <= Z <= 1.54} = 0.876
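The normal approximation can be reproduced with the standard normal CDF, available in the standard library through math.erf (a sketch; carrying full precision in z gives 0.877 versus the 0.876 obtained from the rounded z = 1.54):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 200, 0.7
sd = sqrt(p * (1 - p) / n)           # s{p.} = 0.0324
prob = phi((0.75 - p) / sd) - phi((0.65 - p) / sd)
print(round(sd, 4), round(prob, 3))  # 0.0324 0.877
```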
Exhibit - (NWW Figure 13.3 p. 370) [m12003.gif]
2. Correction for Continuity
See NWW Section 13.2 pp. 370-371.
Exhibit - (NWW Figure 13.4 p. 371) [m12004.gif]
5. Sampling Finite Populations
All previous results apply also to finite populations if the sampling fraction
n/N is 5% or less (same rule as for sampling distribution of X.).
3. Interval Estimation of Population Proportion
1. CI for p - Large Sample
To construct a CI for p one needs an estimate of s{p.}.
The estimated variance and standard deviation of p. are calculated as
s^2{p.} = p.(1 - p.)/(n - 1)
s{p.} = (p.(1 - p.)/(n - 1))^(1/2)
The CI is based on an extension of the CLT stating that (p. - p)/s{p.}
is approximately ~N(0,1) when the size of the simple random sample is sufficiently
large.
Then, when the size of the simple random sample is sufficiently large,
the approximate 1 - a confidence limits for
p are
p. +/- z(1 - a/2)s{p.}
where s{p.} is given by the formula above.
Example (NWW p. 375): Fish Preference. A British food company
tests consumer preference for whiting versus cod. A sample of n =
265 consumers is given a blind taste test of both types of fish prepared
in the same fashion. Let f and p. denote the number and proportion
of consumers who prefer whiting over cod. It is found that f = 144
and p. = 144/265 = 0.5434. The company wants a 95% CI for p.
First calculate
s{p.} = ((0.5434)(1 - 0.5434)/264)^(1/2) = 0.0306567
Thus the confidence limits are
L = 0.5434 - (1.960)(0.0306567) = 0.4833
U = 0.5434 + (1.960)(0.0306567) = 0.6035
With 95% confidence the proportion of consumers who prefer whiting is between
48.33% and 60.35%.
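A sketch of the confidence-interval calculation for the fish preference example (z = 1.960 for 95% confidence; tiny differences from the text come only from rounding):

```python
from math import sqrt

f, n, z = 144, 265, 1.960
p_hat = f / n                            # p. = 0.5434
s = sqrt(p_hat * (1 - p_hat) / (n - 1))  # estimated s{p.} = 0.0306567
lower, upper = p_hat - z * s, p_hat + z * s
print(round(lower, 4), round(upper, 4))  # 0.4833 0.6035
```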
2. Planning of Sample Size
Under the same assumptions as before, given
- a chosen confidence coefficient 1 - a
- a desired half-width of the CI, denoted h
- a planning value for p (based on past experience; or choose p = .5 if p cannot be estimated more precisely, since p = .5 implies the maximum possible standard error and is thus conservative)
one can specify the sample size n required to achieve this degree of confidence
and precision as
n = (z^2 p(1 - p))/h^2
where
z = z(1 - a/2)
p is the planning value for the population proportion
Example (NWW p. 373): Worker Location. In a study of the spatial
distribution of the labor force in a large metropolitan area one wants
to estimate the proportion p of workers whose place of employment is within
15 miles of their residence. It is desired to select a random sample
of size n sufficient to provide a 95% CI for p with a half-width h = 0.02.
Based on other studies, a planning value of 0.9 is chosen.
Since 1 - a = 0.95, z(1 - a/2)
= z(0.975) = 1.960.
Thus
n = (1.96)^2(0.9)(0.1)/(0.02)^2 ≈ 864
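A sketch of the sample-size calculation (the raw value is 864.4; the text reports 864, though conservative practice would round up):

```python
z, p, h = 1.960, 0.9, 0.02  # confidence multiplier, planning value, half-width
# Required n = z^2 p(1 - p) / h^2 for a CI of half-width h.
n_raw = z**2 * p * (1 - p) / h**2
print(round(n_raw, 1))  # 864.4
```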
4. Tests for Population Proportion
When the size of the simple random sample is sufficiently large, the test
statistic
z* = (p. - p0)/s{p.}
where
s{p.} = (p0(1 - p0)/n)1/2
is approximately ~N(0,1) when p = p0.
Note that the estimate of s{p.} is based
on the hypothetical value p0.
Example: Fish Preference (cont'd). The British firm currently
uses cod. They would like to switch to whiting, which is cheaper,
unless the data show at the a = .05 level that more than 50% of consumers
prefer cod to whiting. With p representing
preference for whiting, the alternatives are
H0: p >= 0.5
H1: p < 0.5
Of 265 consumers in a blind taste test f = 144 preferred whiting to cod,
so that p. = 144/265 = 0.5434.
(At this point, one could already conclude H0 out of hand,
since p. > p0 makes a lower-tail rejection impossible, but continue for the sake of illustration.)
The test statistic is
z* = (p. - p)/s{p.} = (0.5434 - 0.5)/0.03071 = 1.41
This is a one-sided lower-tail test with P-value P{Z < 1.41} = 0.9207.
Since P-value 0.9207 > .05 one concludes H0: there is no
evidence that consumers prefer cod to whiting.
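A sketch of the test calculation, with s{p.} based on the hypothesized p0 = 0.5 as the text specifies:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

f, n, p0 = 144, 265, 0.5
p_hat = f / n
s = sqrt(p0 * (1 - p0) / n)    # s{p.} under H0: 0.03071
z_star = (p_hat - p0) / s      # test statistic: 1.41
p_value = phi(z_star)          # lower-tail test: P{Z < z*}
print(round(z_star, 2), round(p_value, 3))  # 1.41 0.921
```

Since the P-value far exceeds .05, one concludes H0, as in the text.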
Exhibit - (NWW Figure 13.5 p. 375) [m12005.gif]
5. Power and Planning Sample Size for Test About p
1. Type I and Type II Errors (Repeat from Module 11)
Types of Errors & Power of Test

Conclusion from Sample \ True Alternative | H0                | H1
------------------------------------------|-------------------|--------------------
H0                                        | Correct           | Type II Error
                                          | p = 1 - a         | "False Negative"
                                          |                   | p = b
H1                                        | Type I Error      | Correct
                                          | "False Positive"  | p = 1 - b
                                          | p = a             |   = Power of Test
2. Power of Test
1. Rejection Probability
The power of a test is the rejection probability P(H1|p) of
the test, the probability that the decision rule will lead to conclusion
H1 for a given value of the population proportion p. This
probability is a function of p.
Example: Fish Preference (cont'd). In this example p is the proportion
of consumers who prefer whiting to cod. The hypothesis tested was
H0: p >= 0.5; H1: p < 0.5.
What is the probability of rejecting H0 if the actual value
of p is, in fact, p = 0.35?
The situation is shown in the next exhibit.
Exhibit - (NWW Figure 13.6 p. 377) [m12006.gif]
First recover the value A of p corresponding to the critical value (action
limit) of the test in the Neyman-Pearson method. The standardized
critical value was z(0.05) = -1.645; p0 is .50; s{p.} was 0.03071.
Thus the critical value is
A = p0 + z(0.05)s{p.} = 0.5 - 1.645(0.03071) = 0.4495.
When p = 0.35 the sampling distribution of p. is centered at p = 0.35 and
its standard deviation is
s{p.} = (p(1 - p)/n)^(1/2) = (0.35(1 - 0.35)/265)^(1/2) = 0.02930
Note: s{p.}, being a function of p, must be
recalculated for each different value of p.
The desired rejection probability is the probability that p. (given
p = 0.35) lies in the rejection region of the test, i.e. to the left of
A = 0.4495. Relative to the sampling distribution for p = 0.35, p.
= 0.4495 corresponds to the standardized score
z = (0.4495 - 0.35)/0.02930 = 3.40
The rejection probability P(H1|p = 0.35) is the shaded area
under the sampling distribution of p. to the left of 3.40 so that
P(H1|p = 0.35) = P(Z < 3.40) = 0.9997
Thus if p were, in fact, 0.35 the test is "powerful" in the sense that
it would almost certainly conclude H1.
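The power calculation above can be sketched in a few lines (recall that s{p.} must be recomputed at the assumed true p):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p0, p_true = 265, 0.5, 0.35
# Action limit A of the lower-tail test at a = .05 (z(0.05) = -1.645).
A = p0 - 1.645 * sqrt(p0 * (1 - p0) / n)  # 0.4495
# s{p.} recalculated at the true p.
s = sqrt(p_true * (1 - p_true) / n)       # 0.02930
power = phi((A - p_true) / s)             # P(H1 | p = 0.35)
print(round(A, 4), round(power, 4))       # 0.4495 0.9997
```

Repeating this for a range of p values (recalculating s{p.} each time) traces out the power curve discussed next.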
2. Rejection Probability Curve aka Power Curve
One can calculate the rejection probability for many values of p in the
same way (recalculating s{p.} for each value
of p) and construct the rejection probability aka power curve shown
in the next exhibit. Note that for any value of p < .40 the probability
of concluding H1 is quite high, i.e. the test is powerful.
Exhibit - (NWW Figure 13.7 p. 378) [m12007.gif]
3. Planning of Sample Size
Using the notion of power one can determine the sample size needed to control
both the a and b
risks at prespecified (small) levels.
The assumptions of the planning procedure are, as before:
- the size of the simple random sample ultimately determined is reasonably large
- the population is infinite or, if finite, the sampling fraction is small
Notation:
- p0 is the value of p where the a risk is controlled (as before)
- p1 is the value of p where the b risk is controlled
- z0 is the z value associated with probability a
- z1 is the z value associated with probability b
Then the required random sample size to control both the a
and b risks is
n = (|z1|(p1(1 - p1))^(1/2) + |z0|(p0(1 - p0))^(1/2))^2 / (p1 - p0)^2
where for a
One-Sided Upper-Tail Test (H0: p <= p0, H1: p > p0): z0 = z(1 - a), z1 = z(b)
One-Sided Lower-Tail Test (H0: p >= p0, H1: p < p0): z0 = z(a), z1 = z(1 - b)
Two-Sided Test (H0: p = p0, H1: p <> p0): z0 = z(1 - a/2), z1 = z(b)
Example: Fish Preference (cont'd). Suppose the sample size has not
yet been determined. The sample size is determined so that (see next
exhibit)
- the a risk is controlled at .05 when p0 = 0.5
- if only 40% of consumers prefer whiting, the probability b of concluding H0 (p >= 0.5) is to be controlled at .05
Then, using the formula above, one finds (see NWW p. 380)
n = 265
which is the sample size actually used in that study.
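A sketch of this calculation for the lower-tail test, with a = b = .05 so that |z0| = |z1| = 1.645:

```python
from math import sqrt

p0, p1 = 0.5, 0.4      # values of p where the a and b risks are controlled
z0, z1 = 1.645, 1.645  # |z(a)| and |z(1 - b)| for a = b = .05
# n = (|z1| sqrt(p1(1-p1)) + |z0| sqrt(p0(1-p0)))^2 / (p1 - p0)^2
n_raw = (z1 * sqrt(p1 * (1 - p1)) + z0 * sqrt(p0 * (1 - p0)))**2 / (p1 - p0)**2
print(round(n_raw))  # 265
```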
Exhibit - (NWW Figure 13.8 p. 380) [m12008.gif]
Thus power considerations can be used in grant proposals to justify the
necessity of large samples, ahead of the actual study, and therefore justify
the generous funds requested for data collection and handling.
This is the spirit in which behavior geneticist Michael Neale (author
of structural equations program Mx) has coined the immortal aphorism
Power is Money
Last modified 25 Oct 2002