SOCI208 - Module 12 - Inference for the Population Proportion
1. Population and Sample Proportions
1. Population Proportion
A population proportion, denoted p, is the proportion of
the elements of a population that have a particular characteristic of interest.
Example:
- the proportion of the U.S. population who believe in life after death
- the proportion of the U.S. population who believe in hell
- the proportion of voters intending to vote for Elizabeth Dole for a NC seat in the U.S. Senate in the next election
2. Sample Proportion
The number of sample observations with the characteristic of interest in
a simple random sample of size n is denoted by f and is called the sample
frequency of the characteristic.
The proportion of sample observations with the characteristic of interest
is denoted by p. and is called the sample proportion of the
characteristic.
The sample proportion and sample frequency are related by
p. = f/n
Example:
- the frequency of people who believe in life after death (adding answers YES, DEFINITELY and YES, PROBABLY) in the 1998 GSS is f = 907 out of n = 1127 responses to that question; thus p. = 907/1127 = .8048 or 80.48%
- the frequency of people who believe in hell (adding answers YES, DEFINITELY and YES, PROBABLY) in the 1998 GSS is f = 838 out of n = 1126 responses to that question; thus p. = 838/1126 = .7442 or 74.42%
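The sample proportions above can be reproduced with a short computation (a minimal sketch; the counts f and n are those reported from the 1998 GSS):

```python
# Sample proportion p. = f/n for the two 1998 GSS items reported above.
def sample_proportion(f, n):
    """Return the sample proportion f/n."""
    return f / n

p_afterlife = sample_proportion(907, 1127)  # belief in life after death
p_hell = sample_proportion(838, 1126)       # belief in hell

print(round(p_afterlife, 4))  # 0.8048
print(round(p_hell, 4))       # 0.7442
```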
2. Sampling Distribution of f and p.
1. The Sampling Distribution of f and p. is a Binomial Distribution!
Consider a simple random sample of size n from an infinite population of
elements that have a certain characteristic with probability p.
Then the outcome of any one observation in the sample is a Bernoulli
RV with value 1 if the element has the characteristic of interest, and
0 otherwise, and P(B = 1) = p, where p is the population proportion.
The n observations B1, B2, ..., Bn
are independent. The sample frequency f is the sum of these n independent
Bernoulli RVs so that
f = B1 + B2 + ... + Bn
Thus the sample frequency f is a binomial RV.
Therefore, when a simple random sample of size n is drawn from an infinite
population with population proportion p, the sample frequency f has a sampling
distribution given by the binomial probability function
P(f) = (n!/(f!(n - f)!)) p^f (1 - p)^(n - f)
where P(f) denotes the probability that f elements in the sample have the
given characteristic, f = 0, 1, ..., n.
Furthermore the sample proportion p. has the same probability distribution
as the corresponding f, because p. is simply f divided by a fixed constant
n.
The identity of the sampling distributions of f and p. is shown in
the next exhibit.
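The binomial probability function above can be computed directly; a minimal sketch using only the Python standard library:

```python
from math import comb

def binom_pmf(n, f, p):
    """P(f) = (n!/(f!(n-f)!)) p^f (1-p)^(n-f): the probability that f of
    the n sampled elements have the characteristic, for population
    proportion p."""
    return comb(n, f) * p**f * (1 - p)**(n - f)

# As with any probability function, the values over f = 0, 1, ..., n sum to 1.
total = sum(binom_pmf(7, f, 0.25) for f in range(8))
print(round(total, 10))  # 1.0
```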
Exhibit - (NWW Figure 13.1 p. 366) [m12001.gif]
2. Characteristics of the Sampling Distributions of f and p.
Mean, variance, and standard deviation of f and p. are shown in the following
table.
Table 1. Mean, Variance, and Standard Deviation of Sampling Distributions of f and p.

Sampling Distribution of f    | Sampling Distribution of p.
E{f} = np                     | E{p.} = p
s^2{f} = np(1 - p)            | s^2{p.} = p(1 - p)/n
s{f} = (np(1 - p))^(1/2)      | s{p.} = (p(1 - p)/n)^(1/2)
For derivations of E{p.} = p and s2{p.}
= p(1 - p)/n see NWW p. 367 (bottom).
Example: n = 7 and p = 0.25

E{f} = 7(0.25) = 1.75               | E{p.} = 0.25
s^2{f} = 7(0.25)(0.75) = 1.3125     | s^2{p.} = (0.25)(0.75)/7 = 0.026786
s{f} = 1.3125^(1/2) = 1.145644      | s{p.} = 0.026786^(1/2) = 0.163663
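The values in the example follow directly from the formulas in Table 1; a quick check for n = 7, p = 0.25 (a sketch using only the standard library):

```python
from math import sqrt

n, p = 7, 0.25
mean_f = n * p            # E{f} = np
var_f = n * p * (1 - p)   # s^2{f} = np(1 - p)
sd_f = sqrt(var_f)        # s{f}
mean_p = p                # E{p.} = p
var_p = p * (1 - p) / n   # s^2{p.} = p(1 - p)/n
sd_p = sqrt(var_p)        # s{p.}

print(mean_f, round(var_f, 4), round(sd_f, 6))  # 1.75 1.3125 1.145644
print(mean_p, round(var_p, 6), round(sd_p, 6))  # 0.25 0.026786 0.163663
```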
The shape of the sampling distribution of f and p. is
skewed to the right if p < 0.5
skewed to the left if p > 0.5
symmetrical if p = 0.5
3. Central Limit Theorem
Note that the sample proportion p. is a special case of the sample mean
X., since p. = f/n = (B1 + ... + Bn)/n, the mean
of the Bi.
Thus all the results for X. hold, including the CLT.
When the sampling distribution is skewed because p <> 0.5, the skewness
decreases as n increases, as seen in the next exhibit.
Exhibit - (NWW Figure 13.2 p. 368) [m12002.gif]
CLT for f and p. - The sampling distributions of f and p. are
approximately normal when the size n of the simple random sample is sufficiently
large.
Rule of thumb - The normal approximation is adequate when both np >=
5 and n(1 - p) >= 5.
4. Using the Normal Approximation for f and p.
1. Finding Probability Using the Normal Approximation for p.
When the rule of thumb for the normal approximation is satisfied, f and
p. are approximated by the standardized variables
Z = (f - E{f})/s{f} = (f - np)/(np(1 - p))^(1/2)
Z = (p. - E{p.})/s{p.} = (p. - p)/(p(1 - p)/n)^(1/2)
Example: Suppose a sample of n = 200 is drawn from an infinite population with p
= 0.7. One wants to know the probability that p. is between 0.65
and 0.75, using the normal approximation. Calculate
s{p.} = ((0.7)(0.3)/200)^(1/2) = 0.0324
Then the z values are
z = (0.65 - 0.70)/0.0324 = -1.54
z = (0.75 - 0.70)/0.0324 = +1.54
So that
P{0.65 <= p. <= 0.75} = P{-1.54 <= Z <= 1.54} = 0.876
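The normal approximation can be reproduced with the standard normal CDF, available in the standard library through math.erf (a sketch; carrying full precision in z gives 0.877 versus the 0.876 obtained from the rounded z = 1.54):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 200, 0.7
sd = sqrt(p * (1 - p) / n)           # s{p.} = 0.0324
prob = phi((0.75 - p) / sd) - phi((0.65 - p) / sd)
print(round(sd, 4), round(prob, 3))  # 0.0324 0.877
```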
Exhibit - (NWW Figure 13.3 p. 370) [m12003.gif]
2. Correction for Continuity
See NWW Section 13.2 pp. 370-371.
Exhibit - (NWW Figure 13.4 p. 371) [m12004.gif]
5. Sampling Finite Populations
All previous results apply also to finite populations if the sampling fraction
n/N is 5% or less (same rule as for sampling distribution of X.).
3. Interval Estimation of Population Proportion
1. CI for p - Large Sample
To construct a CI for p one needs an estimate of s{p.}.
The estimated variance and standard deviation of p. are calculated as
s^2{p.} = p.(1 - p.)/(n - 1)
s{p.} = (p.(1 - p.)/(n - 1))^(1/2)
The CI is based on an extension of the CLT stating that (p. - p)/s{p.}
is approximately ~N(0,1) when the size of the simple random sample is sufficiently
large.
Then, when the size of the simple random sample is sufficiently large,
the approximate 1 - a confidence limits for
p are
p. +/- z(1 - a/2)s{p.}
where s{p.} is given by the formula above.
Example (NWW p. 375): Fish Preference. A British food company
tests consumer preference for whiting versus cod. A sample of n =
265 consumers is given a blind taste test of both types of fish prepared
in the same fashion. Let f and p. denote the number and proportion
of consumers who prefer whiting over cod. It is found that f = 144
and p. = 144/265 = 0.5434. The company wants a 95% CI for p.
First calculate
s{p.} = ((0.5434)(1 - 0.5434)/264)^(1/2) = 0.0306567
Thus the confidence limits are
L = 0.5434 - (1.960)(0.0306567) = 0.4833
U = 0.5434 + (1.960)(0.0306567) = 0.6035
With 95% confidence the proportion of consumers who prefer whiting is between
48.33% and 60.35%.
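A sketch of the confidence-interval calculation for the fish preference example (z = 1.960 for 95% confidence; tiny differences from the text come only from rounding):

```python
from math import sqrt

f, n, z = 144, 265, 1.960
p_hat = f / n                            # p. = 0.5434
s = sqrt(p_hat * (1 - p_hat) / (n - 1))  # estimated s{p.} = 0.0306567
lower, upper = p_hat - z * s, p_hat + z * s
print(round(lower, 4), round(upper, 4))  # 0.4833 0.6035
```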
2. Planning of Sample Size
Under the same assumptions as before, given
- a chosen confidence coefficient 1 - a
- a desired half-width of the CI, denoted h
- a planning value for p (based on past experience; or choose p = .5 if p cannot be estimated more precisely, since p = .5 implies the maximum possible standard error and is thus conservative)
one can specify the sample size n required to achieve this degree of confidence
and precision as
n = (z^2 p(1 - p))/h^2
where
z = z(1 - a/2)
p is the planning value for the population proportion
Example (NWW p. 373): Worker Location. In a study of the spatial
distribution of the labor force in a large metropolitan area one wants
to estimate the proportion p of workers whose place of employment is within
15 miles of their residence. It is desired to select a random sample
of size n sufficient to provide a 95% CI for p with a half-width h = 0.02.
Based on other studies, a planning value of 0.9 is chosen.
Since 1 - a = 0.95, z(1 - a/2)
= z(0.975) = 1.960.
Thus
n = (1.96)^2(0.9)(0.1)/(0.02)^2 ≈ 864
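A sketch of the sample-size calculation (the raw value is 864.4; the text reports 864, though conservative practice would round up):

```python
z, p, h = 1.960, 0.9, 0.02  # confidence multiplier, planning value, half-width
# Required n = z^2 p(1 - p) / h^2 for a CI of half-width h.
n_raw = z**2 * p * (1 - p) / h**2
print(round(n_raw, 1))  # 864.4
```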
4. Tests for Population Proportion
When the size of the simple random sample is sufficiently large, the test
statistic
z* = (p. - p0)/s{p.}
where
s{p.} = (p0(1 - p0)/n)1/2
is approximately ~N(0,1) when p = p0.
Note that the estimate of s{p.} is based
on the hypothetical value p0.
Example: Fish Preference (cont'd). The British firm currently
uses cod. They would like to switch to whiting, which is cheaper,
unless the data show at the a = .05 level that more than 50% of consumers
prefer cod to whiting. With p representing
preference for whiting, the alternatives are
H0: p >= 0.5
H1: p < 0.5
Of 265 consumers in a blind taste test f = 144 preferred whiting to cod,
so that p. = 144/265 = 0.5434.
(At this point, one could already conclude H0 out of hand,
since p. > p0 makes a lower-tail rejection impossible, but continue for the sake of illustration.)
The test statistic is
z* = (p. - p)/s{p.} = (0.5434 - 0.5)/0.03071 = 1.41
This is a one-sided lower-tail test with P-value P{Z < 1.41} = 0.9207.
Since P-value 0.9207 > .05 one concludes H0: there is no
evidence that consumers prefer cod to whiting.
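A sketch of the test calculation, with s{p.} based on the hypothesized p0 = 0.5 as the text specifies:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

f, n, p0 = 144, 265, 0.5
p_hat = f / n
s = sqrt(p0 * (1 - p0) / n)    # s{p.} under H0: 0.03071
z_star = (p_hat - p0) / s      # test statistic: 1.41
p_value = phi(z_star)          # lower-tail test: P{Z < z*}
print(round(z_star, 2), round(p_value, 3))  # 1.41 0.921
```

Since the P-value far exceeds .05, one concludes H0, as in the text.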
Exhibit - (NWW Figure 13.5 p. 375) [m12005.gif]
5. Power and Planning Sample Size for Test About p
1. Type I and Type II Errors (Repeat from Module 11)
Types of Errors & Power of Test

Conclusion from Sample \ True Alternative | H0                | H1
------------------------------------------|-------------------|--------------------
H0                                        | Correct           | Type II Error
                                          | p = 1 - a         | "False Negative"
                                          |                   | p = b
H1                                        | Type I Error      | Correct
                                          | "False Positive"  | p = 1 - b
                                          | p = a             |   = Power of Test
2. Power of Test
1. Rejection Probability
The power of a test is the rejection probability P(H1|p) of
the test, the probability that the decision rule will lead to conclusion
H1 for a given value of the population proportion p. This
probability is a function of p.
Example: Fish Preference (cont'd). In this example p is the proportion
of consumers who prefer whiting to cod. The hypothesis tested was
H0: p >= 0.5; H1: p < 0.5.
What is the probability of rejecting H0 if the actual value
of p is, in fact, p = 0.35?
The situation is shown in the next exhibit.
Exhibit - (NWW Figure 13.6 p. 377) [m12006.gif]
First recover the value A of p corresponding to the critical value (action
limit) of the test in the Neyman-Pearson method. The standardized
critical value was z(0.05) = -1.645; p0 is .50; s{p.} was 0.03071.
Thus the critical value is
A = p0 + z(0.05)s{p.} = 0.5 - 1.645(0.03071) = 0.4495.
When p = 0.35 the sampling distribution of p. is centered at p = 0.35 and
its standard deviation is
s{p.} = (p(1 - p)/n)^(1/2) = (0.35(1 - 0.35)/265)^(1/2) = 0.02930
Note: s{p.}, being a function of p, must be
recalculated for each different value of p.
The desired rejection probability is the probability that p. (given
p = 0.35) lies in the rejection region of the test, i.e. to the left of
A = 0.4495. Relative to the sampling distribution for p = 0.35, p.
= 0.4495 corresponds to the standardized score
z = (0.4495 - 0.35)/0.02930 = 3.40
The rejection probability P(H1|p = 0.35) is the shaded area
under the sampling distribution of p. to the left of 3.40 so that
P(H1|p = 0.35) = P(Z < 3.40) = 0.9997
Thus if p were, in fact, 0.35 the test is "powerful" in the sense that
it would almost certainly conclude H1.
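The power calculation above can be sketched in a few lines (recall that s{p.} must be recomputed at the assumed true p):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p0, p_true = 265, 0.5, 0.35
# Action limit A of the lower-tail test at a = .05 (z(0.05) = -1.645).
A = p0 - 1.645 * sqrt(p0 * (1 - p0) / n)  # 0.4495
# s{p.} recalculated at the true p.
s = sqrt(p_true * (1 - p_true) / n)       # 0.02930
power = phi((A - p_true) / s)             # P(H1 | p = 0.35)
print(round(A, 4), round(power, 4))       # 0.4495 0.9997
```

Repeating this for a range of p values (recalculating s{p.} each time) traces out the power curve discussed next.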
2. Rejection Probability Curve aka Power Curve
One can calculate the rejection probability for many values of p in the
same way (recalculating s{p.} for each value
of p) and construct the rejection probability aka power curve shown
in the next exhibit. Note that for any value of p < .40 the probability
of concluding H1 is quite high, i.e. the test is powerful.
Exhibit - (NWW Figure 13.7 p. 378) [m12007.gif]
3. Planning of Sample Size
Using the notion of power one can determine the sample size needed to control
both the a and b
risks at prespecified (small) levels.
The assumptions of the planning procedure are, as before:
- the size of the simple random sample ultimately determined is reasonably large
- the population is infinite or, if finite, the sampling fraction is small
Notation:
- p0 is the value of p where the a risk is controlled (as before)
- p1 is the value of p where the b risk is controlled
- z0 is the z value associated with probability a
- z1 is the z value associated with probability b
Then the required random sample size to control both the a
and b risks is
n = (|z1|(p1(1 - p1))^(1/2) + |z0|(p0(1 - p0))^(1/2))^2 / (p1 - p0)^2
where for a
One-Sided Upper-Tail Test (H0: p <= p0, H1: p > p0): z0 = z(1 - a), z1 = z(b)
One-Sided Lower-Tail Test (H0: p >= p0, H1: p < p0): z0 = z(a), z1 = z(1 - b)
Two-Sided Test (H0: p = p0, H1: p <> p0): z0 = z(1 - a/2), z1 = z(b)
Example: Fish Preference (cont'd). Suppose the sample size has not
yet been determined. The sample size is determined so that (see next
exhibit)
- the a risk is controlled at .05 when p0 = 0.5
- if only 40% of consumers prefer whiting, the probability b of concluding H0 (p >= 0.5) is to be controlled at .05
Then, using the formula above, one finds (see NWW p. 380)
n = 265
which is the sample size actually used in that study.
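A sketch of this calculation for the lower-tail test, with a = b = .05 so that |z0| = |z1| = 1.645:

```python
from math import sqrt

p0, p1 = 0.5, 0.4      # values of p where the a and b risks are controlled
z0, z1 = 1.645, 1.645  # |z(a)| and |z(1 - b)| for a = b = .05
# n = (|z1| sqrt(p1(1-p1)) + |z0| sqrt(p0(1-p0)))^2 / (p1 - p0)^2
n_raw = (z1 * sqrt(p1 * (1 - p1)) + z0 * sqrt(p0 * (1 - p0)))**2 / (p1 - p0)**2
print(round(n_raw))  # 265
```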
Exhibit - (NWW Figure 13.8 p. 380) [m12008.gif]
Thus power considerations can be used in grant proposals to justify the
necessity of large samples, ahead of the actual study, and therefore justify
the generous funds requested for data collection and handling.
This is the spirit in which behavior geneticist Michael Neale (author
of structural equations program Mx) has coined the immortal aphorism
Power is Money
Last modified 25 Oct 2002