SOCI208 - Module 12 - Inference for the Population Proportion

1.  Population and Sample Proportions

1.  Population Proportion

A population proportion, denoted p, is the proportion of the elements of a population that have a particular characteristic of interest.

Example: the proportion of all registered voters who favor a particular candidate, or the proportion of U.S. households that own their home.

2.  Sample Proportion

The number of sample observations with the characteristic of interest in a simple random sample of size n is denoted by f and is called the sample frequency of the characteristic.
The proportion of sample observations with the characteristic of interest is denoted by p. and is called the sample proportion of the characteristic.
The sample proportion and sample frequency are related by
p. = f/n
Example: if f = 30 respondents in a simple random sample of n = 120 have the characteristic, then p. = 30/120 = 0.25.

2.  Sampling Distribution of f and p.

1.  The Sampling Distribution of f and p. is a Binomial Distribution!

Consider a simple random sample of size n from an infinite population of elements that have a certain characteristic with probability p.
Then the outcome of any one observation in the sample is a Bernoulli RV B with value 1 if the element has the characteristic of interest, and 0 otherwise, and P(B = 1) = p, where p is the population proportion.
The n observations B1, B2, ..., Bn are independent.  The sample frequency f is the sum of these n independent Bernoulli RVs so that
f = B1 + B2 + ... + Bn
Thus the sample frequency f is a binomial RV.
Therefore, when a simple random sample of size n is drawn from an infinite population with population proportion p, the sample frequency f has a sampling distribution given by the binomial probability function
P(f) = (n!/(f!(n - f)!)) p^f (1 - p)^(n - f)
where P(f) denotes the probability that f elements in the sample have the given characteristic, f = 0, 1, ..., n.
Furthermore the sample proportion p. has the same probability distribution as the corresponding f, because p. is simply f divided by a fixed constant n.
The identity of the sampling distributions of f and p. is shown in the next exhibit.
Exhibit - (NWW Figure 13.1 p. 366) [m12001.gif]
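The binomial probability function is straightforward to compute.  A minimal Python sketch (the function name binom_pmf is ours, not from NWW), tabulating the common sampling distribution of f and p. for the n = 7, p = 0.25 example used below:

    from math import comb

    def binom_pmf(f, n, p):
        """P(f) = (n!/(f!(n - f)!)) p^f (1 - p)^(n - f)"""
        return comb(n, f) * p**f * (1 - p)**(n - f)

    # Tabulate the sampling distribution of f and p. = f/n
    n, p = 7, 0.25
    for f in range(n + 1):
        print(f, f / n, binom_pmf(f, n, p))

Each row gives a value of f, the corresponding p. = f/n, and their common probability P(f), illustrating that f and p. have the same distribution.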

2.  Characteristics of the Sampling Distributions of f and p.

Mean, variance, and standard deviation of f and p. are shown in the following table.
 
Table 1.  Mean, Variance, and Standard Deviation of Sampling Distributions of f and p.

  Sampling Distribution of f        Sampling Distribution of p.
  E{f} = np                         E{p.} = p
  s^2{f} = np(1 - p)                s^2{p.} = p(1 - p)/n
  s{f} = (np(1 - p))^(1/2)          s{p.} = (p(1 - p)/n)^(1/2)

For derivations of E{p.} = p and s^2{p.} = p(1 - p)/n see NWW p. 367 (bottom).

Example: n = 7 and p = 0.25
 
  E{f} = 7(0.25) = 1.75                 E{p.} = 0.25
  s^2{f} = 7(0.25)(0.75) = 1.3125       s^2{p.} = (0.25)(0.75)/7 = 0.026786
  s{f} = 1.3125^(1/2) = 1.145644        s{p.} = 0.026786^(1/2) = 0.163663

The shape of the sampling distribution of f and p. is

skewed to the right if p < 0.5
skewed to the left if p > 0.5
symmetrical if p = 0.5
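These values can be verified numerically from the binomial probabilities.  A Python sketch (assuming the binom_pmf function defined above):

    from math import sqrt

    n, p = 7, 0.25
    pmf = [binom_pmf(f, n, p) for f in range(n + 1)]

    mean_f = sum(f * pmf[f] for f in range(n + 1))               # E{f} = np = 1.75
    var_f = sum((f - mean_f)**2 * pmf[f] for f in range(n + 1))  # s^2{f} = np(1 - p) = 1.3125
    print(mean_f, var_f, sqrt(var_f))    # 1.75  1.3125  1.145644
    print(mean_f / n, var_f / n**2)      # E{p.} = 0.25, s^2{p.} = 0.026786

The moments of p. follow from those of f by dividing by n (for the mean) and by n^2 (for the variance), since p. = f/n.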

3.  Central Limit Theorem

Note that the sample proportion p. is a special case of the sample mean X., since p. = f/n = (B1 + ... + Bn)/n, the mean of the Bi.
Thus all the results for X. hold, including the CLT.
When the sampling distribution is skewed because p <> 0.5, the skewness decreases as n increases, as seen in the next exhibit.
Exhibit - (NWW Figure 13.2 p. 368) [m12002.gif]
CLT for f and p. - The sampling distributions of f and p. are approximately normal when the size n of the simple random sample is sufficiently large.
Rule of thumb - The normal approximation is adequate when both np >= 5 and n(1 - p) >= 5.

4.  Using the Normal Approximation for f and p.

1.  Finding Probability Using the Normal Approximation for p.
When the rule of thumb for the normal approximation is satisfied, probabilities for f and p. are approximated using the standardized variables
Z = (f - E{f})/s{f} = (f - np)/(np(1 - p))^(1/2)
Z = (p. - E{p.})/s{p.} = (p. - p)/(p(1 - p)/n)^(1/2)
Example: Suppose a sample of n = 200 is drawn from an infinite population with p = 0.7.  One wants to know the probability that p. is between 0.65 and 0.75, using the normal approximation.  Calculate
s{p.} = ((0.7)(0.3)/200)^(1/2) = 0.0324
Then the z values are
z = (0.65 - 0.70)/0.0324 = -1.54
z = (0.75 - 0.70)/0.0324 = +1.54
So that
P{0.65 <= p. <= 0.75} = P{-1.54 <= Z <= 1.54} = 0.876
Exhibit - (NWW Figure 13.3 p. 370) [m12003.gif]
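The same calculation in Python, with the standard normal CDF built from math.erf (Phi is our helper name):

    from math import erf, sqrt

    def Phi(z):
        """Standard normal CDF."""
        return 0.5 * (1 + erf(z / sqrt(2)))

    n, p = 200, 0.7
    s = sqrt(p * (1 - p) / n)        # s{p.} = 0.0324
    z_lo = (0.65 - p) / s            # -1.54
    z_hi = (0.75 - p) / s            # +1.54
    print(Phi(z_hi) - Phi(z_lo))     # ~0.876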
2.  Correction for Continuity
See NWW Section 13.2 pp. 370-371.
Exhibit - (NWW Figure 13.4 p. 371) [m12004.gif]
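As a sketch of the idea (details in NWW): when approximating a binomial probability such as P{f <= k} by the normal, one evaluates the normal CDF at k + 0.5 rather than at k.  The Python example below uses our own illustrative numbers, not figures from the text:

    from math import comb, erf, sqrt

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    n, p, k = 200, 0.7, 145
    exact = sum(comb(n, f) * p**f * (1 - p)**(n - f) for f in range(k + 1))
    mu, sd = n * p, sqrt(n * p * (1 - p))
    print(exact)                      # exact binomial P{f <= 145}
    print(Phi((k - mu) / sd))         # normal approximation, no correction
    print(Phi((k + 0.5 - mu) / sd))   # with continuity correction, closer to exact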

5.  Sampling Finite Populations

All previous results apply also to finite populations if the sampling fraction n/N is 5% or less (same rule as for sampling distribution of X.).

3.  Interval Estimation of Population Proportion

1.  CI for p - Large Sample

To construct a CI for p one needs an estimate of s{p.}.  The estimated variance and standard deviation of p. are calculated as
s^2{p.} = p.(1 - p.)/(n - 1)
s{p.} = (p.(1 - p.)/(n - 1))^(1/2)
The CI is based on an extension of the CLT stating that (p. - p)/s{p.} is approximately ~N(0,1) when the size of the simple random sample is sufficiently large.
Then, when the size of the simple random sample is sufficiently large, the approximate 1 - a confidence limits for p are
p. +/- z(1 - a/2)s{p.}
where s{p.} is given by the formula above.

Example (NWW p. 375): Fish Preference.  A British food company tests consumer preference for whiting versus cod.  A sample of n = 265 consumers is given a blind taste test of both types of fish prepared in the same fashion.  Let f and p. denote the number and proportion of consumers who prefer whiting over cod.  It is found that f = 144 and p. = 144/265 = 0.5434.  The company wants a 95% CI for p.
First calculate

s{p.} = ((0.5434)(1 - 0.5434)/264)^(1/2) = 0.0306567
Thus the confidence limits are
L = 0.5434 - (1.960)(0.0306567) = 0.4833
U = 0.5434 + (1.960)(0.0306567) = 0.6035
With 95% confidence the proportion of consumers who prefer whiting is between 48.33% and 60.35%.
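A sketch of the interval calculation in Python (prop_ci is our own helper name):

    from math import sqrt

    def prop_ci(f, n, z=1.960):
        """Large-sample confidence limits for p, with s{p.} based on n - 1 as in the text."""
        phat = f / n
        s = sqrt(phat * (1 - phat) / (n - 1))
        return phat - z * s, phat + z * s

    print(prop_ci(144, 265))   # (0.4833, 0.6035)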

2.  Planning of Sample Size

Under the same assumptions as before, given a confidence level 1 - a and a desired half-width h for the interval, one can specify the sample size n required to achieve this degree of confidence and precision as
n = z^2 p(1 - p)/h^2
where
z = z(1 - a/2)
p is the planning value for the population proportion
h is the desired half-width of the confidence interval
Example (NWW p. 373): Worker Location.  In a study of the spatial distribution of the labor force in a large metropolitan area one wants to estimate the proportion p of workers whose place of employment is within 15 miles of their residence.  It is desired to select a random sample of size n sufficient to provide a 95% CI for p with a half-width h = 0.02.
Based on other studies, a planning value of 0.9 is chosen.
Since 1 - a = 0.95, z(1 - a/2) = z(0.975) = 1.960.
Thus
n = (1.96)^2(0.9)(0.1)/(0.02)^2 = 864
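The same computation in Python (ci_sample_size is our own name):

    def ci_sample_size(p_plan, h, z=1.960):
        """n = z^2 p(1 - p)/h^2 for a CI of half-width h."""
        return z**2 * p_plan * (1 - p_plan) / h**2

    print(ci_sample_size(0.9, 0.02))   # 864.36, the n = 864 above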

4.  Tests for Population Proportion

When the size of the simple random sample is sufficiently large, the test statistic
z* = (p. - p0)/s{p.}
where
s{p.} = (p0(1 - p0)/n)^(1/2)
is approximately ~N(0,1) when p = p0.
Note that the estimate of s{p.} is based on the hypothetical value p0.

Example: Fish Preference (cont'd).  The British firm currently uses cod.  It would like to switch to whiting, which is cheaper, unless the data show, at the a = .05 level, that a majority of consumers prefer cod to whiting.  With p representing the proportion of consumers who prefer whiting, the alternatives are

H0: p >= 0.5
H1: p < 0.5
Of 265 consumers in a blind taste test f = 144 preferred whiting to cod, so that p. = 144/265 = 0.5434.
(At this point one could already see that H1 cannot be concluded, since p. > p0, but continue for the sake of illustration.)
The test statistic is
z* = (p. - p0)/s{p.} = (0.5434 - 0.5)/0.03071 = 1.41
This is a one-sided lower-tail test with P-value = P{Z <= 1.41} = 0.9207.
Since the P-value 0.9207 > .05, one concludes H0: there is no evidence that consumers prefer cod to whiting.
Exhibit - (NWW Figure 13.5 p. 375) [m12005.gif]
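A Python sketch of the test calculation (prop_test_lower is our own name):

    from math import erf, sqrt

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    def prop_test_lower(f, n, p0):
        """z* and P-value for the lower-tail test H0: p >= p0 vs H1: p < p0."""
        s = sqrt(p0 * (1 - p0) / n)   # s{p.} based on the hypothetical value p0
        z_star = (f / n - p0) / s
        return z_star, Phi(z_star)    # lower-tail P-value

    print(prop_test_lower(144, 265, 0.5))   # (1.41, 0.92)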

5.  Power and Planning Sample Size for Test About p

1.  Type I and Type II Errors (Repeat from Module 11)

Types of Errors & Power of Test
(rows: conclusion from sample; columns: true alternative)

                        H0                      H1
  H0                    Correct                 Type II Error
                        p = 1 - a               "False Negative"
                                                p = b
  H1                    Type I Error            Correct
                        "False Positive"        p = 1 - b
                        p = a                   = Power of Test

2.  Power of Test

1.  Rejection Probability
The power of a test is the rejection probability P(H1|p) of the test, the probability that the decision rule will lead to conclusion H1 for a given value of the population proportion p.  This probability is a function of p.

Example: Fish Preference (cont'd).  In this example p is the proportion of consumers who prefer whiting to cod.  The hypothesis tested was H0: p >= 0.5; H1: p < 0.5.
What is the probability of rejecting H0 if the actual value of p is, in fact, p = 0.35?
The situation is shown in the next exhibit.

Exhibit - (NWW Figure 13.6 p. 377) [m12006.gif]
First recover the value A of p corresponding to the critical value (action limit) of the test in the Neyman-Pearson method.  The standardized critical value was z(0.05) = -1.645; p0 is .50; s{p.} was 0.03071.  Thus the critical value is
A = p0 + z(0.05)s{p.} = 0.5 - 1.645(0.03071) = 0.4495.
When p = 0.35 the sampling distribution of p. is centered at p = 0.35 and its standard deviation is
s{p.} = (p(1 - p)/n)^(1/2) = (0.35(1 - 0.35)/265)^(1/2) = 0.02930
Note: s{p.}, being a function of p, must be recalculated for each different value of p.
The desired rejection probability is the probability that p. (given p = 0.35) lies in the rejection region of the test, i.e. to the left of A = 0.4495.  Relative to the sampling distribution for p = 0.35, p. = 0.4495 corresponds to a standardized score
z = (0.4495 - 0.35)/0.02930 = 3.40
The rejection probability P(H1|p = 0.35) is the shaded area under the sampling distribution of p. to the left of 3.40 so that
P(H1|p = 0.35) = P(Z < 3.40) = 0.9997
Thus if p were, in fact, 0.35 the test is "powerful" in the sense that it would almost certainly conclude H1.
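The calculation can be scripted as follows (power_lower is our own name):

    from math import erf, sqrt

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    def power_lower(p, n=265, p0=0.5, z_alpha=-1.645):
        """Rejection probability P(H1|p) for the lower-tail test at level a = .05."""
        A = p0 + z_alpha * sqrt(p0 * (1 - p0) / n)   # action limit, 0.4495
        s = sqrt(p * (1 - p) / n)                    # s{p.} recalculated for this p
        return Phi((A - p) / s)

    print(power_lower(0.35))   # ~0.9997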
2.  Rejection Probability Curve aka Power Curve
One can calculate the rejection probability for many values of p in the same way (recalculating s{p.} for each value of p) and construct the rejection probability curve, aka power curve, shown in the next exhibit.  Note that for any value of p < .40 the probability of concluding H1 is quite high, i.e. the test is powerful; a tabulation follows the exhibit.
Exhibit - (NWW Figure 13.7 p. 378) [m12007.gif]
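Using the power_lower function above, the curve can be tabulated over a grid of p values:

    for p in (0.30, 0.35, 0.40, 0.45, 0.50):
        print(p, round(power_lower(p), 4))

At p = 0.50 = p0 the rejection probability equals the a risk .05, and it rises toward 1 as p falls below .40.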

3.  Planning of Sample Size

Using the notion of power one can determine the sample size needed to control both the a and b risks at prespecified (small) levels.
The assumptions of the planning procedure are as before.  Notation: let p0 denote the population proportion specified by H0, p1 the alternative value of p at which the b risk is to be controlled, and z0, z1 the standard normal percentiles defined below.  Then the required simple random sample size to control both the a and b risks is
n = (|z1|(p1(1 - p1))^(1/2) + |z0|(p0(1 - p0))^(1/2))^2 / (p1 - p0)^2
where z0 and z1 depend on the form of the test:
One-Sided Upper-Tail Test (H0: p <= p0, H1: p > p0): z0 = z(1 - a), z1 = z(b)
One-Sided Lower-Tail Test (H0: p >= p0, H1: p < p0): z0 = z(a), z1 = z(1 - b)
Two-Sided Test (H0: p = p0, H1: p <> p0): z0 = z(1 - a/2), z1 = z(b)
Example: Fish Preference (cont'd).  Suppose the sample size has not yet been determined.  The sample size is determined so that the a risk is 0.05 when p = p0 = 0.50 and the b risk is 0.05 when p = p1 = 0.40 (see next exhibit).  Then, using the formula above, one finds (see NWW p. 380)
n = 265
which is the sample size actually used in that study.
Exhibit - (NWW Figure 13.8 p. 380) [m12008.gif]
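A Python sketch of this formula (test_sample_size is our own name), reproducing n = 265:

    from math import sqrt

    def test_sample_size(p0, p1, z0, z1):
        """n = (|z1|(p1(1 - p1))^(1/2) + |z0|(p0(1 - p0))^(1/2))^2 / (p1 - p0)^2"""
        num = abs(z1) * sqrt(p1 * (1 - p1)) + abs(z0) * sqrt(p0 * (1 - p0))
        return num**2 / (p1 - p0)**2

    # Lower-tail test with a = b = .05: z0 = z(.05) = -1.645, z1 = z(.95) = 1.645
    print(test_sample_size(0.5, 0.40, -1.645, 1.645))   # ~265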
Thus power considerations can be used in grant proposals to justify the necessity of large samples ahead of the actual study, and therefore to justify the generous funds requested for data collection and handling.
This is the spirit in which behavior geneticist Michael Neale (author of the structural equation modeling program Mx) coined the immortal aphorism
Power is Money




Last modified 25 Oct 2002