Confidence Interval for the Difference of Two Population Proportions

Formula for confidence interval for difference of two proportions. C.K. Taylor

Confidence intervals are one part of inferential statistics. The basic idea behind this topic is to estimate the value of an unknown population parameter by using a statistical sample. We can not only estimate the value of a parameter, but we can also adapt our methods to estimate the difference between two related parameters. For example, we may want to find the difference between the percentage of the male U.S. voting population who support a particular piece of legislation and the corresponding percentage of the female voting population.

We will see how to do this type of calculation by constructing a confidence interval for the difference of two population proportions. In the process we will examine some of the theory behind this calculation. Along the way we will see similarities to how we construct a confidence interval for a single population proportion and a confidence interval for the difference of two population means.

Generalities

Before looking at the specific formula that we will use, let's consider the overall framework that this type of confidence interval fits into. The type of confidence interval that we will look at has the following form:

Estimate +/- Margin of Error

Many confidence intervals are of this type. There are two numbers that we need to calculate. The first of these values is the estimate for the parameter. The second value is the margin of error. This margin of error accounts for the fact that we only have an estimate, not the exact value. The confidence interval provides us with a range of plausible values for our unknown parameter.

Conditions

We should make sure that all of the conditions are satisfied before doing any calculation. To find a confidence interval for the difference of two population proportions, we need to make sure that the following hold:

  • We have two simple random samples from large populations. Here "large" means that the population is at least 20 times larger than the size of the sample. The sample sizes will be denoted by n1 and n2.
  • Our individuals have been chosen independently of one another.
  • There are at least ten successes and ten failures in each of our samples.

If the last item in the list is not satisfied, then there may be a way around this. We can modify the plus-four confidence interval construction and obtain robust results. As we go forward we assume that all of the above conditions have been met.
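The numerical conditions above can be checked mechanically. The following is a minimal sketch, not a standard routine: the function name check_conditions and the sample figures are hypothetical, chosen only for illustration. Random sampling and independence depend on how the data were collected, so they cannot be verified from counts alone and are not checked here.

```python
def check_conditions(successes, sample_size, population_size):
    """Return the numerical checks described above for one sample."""
    failures = sample_size - successes
    return {
        # "Large" population: at least 20 times the size of the sample.
        "large_population": population_size >= 20 * sample_size,
        # At least ten successes and at least ten failures in the sample.
        "enough_successes": successes >= 10,
        "enough_failures": failures >= 10,
    }


# Hypothetical sample: 120 successes out of 200, from a population of 50,000.
result = check_conditions(successes=120, sample_size=200, population_size=50_000)
print(result)
print("All numerical conditions met:", all(result.values()))
```

The same check would be run once for each of the two samples.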

Samples and Population Proportions

Now we are ready to construct our confidence interval. We start with the estimate for the difference between our population proportions. Both of these population proportions are estimated by a sample proportion. These sample proportions are statistics that are found by dividing the number of successes in each sample by the respective sample size.

The first population proportion is denoted by p1. If the number of successes in our sample from this population is k1, then we have a sample proportion of k1 / n1.

We denote this statistic by p̂1. We read this symbol as "p1-hat" because it looks like the symbol p1 with a hat on top.

In a similar way we can calculate a sample proportion from our second population. The parameter from this population is p2. If the number of successes in our sample from this population is k2, then our sample proportion is p̂2 = k2 / n2.

These two statistics become the first part of our confidence interval. The estimate of p1 is p̂1. The estimate of p2 is p̂2. So the estimate for the difference p1 - p2 is p̂1 - p̂2.
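As a small illustration of the estimate, here is a sketch in Python. The counts (120 successes out of 200 in the first sample, 100 out of 250 in the second) are hypothetical values made up for this example, and sample_proportion is simply a helper name introduced here.

```python
def sample_proportion(successes, sample_size):
    """Sample proportion: number of successes divided by the sample size."""
    return successes / sample_size


# Hypothetical counts, for illustration only.
k1, n1 = 120, 200
k2, n2 = 100, 250

p_hat1 = sample_proportion(k1, n1)  # estimate of p1
p_hat2 = sample_proportion(k2, n2)  # estimate of p2
estimate = p_hat1 - p_hat2          # estimate of p1 - p2

print(p_hat1, p_hat2, estimate)
```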

Sampling Distribution of the Difference of Sample Proportions

Next we need to obtain the formula for the margin of error. To do this we will first consider the sampling distribution of p̂1. The number of successes in the first sample has a binomial distribution with probability of success p1 and n1 trials, and p̂1 is this count divided by n1. The mean of the sampling distribution of p̂1 is the proportion p1. Its variance is p1 (1 - p1)/n1, so its standard deviation is the square root of this quantity.

The sampling distribution of p̂2 is similar to that of p̂1. Simply change all of the indices from 1 to 2 and we have a sampling distribution with mean of p2 and variance of p2 (1 - p2)/n2.

We now need a few results from mathematical statistics in order to determine the sampling distribution of p̂1 - p̂2. The mean of this distribution is p1 - p2. Because the two samples are independent, the variances add together, so the variance of the sampling distribution is p1 (1 - p1)/n1 + p2 (1 - p2)/n2. The standard deviation of the distribution is the square root of this formula.
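The mean and variance stated above can be checked numerically. The sketch below assumes two independent samples and enumerates every possible pair of success counts, weighting each pair by its binomial probability. The parameter values (p1 = 0.6, n1 = 30, p2 = 0.4, n2 = 40) are arbitrary choices for illustration; in practice p1 and p2 are unknown.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_mean_and_variance(n1, p1, n2, p2):
    """Exact mean and variance of p-hat1 - p-hat2 for independent samples."""
    pmf2 = [binomial_pmf(k2, n2, p2) for k2 in range(n2 + 1)]
    mean = 0.0
    second_moment = 0.0
    for k1 in range(n1 + 1):
        w1 = binomial_pmf(k1, n1, p1)
        for k2 in range(n2 + 1):
            weight = w1 * pmf2[k2]       # independence: joint probability is a product
            diff = k1 / n1 - k2 / n2     # value of p-hat1 - p-hat2 for this outcome
            mean += weight * diff
            second_moment += weight * diff * diff
    return mean, second_moment - mean**2


# Arbitrary illustrative parameters.
n1, p1, n2, p2 = 30, 0.6, 40, 0.4
mean, variance = exact_mean_and_variance(n1, p1, n2, p2)
formula = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2

print("mean:", mean, "expected:", p1 - p2)
print("variance:", variance, "formula:", formula)
```

The enumerated variance should agree with the sum of the two individual variances up to floating-point rounding.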

There are a couple of adjustments that we need to make. The first is that the formula for the standard deviation of p̂1 - p̂2 uses the unknown parameters p1 and p2. Of course, if we really knew these values, then it would not be an interesting statistical problem at all. We would not need to estimate the difference between p1 and p2. Instead we could simply calculate the exact difference.

This problem can be fixed by calculating a standard error rather than a standard deviation. All that we need to do is to replace the population proportions by sample proportions. Standard errors are based upon statistics instead of parameters. A standard error is useful because it effectively estimates a standard deviation. What this means for us is that we no longer need to know the value of the parameters p1 and p2. Since the sample proportions are known, the standard error is given by the square root of the following expression:

p̂1 (1 - p̂1)/n1 + p̂2 (1 - p̂2)/n2
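A minimal sketch of the standard error calculation follows. The function name standard_error and the input values (sample proportions 0.6 and 0.4 with sample sizes 200 and 250) are hypothetical and serve only as an illustration.

```python
from math import sqrt


def standard_error(p_hat1, n1, p_hat2, n2):
    """Standard error of p-hat1 - p-hat2, using sample proportions in place of p1 and p2."""
    return sqrt(p_hat1 * (1 - p_hat1) / n1 + p_hat2 * (1 - p_hat2) / n2)


# Hypothetical sample proportions and sample sizes.
se = standard_error(p_hat1=0.6, n1=200, p_hat2=0.4, n2=250)
print(se)
```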

The second item that we need to address is the particular form of our sampling distribution. It turns out that we can use a normal distribution to approximate the sampling distribution of p̂1 - p̂2. The reason for this is somewhat technical, but is outlined in the next paragraph.

Both p̂1 and p̂2 have a sampling distribution that is based on a binomial distribution. Each of these distributions may be approximated quite well by a normal distribution. The difference p̂1 - p̂2 is a random variable formed as a linear combination of two independent random variables, each of which is approximated by a normal distribution. Therefore the sampling distribution of p̂1 - p̂2 is also approximately normal.
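To get a feel for how good the normal approximation is for a single sample proportion, the sketch below compares the exact cumulative distribution of a sample proportion with the normal distribution that has the same mean and variance. The helper name max_cdf_gap and the values n = 200, n = 20 and p = 0.6 are assumptions made for this illustration, and the gap is measured only at the attainable values k/n.

```python
from math import comb, sqrt
from statistics import NormalDist


def max_cdf_gap(n, p):
    """Largest gap, at the attainable values k/n, between the exact CDF of a
    sample proportion and the CDF of its normal approximation."""
    approx = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))
    exact_cdf = 0.0
    largest = 0.0
    for k in range(n + 1):
        # Accumulate the binomial probability of exactly k successes.
        exact_cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        largest = max(largest, abs(exact_cdf - approx.cdf(k / n)))
    return largest


# Illustrative values: the gap shrinks as the sample size grows.
print("n = 20: ", max_cdf_gap(n=20, p=0.6))
print("n = 200:", max_cdf_gap(n=200, p=0.6))
```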

Confidence Interval Formula

We now have everything we need to assemble our confidence interval. The estimate is (p̂1 - p̂2) and the margin of error is z* [ p̂1 (1 - p̂1)/n1 + p̂2 (1 - p̂2)/n2 ]^0.5. The value that we enter for z* is dictated by the level of confidence C. Commonly used values for z* are 1.645 for 90% confidence and 1.96 for 95% confidence. The value of z* is the point on the standard normal distribution for which exactly C percent of the distribution lies between -z* and z*.
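The critical value z* can be looked up in a table, or computed. One possible way, sketched here with NormalDist from Python's standard statistics module, uses the fact that if C of the distribution lies between -z* and z*, then (1 + C)/2 of it lies below z*. The function name critical_value is a hypothetical helper introduced for this example.

```python
from statistics import NormalDist


def critical_value(confidence):
    """z* such that a fraction `confidence` of the standard normal
    distribution lies between -z* and z*."""
    return NormalDist().inv_cdf((1 + confidence) / 2)


for level in (0.90, 0.95):
    print(level, round(critical_value(level), 3))
```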

The following formula gives us a confidence interval for the difference of two population proportions:

(p̂1 - p̂2) +/- z* [ p̂1 (1 - p̂1)/n1 + p̂2 (1 - p̂2)/n2 ]^0.5
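Putting the pieces together, the interval can be computed as in the following sketch. It assumes the conditions listed earlier have been met. The function name two_proportion_interval and the sample counts are hypothetical, used only to show the arithmetic.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_interval(k1, n1, k2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 from success counts and sample sizes."""
    p_hat1 = k1 / n1
    p_hat2 = k2 / n2
    estimate = p_hat1 - p_hat2
    # Standard error: sample proportions stand in for the unknown p1 and p2.
    se = sqrt(p_hat1 * (1 - p_hat1) / n1 + p_hat2 * (1 - p_hat2) / n2)
    # Critical value z* for the requested confidence level.
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin_of_error = z_star * se
    return estimate - margin_of_error, estimate + margin_of_error


# Hypothetical data: 120 of 200 in the first sample, 100 of 250 in the second.
lower, upper = two_proportion_interval(k1=120, n1=200, k2=100, n2=250)
print(f"95% confidence interval for p1 - p2: ({lower:.4f}, {upper:.4f})")
```

With these made-up counts the estimate is 0.2, and the interval extends roughly 0.09 on either side of it.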

Taylor, Courtney. "Confidence Interval for the Difference of Two Population Proportions." ThoughtCo, Apr. 5, 2023, thoughtco.com/difference-of-two-population-proportions-4061672.