Tests for Proportions and Means in R

Data

Elephants

The first dataset we will use for examples in the following tutorial is from surveys of elephant populations in Tanzania, originally collected by M. Chase and colleagues,(Reference: Chase et al. 2016, Continent-wide survey reveals massive decline in African savannah elephants, PeerJ. 2016 Aug 31;4:e2354.).

The variables in the dataset include:

Protected: Whether the habitat surveyed is protected or not
Density: Number of elephants per unit area
HighDensity: ‘Yes’ if there is more than 1 elephant per square km, and ‘No’ if there is not.

English Robins

We will also use a dataset on English robins from a 2007 paper by R. Fuller and colleagues (Reference: Fuller et al. 2007, Daytime noise predicts nocturnal singing in urban robins). This dataset includes the variables below, observed at n=121 different urban sites in Sheffield, England:

DaySong Whether robin song was Present or Absent during the day
NightSong Whether robin song was Present or Absent during the night
Year of data collection (2005 or 2006)
DayNoise Noise level recorded during the day
NightNoise Noise level recorded during the night
DiurnalDiff For each site, the night time noise level minus the day time noise level

One Proportion

Computing Summary Statistics

Recall that we can use the function tally() to compute sample proportions. For example, if we wanted to find the number of protected and unprotected sites at which elephants were surveyed, we could run:

tally(~ Protected, data=elephants)

## Protected
##  No Yes 
##  17  24

We could also format the same information as proportions rather than counts:

tally(~ Protected , data=elephants, format='prop')

## Protected
##        No       Yes 
## 0.4146341 0.5853659

Practice: tally()

Try computing the number of observations in 2005 and 2006 (variable Year) in the robins dataset.

tally(~..., data=...)

tally(~, data=robins)

tally(~Year, data=robins)

Practice: tally() with proportion output

What about computing the same table with proportions rather than counts as the output?

tally(~..., data=..., format=...)

tally(~, data=robins, format='prop')

tally(~Year, data=robins, format='prop')

1 proportion CI with summary data

What if we want to use prop.test() to find a confidence interval for the proportion elephant survey sites that are protected, using the summary information we just computed?

tally(~Protected, data=elephants)

## Protected
##  No Yes 
##  17  24

We need to provide 2 or 3 inputs to prop.test() to obtain a CI:

x, the number of “successes”. Here we consider a “success” a site that is protected, so there are 24 such sites
n, the sample size. Here it is 17+24 = 41
conf.level, the desired confidence level as a proportion (so for 95% confidence, we would use conf.level=0.95). This input is optional; if you leave it out, it will default to 0.95.

For example,

prop.test(x=24, n=41, conf.level=0.99)

## 
##  1-sample proportions test with continuity correction
## 
## data:  24 out of 41
## X-squared = 0.87805, df = 1, p-value = 0.3487
## alternative hypothesis: true p is not equal to 0.5
## 99 percent confidence interval:
##  0.3781014 0.7674946
## sample estimates:
##         p 
## 0.5853659

We are 99% confident that the true proportion elephant survey sites that are protected is between 0.378 and 0.767.

Practice: one-proportion CI

Your turn to compute a CI with summary data. Find a 98% CI for the proportion sites with day song in the robins dataset. First, be sure to check the tally():

tally(~DaySong, data=robins)

## DaySong
##  Absent Present 
##      54      67

prop.test(x=...,n=..., conf.level=0.98)

prop.test(x=67,n=54+67, conf.level=0.98)

1 proportion test with summary data

To carry out a test instead of computing a CI, we just have to specify our hypotheses. Some elephant management plans have set a goal to keep elephant density low (less than 1 animal per \(km^2\)) to prevent vegetation damage and habitat change. Imagine Tanzania set a goal to keep elephant density low at at least half of sites; is this goal being met?Let \(p\) be the proportion sites with HighDensity TRUE in Tanzania; we want to test:

\[H_0: p = 0.5\] \[H_1: p > 0.5\] The sample data indicate:

tally(~HighDensity, data=elephants)

## HighDensity
##  No Yes 
##  22  19

To carry out the test with prop.test, we need to specify just two more input arguments:

p, the null hypothesis proportion (here 0.5)
alternative, which can be either ‘two.sided’ (the default if you leave this input out), ‘greater’ (one-sided test with \(>\)), or ‘less’ (one-sided test with \(<\)). -Note we can now leave out conf.level as we are trying to obtain a p-value, not a CI.

Our test:

prop.test(x=19, n=19+22, p=0.5,
          alternative='greater')

## 
##  1-sample proportions test with continuity correction
## 
## data:  19 out of 19 + 22
## X-squared = 0.097561, df = 1, p-value = 0.6226
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.3303934 1.0000000
## sample estimates:
##         p 
## 0.4634146

Since the p-value is large, we can not reject \(H_0\) - based on these data it is reasonable to believe that elephant density is high in no more than half of sites.

Practice - one proportion test

What would you change to test whether the proportion low-density sites is no less than 0.5? You will still test \(H_0: p=0.5\), but now p is the true proportion low density sites.

Use prop.test to carry out the test with the summary data

tally(~HighDensity, data=elephants)

## HighDensity
##  No Yes 
##  22  19

prop.test(x=22,n=22+19, p=0.5
          alternative='less')

With the same p-value as before, you will again fail to reject the null!

1 proportion CI with full dataset

We can also (perhaps more usefully) compute CIs directly from a dataset using prop.test. For example, to find a 95% CI for the proportion high-density sites, we use the formula interface and the appropriate variable and dataset names. We also need to provide the input success, which gives the value of the variable of interest that we are interested in computing the proportion of.

prop.test(~HighDensity, data=elephants,
          success='Yes', conf.level=0.95)

## 
##  1-sample proportions test with continuity correction
## 
## data:  elephants$HighDensity  [with success = Yes]
## X-squared = 0.097561, df = 1, p-value = 0.7548
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.3096915 0.6238850
## sample estimates:
##         p 
## 0.4634146

We are 95% confident that the true proportion high density sites is between 0.31 and 0.62.

Practice

Your turn: compute a 99% CI for the proportion low-density sites.

prop.test(~..., data=..., success=..., 
          conf.level=...)

prop.test(~..., data=..., success='No', 
          conf.level=0.99)

prop.test(~HighDensity, data=elephants, success='No', 
          conf.level=0.99)

How would you state your conclusion in context?

1 proportion test with full dataset

To do a test instead, we simply need to specify p as we did with the summary data – p is the true proportion according to the null hypothesis. How would you use the full elephants dataset to test whether the true proportion high-density sites is no more than 0.5?

prop.test(~..., data=elephants, p=...,
          alternative=..., success=...)

prop.test(~HighDensity, data=elephants, p=0.5,
          alternative=..., success=...)

prop.test(~HighDensity, data=elephants, p=0.5,
          alternative='greater', success='Yes')

2 proportion CI

It is possible to compute a CI for a difference in two proportions using prop.test with summary information. For example, to find a 95% CI for the difference in proportion Sheffield sites with robin song in 2005 vs 2006, we can examine the data:

tally(~DaySong | Year, data=robins)

##          Year
## DaySong   2005 2006
##   Absent    22   32
##   Present   35   32

To pass this information to prop.test, we now need x to have two entries (the number of “successes” in 2005 and 2006) and n also needs to have two entries (the sample size in each group). We would run:

prop.test(x=c(35,32), n=c(22+35, 32+32),
          conf.level=0.95)

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(35, 32) out of c(22 + 35, 32 + 32)
## X-squared = 1.1586, df = 1, p-value = 0.2818
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.07855448  0.30662466
## sample estimates:
##    prop 1    prop 2 
## 0.6140351 0.5000000

Perhaps more simply, we can use the full dataset to make the same calculation:

prop.test(~DaySong | Year, data=robins,
          conf.level=0.95, success='Present')

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  tally(DaySong ~ Year)
## X-squared = 1.1586, df = 1, p-value = 0.2818
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.30662466  0.07855448
## sample estimates:
##    prop 1    prop 2 
## 0.3859649 0.5000000

Practice

How would you compute a 95% CI for the difference in proportion sites with Night song in 2005 vs 2006?

prop.test(~... | ..., data=..., conf.level=...,
          success=...)

prop.test(~... | ..., data=robins, conf.level=...,
          success='Present')

prop.test(~NightSong | Year, data=robins, conf.level=0.95,
          success='Present')

2 proportion test with full dataset

To carry out tests instead of the CIs above, we just have to specify our hypotheses to test, and the include the input alternative (either ‘two.sided’, the default; or ‘less’ or ‘greater’). The default option (and really the only option we will use) is to test whether the proportions in the two groups are equal.

How would you test the hypotheses below? Let \(p_{N2005}\) be the proportion sites with night song in 2005 and \(p_{N2006}\) be the proportion sites with night song in 2006.

prop.test(~...|..., data=..., 
          alternative=..., success=...)

prop.test(~NightSong|Year, data=robins, 
          alternative=..., success=...)

prop.test(~NightSong|Year, data=robins,
          success=...)

prop.test(~NightSong|Year, data=robins, 
          alternative='two.sided')

What do you conclude? Is the proportion night song different between two years? (This might be important if the researchers wanted to pool data from the two years and use it together to answer other questions…)

One-sample t interval

What if our sample statistic is a mean rather than a proportion? For example, what if we wanted to find a 95% confidence interval for the mean elephant density in Tanzania using the elephant data? We can use the function t.test(). The inputs will be:

a formula (giving the name of the variable for which we want to compute the mean, in the form (tilde) VariableName)
data (the name of the dataset, as usual)
conf.level (the desired confidence level)

So we would run:

t.test(~Density, data=elephants, conf.level=0.95)

## ~Density
## <environment: 0x00000000192a66d0>

## 
##  One Sample t-test
## 
## data:  Density
## t = 2.8077, df = 40, p-value = 0.007677
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.568758 3.491488
## sample estimates:
## mean of x 
##  2.030123

and then we could say we are 95% confident that the true mean density of elephants at Tanzania sites is between 0.569 and 3.49 elephants per \(km^2\).

Practice

How would you find a 90% t interval for the mean noise level (variable NightNoise) in the robins dataset?

t.test(~..., data=..., conf.level=...)

t.test(~NightNoise, data=robins, conf.level=...)

t.test(~NightNoise, data=robins, conf.level=0.90)

One-sample t-test

What if, instead of finding a CI, we wanted to test the null hypothesis that the true mean elephant density was above 1 per square km? If \(\mu\) is the true mean elephant density, we would test \(H_0: \mu = 1\) versus the alternative \(H_A: \mu > 1\). To do this with t.test, we have to specify the additional inputs:

mu, the mean under \(H_0\)
alternative (as before, it can be either ‘two.sided’, the default, or it can be ‘less’ or ‘greater’)

t.test(~Density, data=elephants, mu=1,
       alternative='greater')

## ~Density
## <environment: 0x00000000192a66d0>

## 
##  One Sample t-test
## 
## data:  Density
## t = 1.4247, df = 40, p-value = 0.081
## alternative hypothesis: true mean is greater than 1
## 95 percent confidence interval:
##  0.8125925       Inf
## sample estimates:
## mean of x 
##  2.030123

With a p-value of 0.081, we have only weak evidence (if any) against the null hypothesis. We would probably fail to reject the null and conclude that it’s reasonable to believe that the true density is 1 (or less) elephants per \(km^2\).

Practice

Your turn: How would you use the robins data to test the hypotheses below? Let \(\mu\) be the true mean noise level during the day. According to the Handbook of Environmental Noise, the typical suburban area has background noise levels of about 50dBA. Does Sheffield match this prediction?

\[H_0: \mu = 50\] \[H_A: \mu \neq 50\]

t.test(~..., data=..., mu=..., alternative=...)

t.test(~..., data=robins, mu=..., 
       alternative='two.sided')

t.test(~DayNoise, data=robins, mu=50, alternative='two.sided')

What would you conclude?

Paired t interval

We can use very similar code to compute a CI for a difference in means for paired data - we just need an appropriate dataset where the data is a list of differences for which we want to compute the mean. We might find a 95% CI for the difference in noise day minus night in the robins data, using the variable DiurnalDiff:

t.test(~DiurnalDiff, data=robins, conf.level=0.95)

## ~DiurnalDiff
## <environment: 0x00000000192a66d0>

## 
##  One Sample t-test
## 
## data:  DiurnalDiff
## t = 18.776, df = 120, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  7.593923 9.384259
## sample estimates:
## mean of x 
##  8.489091

We would find that we were 95% percent confident that the true difference between night and day, over all sites, is between 7.59 and 9.38 dB.

Paired t-test

How would you use t.test() with the same data to do a paired t-test of the hypotheses \(H_0: \mu_D = 0\) vs. \(H_A: \mu_D \neq 0\), where \(\mu_D\) is the true population mean of the day-night noise level differences?

t.test(~..., data=..., mu=...,
       alternative=...)

t.test(~DiurnalDiffs, data=robins, mu=0,
       alternative=...)

t.test(~DiurnalDiff, data=robins, mu=0,
       alternative='two.sided')

Two-sample t interval

What if we want to find a CI for a difference between the means of two groups? The only change from before is that we need to alter the formula to have the form: ** (tilde) MainVariableName (vertical line) GroupVariable, where MainVariableName** is the name of the variable of which you want to compute the mean, and GroupVariable is the name of a categorical variable that divides the data into 2 groups.

For example, we might want to find a 98% CI for the difference in DayNoise levels between the two years 2005 and 2006. How would you do it?

t.test(~... | ..., data=..., conf.level=...)

t.test(~DayNoise | Year, data= robins,
       conf.level=0.98)

We are 95% confident that the true difference in day noise levels by year is between -2.39 and 2.73. Since 0 is in the interval, we know that we would fail to reject the null hypothesis of no difference between years at the \(\alpha=0.02\) level, if we were to do the test. Let’s try…

Two-sample t-test

To convert the example above to a test, we just need to specify the alternative and mu, the difference in means according to the null hypothesis.

t.test(~...|..., data=..., mu=...,
       alternative=...)

t.test(~DayNoise|Year, data=robins, mu=0,
       alternative='two.sided')

As we expected, the p-value is very large (0.88), so we fail to reject the null - it is reasonable to think the noise levels are the same between years.