The first dataset we will use for examples in the following tutorial is from surveys of elephant populations in Tanzania, originally collected by M. Chase and colleagues,(Reference: Chase et al. 2016, Continent-wide survey reveals massive decline in African savannah elephants, PeerJ. 2016 Aug 31;4:e2354.).
The variables in the dataset include:
Protected
: Whether the habitat surveyed is protected or notDensity
: Number of elephants per unit areaHighDensity
: ‘Yes’ if there is more than 1 elephant per square km, and ‘No’ if there is not.We will also use a dataset on English robins from a 2007 paper by R. Fuller and colleagues (Reference: Fuller et al. 2007, Daytime noise predicts nocturnal singing in urban robins). This dataset includes the variables below, observed at n=121 different urban sites in Sheffield, England:
DaySong
Whether robin song was Present
or Absent
during the dayNightSong
Whether robin song was Present
or Absent
during the nightYear
of data collection (2005 or 2006)DayNoise
Noise level recorded during the dayNightNoise
Noise level recorded during the nightDiurnalDiff
For each site, the night time noise level minus the day time noise levelRecall that we can use the function tally()
to compute sample proportions. For example, if we wanted to find the number of protected and unprotected sites at which elephants were surveyed, we could run:
tally(~ Protected, data=elephants)
## Protected
## No Yes
## 17 24
We could also format the same information as proportions rather than counts:
tally(~ Protected , data=elephants, format='prop')
## Protected
## No Yes
## 0.4146341 0.5853659
Try computing the number of observations in 2005 and 2006 (variable Year
) in the robins
dataset.
tally(~..., data=...)
tally(~, data=robins)
tally(~Year, data=robins)
What about computing the same table with proportions rather than counts as the output?
tally(~..., data=..., format=...)
tally(~, data=robins, format='prop')
tally(~Year, data=robins, format='prop')
What if we want to use prop.test()
to find a confidence interval for the proportion elephant survey sites that are protected, using the summary information we just computed?
tally(~Protected, data=elephants)
## Protected
## No Yes
## 17 24
We need to provide 2 or 3 inputs to prop.test()
to obtain a CI:
x
, the number of “successes”. Here we consider a “success” a site that is protected, so there are 24 such sitesn
, the sample size. Here it is 17+24 = 41conf.level
, the desired confidence level as a proportion (so for 95% confidence, we would use conf.level=0.95
). This input is optional; if you leave it out, it will default to 0.95.For example,
prop.test(x=24, n=41, conf.level=0.99)
##
## 1-sample proportions test with continuity correction
##
## data: 24 out of 41
## X-squared = 0.87805, df = 1, p-value = 0.3487
## alternative hypothesis: true p is not equal to 0.5
## 99 percent confidence interval:
## 0.3781014 0.7674946
## sample estimates:
## p
## 0.5853659
We are 99% confident that the true proportion elephant survey sites that are protected is between 0.378 and 0.767.
Your turn to compute a CI with summary data. Find a 98% CI for the proportion sites with day song in the robins
dataset. First, be sure to check the tally()
:
tally(~DaySong, data=robins)
## DaySong
## Absent Present
## 54 67
prop.test(x=...,n=..., conf.level=0.98)
prop.test(x=67,n=54+67, conf.level=0.98)
To carry out a test instead of computing a CI, we just have to specify our hypotheses. Some elephant management plans have set a goal to keep elephant density low (less than 1 animal per \(km^2\)) to prevent vegetation damage and habitat change. Imagine Tanzania set a goal to keep elephant density low at at least half of sites; is this goal being met?Let \(p\) be the proportion sites with HighDensity
TRUE in Tanzania; we want to test:
\[H_0: p = 0.5\] \[H_1: p > 0.5\] The sample data indicate:
tally(~HighDensity, data=elephants)
## HighDensity
## No Yes
## 22 19
To carry out the test with prop.test
, we need to specify just two more input arguments:
p
, the null hypothesis proportion (here 0.5)alternative
, which can be either ‘two.sided’ (the default if you leave this input out), ‘greater’ (one-sided test with \(>\)), or ‘less’ (one-sided test with \(<\)). -Note we can now leave out conf.level
as we are trying to obtain a p-value, not a CI.Our test:
prop.test(x=19, n=19+22, p=0.5,
alternative='greater')
##
## 1-sample proportions test with continuity correction
##
## data: 19 out of 19 + 22
## X-squared = 0.097561, df = 1, p-value = 0.6226
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.3303934 1.0000000
## sample estimates:
## p
## 0.4634146
Since the p-value is large, we can not reject \(H_0\) - based on these data it is reasonable to believe that elephant density is high in no more than half of sites.
What would you change to test whether the proportion low-density sites is no less than 0.5? You will still test \(H_0: p=0.5\), but now p is the true proportion low density sites.
Use prop.test
to carry out the test with the summary data
tally(~HighDensity, data=elephants)
## HighDensity
## No Yes
## 22 19
prop.test(x=22,n=22+19, p=0.5
alternative='less')
With the same p-value as before, you will again fail to reject the null!
We can also (perhaps more usefully) compute CIs directly from a dataset using prop.test
. For example, to find a 95% CI for the proportion high-density sites, we use the formula interface and the appropriate variable and dataset names. We also need to provide the input success
, which gives the value of the variable of interest that we are interested in computing the proportion of.
prop.test(~HighDensity, data=elephants,
success='Yes', conf.level=0.95)
##
## 1-sample proportions test with continuity correction
##
## data: elephants$HighDensity [with success = Yes]
## X-squared = 0.097561, df = 1, p-value = 0.7548
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3096915 0.6238850
## sample estimates:
## p
## 0.4634146
We are 95% confident that the true proportion high density sites is between 0.31 and 0.62.
Your turn: compute a 99% CI for the proportion low-density sites.
prop.test(~..., data=..., success=...,
conf.level=...)
prop.test(~..., data=..., success='No',
conf.level=0.99)
prop.test(~HighDensity, data=elephants, success='No',
conf.level=0.99)
How would you state your conclusion in context?
To do a test instead, we simply need to specify p
as we did with the summary data – p
is the true proportion according to the null hypothesis. How would you use the full elephants
dataset to test whether the true proportion high-density sites is no more than 0.5?
prop.test(~..., data=elephants, p=...,
alternative=..., success=...)
prop.test(~HighDensity, data=elephants, p=0.5,
alternative=..., success=...)
prop.test(~HighDensity, data=elephants, p=0.5,
alternative='greater', success='Yes')
It is possible to compute a CI for a difference in two proportions using prop.test
with summary information. For example, to find a 95% CI for the difference in proportion Sheffield sites with robin song in 2005 vs 2006, we can examine the data:
tally(~DaySong | Year, data=robins)
## Year
## DaySong 2005 2006
## Absent 22 32
## Present 35 32
To pass this information to prop.test
, we now need x
to have two entries (the number of “successes” in 2005 and 2006) and n
also needs to have two entries (the sample size in each group). We would run:
prop.test(x=c(35,32), n=c(22+35, 32+32),
conf.level=0.95)
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(35, 32) out of c(22 + 35, 32 + 32)
## X-squared = 1.1586, df = 1, p-value = 0.2818
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.07855448 0.30662466
## sample estimates:
## prop 1 prop 2
## 0.6140351 0.5000000
Perhaps more simply, we can use the full dataset to make the same calculation:
prop.test(~DaySong | Year, data=robins,
conf.level=0.95, success='Present')
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: tally(DaySong ~ Year)
## X-squared = 1.1586, df = 1, p-value = 0.2818
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.30662466 0.07855448
## sample estimates:
## prop 1 prop 2
## 0.3859649 0.5000000
How would you compute a 95% CI for the difference in proportion sites with Night song in 2005 vs 2006?
prop.test(~... | ..., data=..., conf.level=...,
success=...)
prop.test(~... | ..., data=robins, conf.level=...,
success='Present')
prop.test(~NightSong | Year, data=robins, conf.level=0.95,
success='Present')
To carry out tests instead of the CIs above, we just have to specify our hypotheses to test, and the include the input alternative
(either ‘two.sided’, the default; or ‘less’ or ‘greater’). The default option (and really the only option we will use) is to test whether the proportions in the two groups are equal.
How would you test the hypotheses below? Let \(p_{N2005}\) be the proportion sites with night song in 2005 and \(p_{N2006}\) be the proportion sites with night song in 2006.
prop.test(~...|..., data=...,
alternative=..., success=...)
prop.test(~NightSong|Year, data=robins,
alternative=..., success=...)
prop.test(~NightSong|Year, data=robins,
success=...)
prop.test(~NightSong|Year, data=robins,
alternative='two.sided')
What do you conclude? Is the proportion night song different between two years? (This might be important if the researchers wanted to pool data from the two years and use it together to answer other questions…)
What if our sample statistic is a mean rather than a proportion? For example, what if we wanted to find a 95% confidence interval for the mean elephant density in Tanzania using the elephant data? We can use the function t.test()
. The inputs will be:
data
(the name of the dataset, as usual)conf.level
(the desired confidence level)So we would run:
t.test(~Density, data=elephants, conf.level=0.95)
## ~Density
## <environment: 0x00000000192a66d0>
##
## One Sample t-test
##
## data: Density
## t = 2.8077, df = 40, p-value = 0.007677
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.568758 3.491488
## sample estimates:
## mean of x
## 2.030123
and then we could say we are 95% confident that the true mean density of elephants at Tanzania sites is between 0.569 and 3.49 elephants per \(km^2\).
How would you find a 90% t interval for the mean noise level (variable NightNoise
) in the robins
dataset?
t.test(~..., data=..., conf.level=...)
t.test(~NightNoise, data=robins, conf.level=...)
t.test(~NightNoise, data=robins, conf.level=0.90)
What if, instead of finding a CI, we wanted to test the null hypothesis that the true mean elephant density was above 1 per square km? If \(\mu\) is the true mean elephant density, we would test \(H_0: \mu = 1\) versus the alternative \(H_A: \mu > 1\). To do this with t.test
, we have to specify the additional inputs:
mu
, the mean under \(H_0\)alternative
(as before, it can be either ‘two.sided’, the default, or it can be ‘less’ or ‘greater’)t.test(~Density, data=elephants, mu=1,
alternative='greater')
## ~Density
## <environment: 0x00000000192a66d0>
##
## One Sample t-test
##
## data: Density
## t = 1.4247, df = 40, p-value = 0.081
## alternative hypothesis: true mean is greater than 1
## 95 percent confidence interval:
## 0.8125925 Inf
## sample estimates:
## mean of x
## 2.030123
With a p-value of 0.081, we have only weak evidence (if any) against the null hypothesis. We would probably fail to reject the null and conclude that it’s reasonable to believe that the true density is 1 (or less) elephants per \(km^2\).
Your turn: How would you use the robins
data to test the hypotheses below? Let \(\mu\) be the true mean noise level during the day. According to the Handbook of Environmental Noise, the typical suburban area has background noise levels of about 50dBA. Does Sheffield match this prediction?
\[H_0: \mu = 50\] \[H_A: \mu \neq 50\]
t.test(~..., data=..., mu=..., alternative=...)
t.test(~..., data=robins, mu=...,
alternative='two.sided')
t.test(~DayNoise, data=robins, mu=50, alternative='two.sided')
What would you conclude?
We can use very similar code to compute a CI for a difference in means for paired data - we just need an appropriate dataset where the data is a list of differences for which we want to compute the mean. We might find a 95% CI for the difference in noise day minus night in the robins
data, using the variable DiurnalDiff
:
t.test(~DiurnalDiff, data=robins, conf.level=0.95)
## ~DiurnalDiff
## <environment: 0x00000000192a66d0>
##
## One Sample t-test
##
## data: DiurnalDiff
## t = 18.776, df = 120, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 7.593923 9.384259
## sample estimates:
## mean of x
## 8.489091
We would find that we were 95% percent confident that the true difference between night and day, over all sites, is between 7.59 and 9.38 dB.
How would you use t.test()
with the same data to do a paired t-test of the hypotheses \(H_0: \mu_D = 0\) vs. \(H_A: \mu_D \neq 0\), where \(\mu_D\) is the true population mean of the day-night noise level differences?
t.test(~..., data=..., mu=...,
alternative=...)
t.test(~DiurnalDiffs, data=robins, mu=0,
alternative=...)
t.test(~DiurnalDiff, data=robins, mu=0,
alternative='two.sided')
What if we want to find a CI for a difference between the means of two groups? The only change from before is that we need to alter the formula to have the form: ** (tilde) MainVariableName (vertical line) GroupVariable, where MainVariableName** is the name of the variable of which you want to compute the mean, and GroupVariable is the name of a categorical variable that divides the data into 2 groups.
For example, we might want to find a 98% CI for the difference in DayNoise
levels between the two years 2005 and 2006. How would you do it?
t.test(~... | ..., data=..., conf.level=...)
t.test(~DayNoise | Year, data= robins,
conf.level=0.98)
We are 95% confident that the true difference in day noise levels by year is between -2.39 and 2.73. Since 0 is in the interval, we know that we would fail to reject the null hypothesis of no difference between years at the \(\alpha=0.02\) level, if we were to do the test. Let’s try…
To convert the example above to a test, we just need to specify the alternative
and mu
, the difference in means according to the null hypothesis.
t.test(~...|..., data=..., mu=...,
alternative=...)
t.test(~DayNoise|Year, data=robins, mu=0,
alternative='two.sided')