The first dataset we will use for examples in the following tutorial is from surveys of elephant populations in Tanzania, originally collected by M. Chase and colleagues,(Reference: Chase et al. 2016, *Continent-wide survey reveals massive decline in African savannah elephants,* PeerJ. 2016 Aug 31;4:e2354.).

The variables in the dataset include:

`Protected`

: Whether the habitat surveyed is protected or not`Density`

: Number of elephants per unit area`HighDensity`

: ‘Yes’ if there is more than 1 elephant per square km, and ‘No’ if there is not.

We will also use a dataset on English robins from a 2007 paper by R. Fuller and colleagues (Reference: Fuller et al. 2007, *Daytime noise predicts nocturnal singing in urban robins*). This dataset includes the variables below, observed at n=121 different urban sites in Sheffield, England:

`DaySong`

Whether robin song was`Present`

or`Absent`

during the day`NightSong`

Whether robin song was`Present`

or`Absent`

during the night`Year`

of data collection (2005 or 2006)`DayNoise`

Noise level recorded during the day`NightNoise`

Noise level recorded during the night`DiurnalDiff`

For each site, the night time noise level minus the day time noise level

Recall that we can use the function `tally()`

to compute sample proportions. For example, if we wanted to find the number of protected and unprotected sites at which elephants were surveyed, we could run:

`tally(~ Protected, data=elephants)`

```
## Protected
## No Yes
## 17 24
```

We could also format the same information as proportions rather than counts:

`tally(~ Protected , data=elephants, format='prop')`

```
## Protected
## No Yes
## 0.4146341 0.5853659
```

Try computing the number of observations in 2005 and 2006 (variable `Year`

) in the `robins`

dataset.

`tally(~..., data=...)`

`tally(~, data=robins)`

`tally(~Year, data=robins)`

What about computing the same table with proportions rather than counts as the output?

`tally(~..., data=..., format=...)`

`tally(~, data=robins, format='prop')`

`tally(~Year, data=robins, format='prop')`

What if we want to use `prop.test()`

to find a confidence interval for the proportion elephant survey sites that are protected, using the summary information we just computed?

`tally(~Protected, data=elephants)`

```
## Protected
## No Yes
## 17 24
```

We need to provide 2 or 3 inputs to `prop.test()`

to obtain a CI:

`x`

, the number of “successes”. Here we consider a “success” a site that*is*protected, so there are 24 such sites`n`

, the sample size. Here it is 17+24 = 41`conf.level`

, the desired confidence level as a proportion (so for 95% confidence, we would use`conf.level=0.95`

). This input is optional; if you leave it out, it will default to 0.95.

For example,

`prop.test(x=24, n=41, conf.level=0.99)`

```
##
## 1-sample proportions test with continuity correction
##
## data: 24 out of 41
## X-squared = 0.87805, df = 1, p-value = 0.3487
## alternative hypothesis: true p is not equal to 0.5
## 99 percent confidence interval:
## 0.3781014 0.7674946
## sample estimates:
## p
## 0.5853659
```

We are 99% confident that the true proportion elephant survey sites that are protected is between 0.378 and 0.767.

Your turn to compute a CI with summary data. Find a **98% CI** for the proportion sites with day song in the `robins`

dataset. First, be sure to check the `tally()`

:

`tally(~DaySong, data=robins)`

```
## DaySong
## Absent Present
## 54 67
```

`prop.test(x=...,n=..., conf.level=0.98)`

`prop.test(x=67,n=54+67, conf.level=0.98)`

To carry out a test instead of computing a CI, we just have to specify our hypotheses. Some elephant management plans have set a goal to keep elephant density low (less than 1 animal per \(km^2\)) to prevent vegetation damage and habitat change. Imagine Tanzania set a goal to keep elephant density low at at least half of sites; is this goal being met?Let \(p\) be the proportion sites with `HighDensity`

TRUE in Tanzania; we want to test:

\[H_0: p = 0.5\] \[H_1: p > 0.5\] The sample data indicate:

`tally(~HighDensity, data=elephants)`

```
## HighDensity
## No Yes
## 22 19
```

To carry out the test with `prop.test`

, we need to specify just two more input arguments:

`p`

, the null hypothesis proportion (here 0.5)`alternative`

, which can be either ‘two.sided’ (the default if you leave this input out), ‘greater’ (one-sided test with \(>\)), or ‘less’ (one-sided test with \(<\)). -Note we can now leave out`conf.level`

as we are trying to obtain a p-value, not a CI.

Our test:

```
prop.test(x=19, n=19+22, p=0.5,
alternative='greater')
```

```
##
## 1-sample proportions test with continuity correction
##
## data: 19 out of 19 + 22
## X-squared = 0.097561, df = 1, p-value = 0.6226
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.3303934 1.0000000
## sample estimates:
## p
## 0.4634146
```

Since the p-value is large, we can not reject \(H_0\) - based on these data it is reasonable to believe that elephant density is high in no more than half of sites.

What would you change to test whether the proportion low-density sites is no less than 0.5? You will still test \(H_0: p=0.5\), but now p is the true proportion low density sites.

Use `prop.test`

to carry out the test with the summary data

`tally(~HighDensity, data=elephants)`

```
## HighDensity
## No Yes
## 22 19
```

```
prop.test(x=22,n=22+19, p=0.5
alternative='less')
```

With the same p-value as before, you will again fail to reject the null!

We can also (perhaps more usefully) compute CIs directly from a dataset using `prop.test`

. For example, to find a 95% CI for the proportion high-density sites, we use the formula interface and the appropriate variable and dataset names. We also need to provide the input `success`

, which gives the value of the variable of interest that we are interested in computing the proportion of.

```
prop.test(~HighDensity, data=elephants,
success='Yes', conf.level=0.95)
```

```
##
## 1-sample proportions test with continuity correction
##
## data: elephants$HighDensity [with success = Yes]
## X-squared = 0.097561, df = 1, p-value = 0.7548
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3096915 0.6238850
## sample estimates:
## p
## 0.4634146
```

We are 95% confident that the true proportion high density sites is between 0.31 and 0.62.

Your turn: compute a 99% CI for the proportion low-density sites.

```
prop.test(~..., data=..., success=...,
conf.level=...)
```

```
prop.test(~..., data=..., success='No',
conf.level=0.99)
```

```
prop.test(~HighDensity, data=elephants, success='No',
conf.level=0.99)
```

How would you state your conclusion in context?

To do a test instead, we simply need to specify `p`

as we did with the summary data – `p`

is the true proportion according to the null hypothesis. How would you use the full `elephants`

dataset to test whether the true proportion high-density sites is no more than 0.5?

```
prop.test(~..., data=elephants, p=...,
alternative=..., success=...)
```

```
prop.test(~HighDensity, data=elephants, p=0.5,
alternative=..., success=...)
```

```
prop.test(~HighDensity, data=elephants, p=0.5,
alternative='greater', success='Yes')
```

It is possible to compute a CI for a difference in two proportions using `prop.test`

with summary information. For example, to find a 95% CI for the difference in proportion Sheffield sites with robin song in 2005 vs 2006, we can examine the data:

`tally(~DaySong | Year, data=robins)`

```
## Year
## DaySong 2005 2006
## Absent 22 32
## Present 35 32
```

To pass this information to `prop.test`

, we now need `x`

to have two entries (the number of “successes” in 2005 and 2006) and `n`

also needs to have two entries (the sample size in each group). We would run:

```
prop.test(x=c(35,32), n=c(22+35, 32+32),
conf.level=0.95)
```

```
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(35, 32) out of c(22 + 35, 32 + 32)
## X-squared = 1.1586, df = 1, p-value = 0.2818
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.07855448 0.30662466
## sample estimates:
## prop 1 prop 2
## 0.6140351 0.5000000
```

Perhaps more simply, we can use the full dataset to make the same calculation:

```
prop.test(~DaySong | Year, data=robins,
conf.level=0.95, success='Present')
```

```
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: tally(DaySong ~ Year)
## X-squared = 1.1586, df = 1, p-value = 0.2818
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.30662466 0.07855448
## sample estimates:
## prop 1 prop 2
## 0.3859649 0.5000000
```

How would you compute a 95% CI for the difference in proportion sites with *Night* song in 2005 vs 2006?

```
prop.test(~... | ..., data=..., conf.level=...,
success=...)
```

```
prop.test(~... | ..., data=robins, conf.level=...,
success='Present')
```

```
prop.test(~NightSong | Year, data=robins, conf.level=0.95,
success='Present')
```

To carry out tests instead of the CIs above, we just have to specify our hypotheses to test, and the include the input `alternative`

(either ‘two.sided’, the default; or ‘less’ or ‘greater’). The default option (and really the only option we will use) is to test whether the proportions in the two groups are equal.

How would you test the hypotheses below? Let \(p_{N2005}\) be the proportion sites with night song in 2005 and \(p_{N2006}\) be the proportion sites with night song in 2006.

```
prop.test(~...|..., data=...,
alternative=..., success=...)
```

```
prop.test(~NightSong|Year, data=robins,
alternative=..., success=...)
```

```
prop.test(~NightSong|Year, data=robins,
success=...)
```

```
prop.test(~NightSong|Year, data=robins,
alternative='two.sided')
```

What do you conclude? Is the proportion night song different between two years? (This might be important if the researchers wanted to pool data from the two years and use it together to answer other questions…)

What if our sample statistic is a mean rather than a proportion? For example, what if we wanted to find a 95% confidence interval for the mean elephant density in Tanzania using the elephant data? We can use the function `t.test()`

. The inputs will be:

- a formula (giving the name of the variable for which we want to compute the mean, in the form (tilde) VariableName)
`data`

(the name of the dataset, as usual)`conf.level`

(the desired confidence level)

So we would run:

`t.test(~Density, data=elephants, conf.level=0.95)`

```
## ~Density
## <environment: 0x00000000192a66d0>
```

```
##
## One Sample t-test
##
## data: Density
## t = 2.8077, df = 40, p-value = 0.007677
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.568758 3.491488
## sample estimates:
## mean of x
## 2.030123
```

and then we could say we are 95% confident that the true mean density of elephants at Tanzania sites is between 0.569 and 3.49 elephants per \(km^2\).

How would you find a 90% t interval for the mean noise level (variable `NightNoise`

) in the `robins`

dataset?

`t.test(~..., data=..., conf.level=...)`

`t.test(~NightNoise, data=robins, conf.level=...)`

`t.test(~NightNoise, data=robins, conf.level=0.90)`

What if, instead of finding a CI, we wanted to test the null hypothesis that the true mean elephant density was above 1 per square km? If \(\mu\) is the true mean elephant density, we would test \(H_0: \mu = 1\) versus the alternative \(H_A: \mu > 1\). To do this with `t.test`

, we have to specify the additional inputs:

`mu`

, the mean under \(H_0\)`alternative`

(as before, it can be either ‘two.sided’, the default, or it can be ‘less’ or ‘greater’)

```
t.test(~Density, data=elephants, mu=1,
alternative='greater')
```

```
## ~Density
## <environment: 0x00000000192a66d0>
```

```
##
## One Sample t-test
##
## data: Density
## t = 1.4247, df = 40, p-value = 0.081
## alternative hypothesis: true mean is greater than 1
## 95 percent confidence interval:
## 0.8125925 Inf
## sample estimates:
## mean of x
## 2.030123
```

With a p-value of 0.081, we have only weak evidence (if any) against the null hypothesis. We would probably fail to reject the null and conclude that it’s reasonable to believe that the true density is 1 (or less) elephants per \(km^2\).

Your turn: How would you use the `robins`

data to test the hypotheses below? Let \(\mu\) be the true mean noise level during the day. According to the *Handbook of Environmental Noise*, the typical suburban area has background noise levels of about 50dBA. Does Sheffield match this prediction?

\[H_0: \mu = 50\] \[H_A: \mu \neq 50\]

`t.test(~..., data=..., mu=..., alternative=...)`

```
t.test(~..., data=robins, mu=...,
alternative='two.sided')
```

`t.test(~DayNoise, data=robins, mu=50, alternative='two.sided')`

What would you conclude?

We can use very similar code to compute a CI for a difference in means for paired data - we just need an appropriate dataset where the data is a list of differences for which we want to compute the mean. We might find a 95% CI for the difference in noise day minus night in the `robins`

data, using the variable `DiurnalDiff`

:

`t.test(~DiurnalDiff, data=robins, conf.level=0.95)`

```
## ~DiurnalDiff
## <environment: 0x00000000192a66d0>
```

```
##
## One Sample t-test
##
## data: DiurnalDiff
## t = 18.776, df = 120, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 7.593923 9.384259
## sample estimates:
## mean of x
## 8.489091
```

We would find that we were 95% percent confident that the true difference between night and day, over all sites, is between 7.59 and 9.38 dB.

How would you use `t.test()`

with the same data to do a paired t-test of the hypotheses \(H_0: \mu_D = 0\) vs. \(H_A: \mu_D \neq 0\), where \(\mu_D\) is the true population mean of the day-night noise level differences?

```
t.test(~..., data=..., mu=...,
alternative=...)
```

```
t.test(~DiurnalDiffs, data=robins, mu=0,
alternative=...)
```

```
t.test(~DiurnalDiff, data=robins, mu=0,
alternative='two.sided')
```

What if we want to find a CI for a difference between the means of two groups? The only change from before is that we need to alter the formula to have the form: ** (tilde) MainVariableName (vertical line) GroupVariable**, where **MainVariableName** is the name of the variable of which you want to compute the mean, and **GroupVariable** is the name of a categorical variable that divides the data into 2 groups.

For example, we might want to find a 98% CI for the difference in `DayNoise`

levels between the two years 2005 and 2006. How would you do it?

`t.test(~... | ..., data=..., conf.level=...)`

```
t.test(~DayNoise | Year, data= robins,
conf.level=0.98)
```

We are 95% confident that the true difference in day noise levels by year is between -2.39 and 2.73. Since 0 is in the interval, we know that we would fail to reject the null hypothesis of no difference between years at the \(\alpha=0.02\) level, if we were to do the test. Let’s try…

To convert the example above to a test, we just need to specify the `alternative`

and `mu`

, the difference in means according to the null hypothesis.

```
t.test(~...|..., data=..., mu=...,
alternative=...)
```

```
t.test(~DayNoise|Year, data=robins, mu=0,
alternative='two.sided')
```