1. Problem.
    1. We would need a random sample of cars. For each car in the sample, we need to measure its weight and its gas mileage (perhaps fill the tank, drive it for several hours at a constant speed, and divide the distance traveled by the number of gallons of gas consumed).
    2. This would be a sample correlation, denoted by \(r\).
    3. The random sample of cars has some sample size, call it \(n\). To construct an approximate sampling distribution, one would draw not just one random sample of cars of size \(n\), but a whole lot of them, using each to produce a correlation \(r\).
    4. Having drawn one sample of size \(n\) of cars, one could put a slip of paper in a bag, one for each car in the sample. A slip of paper would have that car's weight and mpg written on it. To get one bootstrap sample, one would draw, with replacement, \(n\) slips from the bag. The corresponding bootstrap statistic would be computed as the correlation of gas mileage vs. weight for the cars selected in the bootstrap sample. One would repeat the construction of a bootstrap statistic very many times. (A code sketch of a single such draw appears at the end of this problem.)
    5. We generate a bootstrap distribution via the command
manyCors <- do(15000) * cor(mpg ~ wt, data=resample(mtcars))
head(manyCors)
##          cor
## 1 -0.8217072
## 2 -0.8924989
## 3 -0.8701302
## 4 -0.8225139
## 5 -0.8876192
## 6 -0.9206085

To get standard error, we find the standard deviation of the numerous bootstrap statistics:

mySE = sd(~cor, data=manyCors); mySE
## [1] 0.03392706
myCor = cor(mpg ~ wt, data=mtcars); myCor     # gives original sample correlation
## [1] -0.8676594
lower = myCor - 2*mySE; lower
## [1] -0.9355135
upper = myCor + 2*mySE; upper
## [1] -0.7998053

So, the 95% confidence interval we get using the usual approach of \[ (\mbox{point estimate}) \pm 2\cdot(\mbox{SE}) \qquad\mbox{is}\qquad [-0.936, -0.800]. \]
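
To connect the slips-in-a-bag picture of item 4 with code: mtcars has 32 cars, so resample(mtcars) draws 32 rows with replacement, and the correlation computed from those rows is a single bootstrap statistic. A minimal sketch (assuming the mosaic package is loaded, as throughout these notes):

oneBag <- resample(mtcars)     # 32 slips drawn, with replacement, from the bag
cor(mpg ~ wt, data=oneBag)     # one bootstrap correlation

The do(15000) command above simply repeats this draw-and-compute step 15000 times.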

2. Problem.
    1. It does not appear that there is the same amount of pink in the left-hand tail as there is in the right-hand tail. If the two bounds (lower and upper) of the confidence interval were at the \(0.025\)- and \(0.975\)-quantiles, then there would be the same amount of pink region at both ends.
    2. We need quantiles for the same collection of numbers that are plotted in the histogram. These numbers, serving as lower and upper bounds of a 95% bootstrap percentile CI, will (generally) result in the same areas in both tails, unlike the construction in Problem 1. The lower/upper bounds would be pretty much the same in both methods if the bootstrap distribution were approximately normal.
quantile(manyCors$cor, c(0.025, 0.975))
##       2.5%      97.5% 
## -0.9258880 -0.7923544

That is, this 95% bootstrap percentile CI is \([-0.926, -0.792]\).

gf_histogram(~cor, data=manyCors, color="black", fill=~(cor > -0.9259 & cor < -0.7924))    # shade inside the 95% percentile CI

    3. To get an 80% bootstrap percentile CI, we should find the 10th and 90th percentiles.

quantile(manyCors$cor, c(.1, .9))
##        10%        90% 
## -0.9097541 -0.8262325

The 80% bootstrap percentile CI is \([-0.910, -0.826]\).
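
One could shade this narrower interval on the histogram the same way as before; a sketch, copying the bounds from the quantile output above:

gf_histogram(~cor, data=manyCors, color="black", fill=~(cor > -0.9098 & cor < -0.8262))    # shade inside the 80% percentile CI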

3. Problem.
    1. We need a random sample of college students, with some number being female and some being male. Gender is the explanatory variable, and cannot be assigned, so it is an observational study. For each case, the response variable is the number of exercise hours per week.
    2. The means are sample means, so the difference in means is labeled \(\overline x_M - \overline x_F\), a sample statistic.
    3. To approximate the sampling distribution of \(\overline x_M - \overline x_F\), we would have to gather many random samples of size 50, each having the same number of women as the original sample, with men making up the rest. Each such sample could be used to compute the separate female and male sample means, which we would then subtract: \(\overline x_M - \overline x_F\).
    4. To get a bootstrap distribution, it is as if we use two bags. Into one bag, we put slips representing exercise hours for the women in the study; in the other there are slips with the exercise hours of the men. We draw the right number of women's slips, with replacement, compute the value of \(\overline x_F\), then complete the selection of 50 slips by drawing the rest from the men's bag and calculating \(\overline x_M\). Our bootstrap statistic is the difference \(\overline x_M - \overline x_F\). Finally, by repeating the 50 draws over and over many times, we obtain a bootstrap distribution.
    5. The first few lines are effectively putting the female and male numbers in separate bags.
femDat = subset(ExerciseHours, Gender=="F")
maleDat = subset(ExerciseHours, Gender=="M")
manyDiffs = do(15000) * (mean(~Exercise, data=resample(maleDat)) - mean(~Exercise, data=resample(femDat)))
head(manyDiffs)
##      result
## 1  7.833333
## 2  4.016667
## 3  3.616667
## 4  1.933333
## 5 -3.800000
## 6  1.483333

To get a 90% bootstrap percentile CI, we need to find the \(0.05\)- and \(0.95\)-quantiles.

quantile(manyDiffs$result, c(0.05, 0.95))
##         5%        95% 
## -0.8833333  6.7833333

So, the interval is \([-0.88, 6.78]\), meaning we think the true difference \(\mu_M - \mu_F\) in exercise hours lies somewhere between \(-0.88\) (men exercising \(0.88\) hours less per week than women) and \(6.78\) (men exercising \(6.78\) hours more).
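
Though the problem does not ask for it, a quick histogram of the bootstrap differences (drawn with gf_histogram, as in the other examples here) provides a check on the shape of the bootstrap distribution:

gf_histogram(~result, data=manyDiffs, color="black")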

Examples from Monday, Oct. 1

Difference in smoking rates between college-age females and males

The sample, which we will take to be a random sample (though that is debatable), is found in the data frame StudentSurvey from the Lock5withR package. I separate out the female cases from the male cases, creating two smaller data frames (together, they comprise all the data found in StudentSurvey).

femDat <- subset(StudentSurvey, Gender=="F")
maleDat <- subset(StudentSurvey, Gender=="M")
head(femDat)    # displays first few rows from female data
##         Year Gender Smoke   Award HigherSAT Exercise TV Height Weight
## 2  Sophomore      F   Yes Academy      Math        4  7     66    120
## 5  Sophomore      F    No   Nobel    Verbal        3  3     65    150
## 6  Sophomore      F    No   Nobel    Verbal        5  4     65    114
## 7  FirstYear      F    No Olympic      Math       10 10     66    128
## 9     Junior      F    No   Nobel    Verbal        3  6     61     NA
## 10 FirstYear      F    No   Nobel      Math       12  1     60    115
##    Siblings BirthOrder VerbalSAT MathSAT  SAT  GPA Pulse Piercings    Sex
## 2         2          2       520     630 1150 2.50    66         3 Female
## 5         1          1       720     450 1170 2.70    40         6 Female
## 6         2          2       600     550 1150 3.20    80         4 Female
## 7         1          1       640     680 1320 2.77    94         8 Female
## 9         2          2       550     550 1100 2.80    60         7 Female
## 10        7          8       670     700 1370 3.70    94         2 Female

There were 16 out of 169 females who said they smoke. Several commands that may be used to learn this sort of information appear below, without their corresponding output. Type them out yourself, studying just what each command says (the command name, and the information it provides), as well as what it produces.

tally(Smoke ~ Gender, data=StudentSurvey)
tally(Smoke ~ Gender, data=StudentSurvey, format="proportion")
tally(~Smoke, data=femDat)
tally(~Smoke, data=femDat, format="proportion")
prop(~Smoke, data=femDat, success="Yes")

Using the final version of these, and adapting it to the male data, we can compute the difference in sample proportions \(\hat p_F - \hat p_M\), our point estimate for the true (population) difference \(p_F - p_M\) in proportions of college-age female and male smokers.

prop(~Smoke, data=femDat, success="Yes") - prop(~Smoke, data=maleDat, success="Yes")
##    prop_Yes 
## -0.04522182

We already have female and male data in separate data frames. We will use the resample() command, employing these data frames as “bags” out of which to make our draws. By itself

resample(femDat)

draws 169 cases out of the femDat data frame with replacement, while

prop(~Smoke, data=resample(femDat), success="Yes")

computes the proportion of females in 169 such draws that have “Yes” in the Smoke column.

We want to compute separate proportions for similar draws taken from both bags, and subtract them, all to get one bootstrap statistic.

prop(~Smoke, data=resample(femDat), success="Yes") - prop(~Smoke, data=resample(maleDat), success="Yes")

To get a bootstrap distribution, then, requires repeating this process many times. I do so, storing the result in the nondescript name \(x\):

x <- do(5000) * (prop(~Smoke, data=resample(femDat), success="Yes") - prop(~Smoke, data=resample(maleDat), success="Yes"))
head(x)
##      prop_Yes
## 1 -0.05040316
## 2 -0.06518073
## 3 -0.06444492
## 4 -0.01266211
## 5 -0.03633075
## 6 -0.10221664

A histogram of these bootstrap statistics looks relatively bell-shaped and symmetric.

gf_histogram(~prop_Yes, data=x, color="black")

The approximate standard error is the standard deviation of these numbers just plotted:

sd(~prop_Yes, data=x)
## [1] 0.03387151
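
Mirroring the arithmetic of Problem 1, one can compute the endpoints of the \(\pm 2\cdot\mbox{SE}\) interval directly; a sketch, where myDiff and mySE are simply names chosen here:

myDiff <- prop(~Smoke, data=femDat, success="Yes") - prop(~Smoke, data=maleDat, success="Yes")
mySE <- sd(~prop_Yes, data=x)
myDiff - 2*mySE    # lower endpoint
myDiff + 2*mySE    # upper endpoint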

Since the point estimate was \(\hat p_F - \hat p_M = -0.045\), the 95% bootstrap confidence interval is \[ -0.045 \pm 2(0.034), \qquad\mbox{or}\qquad [-0.113, 0.023]. \] The 95% bootstrap percentile CI has lower/upper bounds that come from the \(0.025\)- and \(0.975\)-quantiles:

quantile(x$prop_Yes, c(.025,.975))
##        2.5%       97.5% 
## -0.11034123  0.02210504

Command summary: The whole thing, start to finish, amounts to these commands:

femDat <- subset(StudentSurvey, Gender=="F")
maleDat <- subset(StudentSurvey, Gender=="M")
prop(~Smoke, data=femDat, success="Yes") - prop(~Smoke, data=maleDat, success="Yes")   # to get point estimate
x <- do(5000) * (prop(~Smoke, data=resample(femDat), success="Yes") - prop(~Smoke, data=resample(maleDat), success="Yes"))
sd(~prop_Yes, data=x)    # gives the standard error
quantile(x$prop_Yes, c(.025,.975))    # gives the quantiles

True slope of regression line between pH and mercury levels in Florida lakes

The Lock5 data frame containing this data is called FloridaLakes.

head(FloridaLakes)
##   ID         Lake Alkalinity  pH Calcium Chlorophyll AvgMercury NumSamples
## 1  1    Alligator        5.9 6.1     3.0         0.7       1.23          5
## 2  2        Annie        3.5 5.1     1.9         3.2       1.33          7
## 3  3       Apopka      116.0 9.1    44.1       128.3       0.04          6
## 4  4 Blue Cypress       39.4 6.9    16.4         3.5       0.44         12
## 5  5        Brick        2.5 4.6     2.9         1.8       1.20         12
## 6  6       Bryant       19.6 7.3     4.5        44.1       0.27         14
##   MinMercury MaxMercury ThreeYrStdMercury AgeData
## 1       0.85       1.43              1.53       1
## 2       0.92       1.90              1.33       0
## 3       0.04       0.06              0.04       0
## 4       0.13       0.84              0.44       0
## 5       0.69       1.50              1.33       1
## 6       0.04       0.48              0.25       1

We have seen that a way to calculate the slope of the least-squares regression line in R is with the lm() command. Actually, that command calculates a bunch of results one might want to know, though to see most of them, you have to know how to ask. We look at the names of the things it returns:

linModResults <- lm(AvgMercury ~ pH, data=FloridaLakes)
names(linModResults)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

One of the named results—the one that is printed to the screen if we don’t specifically ask for anything else—is coefficients.

linModResults$coefficients
## (Intercept)          pH 
##   1.5309187  -0.1523009

We only want the slope, so we ask for the 2nd coefficient:

linModResults$coefficients[2]
##         pH 
## -0.1523009
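
As an aside, base R also provides the coef() extractor function, which pulls out the same vector of coefficients and reads a little more cleanly; either form works:

coef(linModResults)[2]    # same slope as linModResults$coefficients[2]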

It’s possible to put the request for slope from lm( ) all in one command, which is what I will do further down when bootstrapping:

lm(AvgMercury ~ pH, data=FloridaLakes)$coefficients[2]
##         pH 
## -0.1523009

This command produced the slope from our sample of lakes, the point estimate for the true slope \(\beta_1\). The only change required to produce a single bootstrapped slope is to add the resample() command:

lm( AvgMercury ~ pH, data=resample(FloridaLakes) )$coefficients[2]
##         pH 
## -0.1739011

Thus, to obtain a bootstrap distribution, we just need to repeat this command many times:

b1Many <- do(5000) * lm( AvgMercury ~ pH, data=resample(FloridaLakes) )$coefficients[2]
head(b1Many)
##           pH
## 1 -0.1508372
## 2 -0.1479739
## 3 -0.1281650
## 4 -0.1499160
## 5 -0.1222065
## 6 -0.1395788

We can view a histogram, calculate the standard error, or calculate quantiles, all by referring to the numbers in the pH column of the data frame b1Many:

gf_histogram(~pH, data=b1Many, color="black")

sd(~pH, data=b1Many)
## [1] 0.02736207
quantile(b1Many$pH, c(0.025, 0.975))
##        2.5%       97.5% 
## -0.20961430 -0.09988689

Owing to a bit of negative skew (or left-skewness) in the bootstrap distribution, the 95% confidence interval \[ -0.152 \pm 2(0.027), \qquad\mbox{or}\qquad [-0.206, -0.098], \] does not match the lower/upper bounds of the 95% bootstrap percentile CI \([-0.210,-0.100]\) as closely as it would if the bootstrap distribution were truly normal.
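
For the record, the endpoint arithmetic can be carried out with unrounded values; a sketch, where b1 and seB1 are simply names chosen here:

b1 <- lm(AvgMercury ~ pH, data=FloridaLakes)$coefficients[2]
seB1 <- sd(~pH, data=b1Many)
b1 - 2*seB1    # lower endpoint of the plus-or-minus interval
b1 + 2*seB1    # upper endpoint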

Command summary: The whole thing, start to finish, amounts to these commands:

lm(AvgMercury ~ pH, data=FloridaLakes)$coefficients[2]    # gives b1, the point estimate
b1Many <- do(5000) * lm( AvgMercury ~ pH, data=resample(FloridaLakes) )$coefficients[2]
sd(~pH, data=b1Many)    # gives the standard error
quantile(b1Many$pH, c(0.025, 0.975))    # gives the quantiles

File creation date: 2018-10-01

Author: Thomas Scofield