manyCors <- do(15000) * cor(mpg ~ wt, data=resample(mtcars))
head(manyCors)
## cor
## 1 -0.8217072
## 2 -0.8924989
## 3 -0.8701302
## 4 -0.8225139
## 5 -0.8876192
## 6 -0.9206085
To get the standard error, we find the standard deviation of the many bootstrap statistics:
mySE = sd(~cor, data=manyCors); mySE
## [1] 0.03392706
myCor = cor(mpg ~ wt, data=mtcars); myCor # gives original sample correlation
## [1] -0.8676594
lower = myCor - 2*mySE; lower
## [1] -0.9355135
upper = myCor + 2*mySE; upper
## [1] -0.7998053
So, the 95% confidence interval we get using the usual approach of \[ (\mbox{point estimate}) \pm 2\cdot(\mbox{SE}) \qquad\mbox{is}\qquad [-0.936, -0.800]. \]
quantile(manyCors$cor, c(0.025, 0.975))
## 2.5% 97.5%
## -0.9258880 -0.7923544
That is, this 95% bootstrap percentile CI is \([-0.926, -0.792]\).
gf_histogram(~cor, data=manyCors, color="black", fill=~(cor > -0.9259 & cor < -0.7924))
(c) To get an 80% bootstrap percentile CI, we should find the 10th and 90th percentiles.
quantile(manyCors$cor, c(.1, .9))
## 10% 90%
## -0.9097541 -0.8262325
The 80% bootstrap percentile CI is \([-0.910, -0.826]\).
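The pattern generalizes: for confidence level \(C\), the bootstrap percentile CI uses the \((1-C)/2\)- and \(1-(1-C)/2\)-quantiles. A minimal sketch, assuming the manyCors data frame from above (with C = 0.80 this reproduces the 10th and 90th percentiles):

```r
# Percentile CI bounds for an arbitrary confidence level C (sketch)
C <- 0.80
alpha <- (1 - C) / 2                        # tail area on each side; here 0.10
quantile(manyCors$cor, c(alpha, 1 - alpha))
```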
femDat = subset(ExerciseHours, Gender=="F")
maleDat = subset(ExerciseHours, Gender=="M")
manyDiffs = do(15000) * (mean(~Exercise, data=resample(maleDat)) - mean(~Exercise, data=resample(femDat)))
head(manyDiffs)
## result
## 1 7.833333
## 2 4.016667
## 3 3.616667
## 4 1.933333
## 5 -3.800000
## 6 1.483333
To get a 90% bootstrap percentile CI means we need to find the \(0.05\)- and \(0.95\)-quantiles.
quantile(manyDiffs$result, c(0.05, 0.95))
## 5% 95%
## -0.8833333 6.7833333
So, the interval is \([-0.88, 6.78]\), meaning we think the true difference \(\mu_M - \mu_F\) in exercise hours lies somewhere between \(-0.88\) (men exercising \(0.88\) hours less than women, on average) and \(6.78\) (men exercising \(6.78\) hours more).
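For comparison, the usual \((\mbox{point estimate}) \pm 2\cdot(\mbox{SE})\) interval can be computed just as in the correlation example. A sketch, assuming maleDat, femDat, and manyDiffs as defined above:

```r
# Usual +/- 2*SE interval for the difference in means (sketch)
ptEst <- mean(~Exercise, data=maleDat) - mean(~Exercise, data=femDat)  # point estimate
SE <- sd(~result, data=manyDiffs)              # bootstrap standard error
c(lower = ptEst - 2*SE, upper = ptEst + 2*SE)
```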
The sample, which we will take to be a random sample (though that is debatable), is found in the data frame StudentSurvey from the Lock5withR package. I separate the female cases from the male cases, creating two smaller data frames (together, they comprise all the data found in StudentSurvey).
femDat <- subset(StudentSurvey, Gender=="F")
maleDat <- subset(StudentSurvey, Gender=="M")
head(femDat) # displays first few rows from female data
## Year Gender Smoke Award HigherSAT Exercise TV Height Weight
## 2 Sophomore F Yes Academy Math 4 7 66 120
## 5 Sophomore F No Nobel Verbal 3 3 65 150
## 6 Sophomore F No Nobel Verbal 5 4 65 114
## 7 FirstYear F No Olympic Math 10 10 66 128
## 9 Junior F No Nobel Verbal 3 6 61 NA
## 10 FirstYear F No Nobel Math 12 1 60 115
## Siblings BirthOrder VerbalSAT MathSAT SAT GPA Pulse Piercings Sex
## 2 2 2 520 630 1150 2.50 66 3 Female
## 5 1 1 720 450 1170 2.70 40 6 Female
## 6 2 2 600 550 1150 3.20 80 4 Female
## 7 1 1 640 680 1320 2.77 94 8 Female
## 9 2 2 550 550 1100 2.80 60 7 Female
## 10 7 8 670 700 1370 3.70 94 2 Female
There were 16 out of 169 females who said they smoke. Several commands that may be used to learn this sort of information appear below without their corresponding output. Type them out yourself, studying what each command asks for (the command name, and the information it provides), as well as what it produces.
tally(Smoke ~ Gender, data=StudentSurvey)
tally(Smoke ~ Gender, data=StudentSurvey, format="proportion")
tally(~Smoke, data=femDat)
tally(~Smoke, data=femDat, format="proportion")
prop(~Smoke, data=femDat, success="Yes")
Using the final version of these, and adapting it to the male data, we can compute the difference in sample proportions \(\hat p_F - \hat p_M\), our point estimate for the true (population) difference \(p_F - p_M\) in proportions of college-age female and male smokers.
prop(~Smoke, data=femDat, success="Yes") - prop(~Smoke, data=maleDat, success="Yes")
## prop_Yes
## -0.04522182
We already have female and male data in separate data frames. We will use the resample() command, employing these data frames as bags out of which to make our draws. By itself
resample(femDat)
draws 169 cases out of the femDat data frame with replacement, while
prop(~Smoke, data=resample(femDat), success="Yes")
computes the proportion of females in 169 such draws that have Yes in the Smoke column.
We want to compute separate proportions for similar draws taken from both bags, and subtract them, all to get one bootstrap statistic.
prop(~Smoke, data=resample(femDat), success="Yes") - prop(~Smoke, data=resample(maleDat), success="Yes")
To get a bootstrap distribution, then, requires repeating this process many times. I do so, storing the result in the nondescript name \(x\):
x <- do(5000) * (prop(~Smoke, data=resample(femDat), success="Yes") - prop(~Smoke, data=resample(maleDat), success="Yes"))
head(x)
## prop_Yes
## 1 -0.05040316
## 2 -0.06518073
## 3 -0.06444492
## 4 -0.01266211
## 5 -0.03633075
## 6 -0.10221664
A histogram of the bootstrap distribution looks relatively bell-shaped and symmetric.
gf_histogram(~prop_Yes, data=x, color="black")
The approximate standard error is the standard deviation of these numbers just plotted:
sd(~prop_Yes, data=x)
## [1] 0.03387151
Since the point estimate was \(\hat p_F - \hat p_M = -0.045\), the 95% bootstrap confidence interval is \[ -0.045 \pm 2(0.034), \qquad\mbox{or}\qquad [-0.113, 0.023]. \] The 95% bootstrap percentile CI has lower/upper bounds that come from the \(0.025\)- and \(0.975\)-quantiles:
quantile(x$prop_Yes, c(.025,.975))
## 2.5% 97.5%
## -0.11034123 0.02210504
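These \(\pm 2\cdot(\mbox{SE})\) bounds can also be computed explicitly, mirroring the lower/upper computation from the correlation example. A sketch, assuming femDat, maleDat, and x as defined above:

```r
ptEst <- prop(~Smoke, data=femDat, success="Yes") -
  prop(~Smoke, data=maleDat, success="Yes")    # point estimate, about -0.045
SE <- sd(~prop_Yes, data=x)                    # bootstrap standard error
c(lower = ptEst - 2*SE, upper = ptEst + 2*SE)
```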
Command summary: The whole thing, start to finish, amounts to these commands:
femDat = subset(StudentSurvey, Gender=="F")
maleDat = subset(StudentSurvey, Gender=="M")
prop(~Smoke, data=femDat, success="Yes") - prop(~Smoke, data=maleDat, success="Yes") # to get point estimate
x <- do(5000) * (prop(~Smoke, data=resample(femDat), success="Yes") - prop(~Smoke, data=resample(maleDat), success="Yes"))
sd(~prop_Yes, data=x) # gives the standard error
quantile(x$prop_Yes, c(.025,.975)) # gives the quantiles
The Lock5 data frame containing this data is called FloridaLakes.
head(FloridaLakes)
## ID Lake Alkalinity pH Calcium Chlorophyll AvgMercury NumSamples
## 1 1 Alligator 5.9 6.1 3.0 0.7 1.23 5
## 2 2 Annie 3.5 5.1 1.9 3.2 1.33 7
## 3 3 Apopka 116.0 9.1 44.1 128.3 0.04 6
## 4 4 Blue Cypress 39.4 6.9 16.4 3.5 0.44 12
## 5 5 Brick 2.5 4.6 2.9 1.8 1.20 12
## 6 6 Bryant 19.6 7.3 4.5 44.1 0.27 14
## MinMercury MaxMercury ThreeYrStdMercury AgeData
## 1 0.85 1.43 1.53 1
## 2 0.92 1.90 1.33 0
## 3 0.04 0.06 0.04 0
## 4 0.13 0.84 0.44 0
## 5 0.69 1.50 1.33 1
## 6 0.04 0.48 0.25 1
We have seen that a way to calculate the slope of the least-squares regression line in RStudio is using the lm() command. Actually, that command calculates a bunch of results one might want to know, though to see most of them, you have to know how to ask. We look at the names of things it returns:
linModResults <- lm(AvgMercury ~ pH, data=FloridaLakes)
names(linModResults)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
One of the named results, the one printed to the screen if we don’t specifically ask for anything else, is coefficients.
linModResults$coefficients
## (Intercept) pH
## 1.5309187 -0.1523009
We only want the slope, so we ask for the 2nd coefficient:
linModResults$coefficients[2]
## pH
## -0.1523009
It’s possible to put the request for the slope from lm() all in one command, which is what I will do further down when bootstrapping:
lm(AvgMercury ~ pH, data=FloridaLakes)$coefficients[2]
## pH
## -0.1523009
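As an aside, base R also offers the accessor coef(), which returns the same named vector of coefficients; indexing it by name rather than position makes the intent clearer. This is an equivalent alternative, not the form used below:

```r
# Same slope, extracted with coef() and indexed by name
coef(lm(AvgMercury ~ pH, data=FloridaLakes))["pH"]
```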
This command produced the slope from our sample of lakes, the point estimate for the true slope \(\beta_1\). The only change required to produce a single bootstrapped slope is to add the resample() command:
lm( AvgMercury ~ pH, data=resample(FloridaLakes) )$coefficients[2]
## pH
## -0.1739011
Thus, to obtain a bootstrap distribution, we just need to repeat this command many times:
b1Many <- do(5000) * lm( AvgMercury ~ pH, data=resample(FloridaLakes) )$coefficients[2]
head(b1Many)
## pH
## 1 -0.1508372
## 2 -0.1479739
## 3 -0.1281650
## 4 -0.1499160
## 5 -0.1222065
## 6 -0.1395788
We can view a histogram, calculate the standard error, or calculate quantiles, all by referring to the numbers in the pH column of the data frame b1Many:
gf_histogram(~pH, data=b1Many, color="black")
sd(~pH, data=b1Many)
## [1] 0.02736207
quantile(b1Many$pH, c(0.025, 0.975))
## 2.5% 97.5%
## -0.20961430 -0.09988689
Owing to a bit of negative skew (or left-skewness) in the bootstrap distribution, the 95% confidence interval \[ -0.152 \pm 2(0.027), \qquad\mbox{or}\qquad [-0.206, -0.098], \] does not match the lower/upper bounds of the 95% bootstrap percentile CI \([-0.210,-0.100]\) as closely as it would if the bootstrap distribution were truly normal.
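For completeness, the \(\pm 2\cdot(\mbox{SE})\) bounds quoted above can be computed directly (a sketch assuming b1Many as defined above):

```r
b1 <- lm(AvgMercury ~ pH, data=FloridaLakes)$coefficients[2]  # point estimate
SE <- sd(~pH, data=b1Many)                                    # bootstrap standard error
c(lower = b1 - 2*SE, upper = b1 + 2*SE)
```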
Command summary: The whole thing, start to finish, amounts to these commands:
lm(AvgMercury ~ pH, data=FloridaLakes)$coefficients[2] # gives b1, the point estimate
b1Many <- do(5000) * lm( AvgMercury ~ pH, data=resample(FloridaLakes) )$coefficients[2]
sd(~pH, data=b1Many) # gives the standard error
quantile(b1Many$pH, c(0.025, 0.975)) # gives the quantiles
File creation date: 2018-10-01
Author: Thomas Scofield