Outline for
Introduction to the Practice of Statistics
by David S. Moore and George P. McCabe
- Chapter 1: 1.1, 1.2, 1.3
- Chapter 2: 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7
- Chapter 3: 3.1, 3.2, 3.3, 3.4
- Chapter 4: 4.1, 4.2, 4.3, 4.4, 4.5
- Chapter 5: 5.1, 5.2, 5.3
- Chapter 6: 6.1, 6.2, 6.3, 6.4
- Chapter 7: 7.1, 7.2, 7.3
- Chapter 8: 8.1, 8.2
- Chapter 9
- Chapter 12
- Chapter 13
Chapter 1: What is statistics?
Section 1.1
- Important questions of statistics
- What questions are relevant to the data?
- Who are the individuals the data describes?
- What, precisely, are the variables?
- How was the data acquired?
- How can the information in a single variable
be described succinctly?
- Are there relationships between variables?
- Exploring (single) variables
- Use of graphs: charts, stem plots, histograms, time plots (Q: Can each of these be used equally well for all types of variables?)
- Features of note: center, spread, deviations, symmetry,
number of modes, outliers, seasonal variation, trends
(Q: Do each of the aforementioned
concepts apply to all types of variables?)
Terms to Know: statistics, individuals, cases,
variable (categorical and quantitative), frequency and
relative frequency, distribution, bar graph vs. histogram
(what is the difference?)
Section 1.2
- Numerical summaries of distributions
- Why do we use them?
- What are the drawbacks of their use?
- For what types of distributions are they most effective? (Note: the answer may not be the same for all numerical summaries!)
- Which are resistant?
- Which measures of center and spread are paired together?
- Outliers
- Be able to recognize them (from a graph, see Sect. 1.1; or by the 1.5 × IQR method; a sketch follows this list)
- Propose appropriate (context-specific) ways of dealing with them
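A minimal Python sketch of the 1.5 × IQR rule (the course itself uses Minitab; the data values here are made up):

```python
from statistics import quantiles

def iqr_outliers(data):
    """Flag observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(data, n=4)              # quartiles of the data
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([4.9, 5.1, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 9.9]))  # [9.9]
```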
Use of Technology: Be able to
- Enter data
- Sort data
- Find the mean, median, variance, standard deviation
- Produce box plot (Minitab only)
Terms to Know: mean, median, measure of center/spread,
resistant measure, percentiles/quartiles, IQR, five-number summary,
box plot, linear transformation
Section 1.3
- Density curves
- How can a smooth curve represent a distribution?
- Why is this type of mathematical model
useful?
- Why is it essential that the area under such a
curve be 1?
- Describe the placement of mean, median
and percentiles along such a curve
- What is special about normal (density)
curves? How many such curves are there?
- Normal distributions
- How do you tell if a distribution is well-approximated
as a normal distribution?
- What are some types of data which are typically
normally distributed?
- Standardizing a normal distribution
- Amounts to a linear transformation
- Computing
z-scores and going
from such a standardized score back
to an unstandardized one
- Use of Table A to determine
area under the standard normal curve,
and interpreting the meaning of such areas
- Why standardize?
- Normal probabilities (a sketch follows this list)
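A minimal Python sketch of standardizing and of reading normal areas in place of Table A (the population mean, SD, and observation here are hypothetical):

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5        # hypothetical population mean and SD
x = 68.0                     # an observed value

z = (x - mu) / sigma         # standardize: the z-score
x_back = mu + z * sigma      # unstandardize: back to original units

# Area to the left of z under the standard normal curve
# (the quantity Table A tabulates)
print(z, x_back, round(NormalDist().cdf(z), 4))
```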
Use of Technology: Be able to
- answer questions such as those posed in Examples
1.25-1.27
- produce normal quantile plots for a given set of
data (Minitab only)
- perform density estimation for a given data set
(Minitab only)
Terms to Know: density curve, normal curve,
(standard) normal distribution, standardized value (or
z-score), normal quantile plot (normal probability
plot), granularity
Chapter 2: Looking at Data (Relationships)
- Association between variables
- Describe tendencies, not hard-and-fast rules
- Not same as causation
- Explanatory variable(s)
- Often chosen as a result of how data is to be used
- Even if data suggests association, wrong to assume
changes in explanatory variable cause changes
in response variable
- Guiding principles
- Start with graphical analysis, then add numerical
summaries
- Look for overall patterns and deviations from
those patterns
- When the overall pattern is quite regular, use a compact
mathematical model to describe it
Terms to Know: associated variables,
explanatory/response variables, causation
Section 2.1: Scatterplots
- Scatterplots
- Relationship between two quantitative variables
- Each individual in the study has corresponding point
- If one variable designated as explanatory, put it on
horizontal axis
- Including a categorical variable as a 3rd variable
- Examining scatterplots
- Look for the overall pattern in the graph, and for
striking deviations from that pattern
- Describe the overall pattern by the form, direction
and strength of the relationship
- Keep an eye out for outliers, noticeable deviations
from the overall pattern
- Avoid quick judgment, even with strong overall
pattern, until you consider lurking variables
- Studying relationships between a categorical variable
and a quantitative one
- Use methods of Chapter 1 (back-to-back stemplots,
side-by-side boxplots, etc.)
- Cannot discuss positive/negative relationship except
in those cases where categorical variable has
natural ordering (see Example 8, p. 115)
Terms to Know: scatterplot, overall pattern,
deviation from a pattern, form/direction/strength of a
relationship, outlier, positive/negative association, linear
relationship, cluster, smoothing a scatterplot
Section 2.2: Correlation
- Correlation
- Establishes the strength of a linear relationship
between two quantitative variables
- Properties
- r has same value regardless of which variable
is considered explanatory
- Direction of relationship comes from sign of r
- r has no units, and is unaffected by which
units are used for a variable
- -1 ≤ r ≤ 1
- Will not detect strong nonlinear relationships
between variables (Plot your data!)
- Not resistant to outliers
- Understand the formula as one involving standardized
scores for the two variables
- Determining r values by sight
- Changes in scale do not affect correlation, but
can make our eyes think so
- Practice determining r by sight, or see a scatterplot for a given value of r (a computational sketch follows this list)
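A minimal Python sketch of the formula for r as an average product of standardized scores (made-up data):

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum of z_x * z_y over the data, divided by n - 1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (n - 1)

print(round(correlation([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8]), 3))
```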
Terms to Know: linear relationship,
correlation, strength of a relationship
Section 2.3: Least-Squares Regression
- Least-squares regression line
- Requires two quantitative variables, one designated as
explanatory (x), the other as response (y)
- The line that best fits the data (i.e., of all possible
lines drawn, it's the one that makes the sum of squares
of vertical distances to data points the smallest)
- Calculation of slope, y-intercept from data (p. 141)
- Is dependent upon the units of measurement for
explanatory/response variables
- Prediction
- Regression line is used to predict value of response variable
y at a fixed value of explanatory variable x
- Reliability/accuracy
- Interpolation (prediction at x value falling inside
observed data values) vs. extrapolation (prediction at x
values far from observed values; often inaccurate)
- Interpolated values should be good if strength of fit
is good (i.e., if r² is close to 1; see below)
- Poor results may occur if regression line in one population
is used to make predictions in another population
- Connections between correlation and regression
- Correlation used in calculation of slope for regression line
- r² = (variance of predicted values)/(variance of observed values); i.e., it is the fraction of variation in response values that is explained by least-squares regression of y on x (a sketch follows this list)
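A minimal Python sketch of the slope/intercept formulas (b = r·sy/sx, a = ȳ - b·x̄, as on p. 141; made-up data):

```python
from statistics import mean, stdev

def least_squares(xs, ys):
    """Fit yhat = a + b*x by least squares from summary statistics:
    b = r * s_y / s_x,  a = ybar - b * xbar."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = my - b * mx
    return a, b, r

a, b, r = least_squares([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
print(a, b, r**2)   # a = 0.15, b = 1.95; r**2 = fraction of variation explained
```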
Terms to Know: regression line, slope, intercept,
prediction, square of the correlation
Section 2.4: Cautions about Correlation
and Regression
- Assessing strength of a linear relationship
- Look at residuals
- Difference of observed value and predicted (by
the regression line) value
- Part of the variation in the response variable
left unexplained by the linear association
- Mean of residuals for least-squares regression is always 0 (a sketch follows this list)
- Residual plots
- Scatterplot with unchanged explanatory variable,
but response variable is the residual
- Can support or refute whether overall pattern
of original variables is linear (see discussion
of Figure 2.19 on p. 156)
- Looking beyond regression
- Time plot of residuals is one way the presence of a lurking variable may be detected
- Investigating outliers (both in x and y directions)
- Large studentized residuals help to detect outliers
- Large DFITS help to detect influential observations
- Warnings
- Beware lurking variables
- Do not take associations as causation
- Correlations based on averaged data are likely to be
much stronger than with individual observations
- Successful prediction does not require a cause-and-effect
relationship
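A minimal Python sketch of residuals, reusing the hypothetical fit from the Section 2.3 sketch (a = 0.15, b = 1.95):

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = 0.15, 1.95                             # fit from the earlier sketch

# residual = observed y - predicted yhat; a residual plot graphs these
# against the unchanged explanatory variable x
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals, round(sum(residuals), 10))   # sum is 0 for least squares
```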
Terms to Know: residual, residual plot,
outliers, influential points, restricted-range
Section 2.5: An Application: Exponential Growth and World Oil Production
Section 2.6: Relations in Categorical Data
- Distributions of a two-way table
- Marginal distributions
- Conditional distributions
- At the cell level
- Involves looking at row/column percents (a sketch follows this list)
- Key to discovering nature of relationship between variables
- That some relationship exists can be ascertained using
test of significance of Chapter 9
- Column/row percents can be plotted in histogram
- Simpson's paradox
- Illustrates how aggregate data can hide lurking variables
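A minimal Python sketch of marginal and conditional (column-percent) distributions for a hypothetical two-way table:

```python
# Hypothetical two-way table of counts: rows = response, columns = group
table = {"success": {"A": 30, "B": 45},
         "failure": {"A": 70, "B": 55}}

col_totals = {c: sum(row[c] for row in table.values()) for c in ("A", "B")}
print(col_totals)     # marginal distribution (counts) of the column variable

# Conditional distribution of the response within each column
# (column percents), the key to describing the relationship
for r, row in table.items():
    print(r, {c: round(100 * row[c] / col_totals[c], 1) for c in ("A", "B")})
```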
Terms to Know: counts/frequencies,
percents/relative frequencies, two-way table, row/column variable,
marginal distributions, conditional distributions, three-way
table, aggregate data
Section 2.7: The Question of Causation
- Explanations for associations
- Direct causation
- Direct cause-and-effect between explanatory and
response variables
- Difficult to prove via an observational study; best
established through controlled experiments
- Common response
- Both explanatory and response variables change in response
to some third (lurking) variable
- Confounding
- Lurking variable(s) present
- Cannot distinguish between effects of several variables
upon the response variable
- Evidence for causation outside of a controlled experiment
- Association is very strong
- Association is consistent across numerous studies
- Higher levels of treatment associated with stronger
responses
- Alleged cause preceded (in time) the response
- Alleged cause is plausible
Terms to Know: cause-and-effect, lurking
variables, common response, confounding
Chapter 3
Understand the difference in attitudes when looking at data for:
- exploratory data analysis
- an answer to a specific question
Section 3.1
- Some Internet resources of (available) data
- Designs for producing data
- Sampling
- What are the advantages/disadvantages as compared
to a census?
- Observational studies vs. experiments
- What types of questions can/cannot be answered using
an observational study? (see also Section 3.2)
- Be able to give examples of each type.
Terms to Know: anecdotal evidence, available data,
designs, sample, census, observational study, experiment,
confounding
Section 3.2
- Advantages of experiments (over observational studies):
- Provide good evidence for causation
- Can minimize the effect of lurking variables
- Can study effects of combined factors
- Design of experiments
- Control
- Comparisons between treatments
- Arranging experimental units (subjects) into blocks
- Compare to strata for sampling
- Matched pairs (only when there are just
two treatments)
- Randomization
- Purpose: to remove effects of lurking variables
- Complete randomization vs. randomization within blocks
- Use of Minitab and Table B to randomize (a sketch follows this list)
- Replication (many experimental units reduce effects of chance variation)
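A minimal Python sketch of randomization in place of Table B or Minitab (20 hypothetical subjects, two treatments):

```python
import random

subjects = list(range(1, 21))     # subjects labeled 1-20
random.seed(2)                    # fixed seed so the example is reproducible
random.shuffle(subjects)

# First 10 shuffled labels get treatment 1, the rest treatment 2
print(sorted(subjects[:10]), sorted(subjects[10:]))
```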
Terms to Know: experimental units/subjects, treatment,
factors, explanatory/response/lurking variables, level, placebo,
placebo effect, causation, bias, control/experimental group,
matching, randomization, statistical significance,
blind/double-blind study, lack of realism, blocks
Section 3.3
- Some designs for sampling
- Voluntary response samples
- Ex: A TV news magazine gives an 800-number for
the audience to express its opinion.
- Highly subject to bias
- Probability samples
- Each member of the population is assigned a certain
probability for being chosen and random chance is
used to choose
- SRS is special case when each member of population is assigned the same probability (i.e., equally likely to be chosen; a sketch follows this list)
- Stratified random samples
- Population is divided into strata
(analogous to blocks from experiments)
- An SRS is taken within each stratum
- Multistage samples
- A hybrid of stratifying and using SRS
- Population is stratified (or subdivided)
- SRS used to select subdivisions to work with
- Makes door-to-door interviews more
practical (less costly)
- Issues surrounding surveys (and statistical studies in general)
- Questions to ask in determining the
soundness of a statistical study
- Given the answers to the above list of questions, what
types of bias is the study prone to?
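A minimal Python sketch of drawing an SRS (hypothetical population of 100 labeled members):

```python
import random

population = list(range(1, 101))   # members labeled 1-100
random.seed(4)                     # fixed seed for reproducibility

# Every member has the same chance of ending up in the sample
print(sorted(random.sample(population, 5)))
```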
Terms to Know: population, sample, undercoverage,
nonresponse/response bias
Section 3.4
- Simulations
- Understand how simulations give a sense of the
amount of variability in a statistic for various
sample sizes.
- Be able to use Minitab to produce a sampling distribution (such as Fig. 3.6) when given (a simulation sketch follows this list):
- sample size
- actual value of parameter p
- Questions:
- If the sample proportion of an SRS is an unbiased estimator, what would an example of a biased estimator look like?
- In Example 3.22 we saw that 95% of the time an SRS of
sample size 2500 comes within 2% of the actual parameter
value 0.6. Is this true if the sample size is the same
as the size of the population (i.e., a census)?
Why or why not?
- Just how can, in the words of the authors on p. 275, "the true distance of a statistic from the parameter it is estimating ... be much larger than the sampling distribution suggests"?
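A minimal Python simulation of the sampling distribution of a sample proportion, echoing Example 3.22 (p = 0.6, n = 2500):

```python
import random

random.seed(5)
p, n, reps = 0.6, 2500, 1000

# Each rep: draw a sample of size n, record the sample proportion
phats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

# Fraction of samples whose phat lands within 2% of the parameter
print(sum(abs(ph - p) <= 0.02 for ph in phats) / reps)   # roughly 0.95
```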
Terms to Know:
parameter, statistic, sampling variability, simulation,
sampling distribution, unbiased estimator, capture-recapture
sampling
Chapter 4
Section 4.1
- Definition of probability
- Empirical (probability = long-term relative frequency; a sketch follows this list)
- Relies entirely on randomness
- Short-term results are unpredictable
- Long-term pattern behavior
- Independence of trials
- Fundamental nature of the assumption (Q: Is probability theory anti-religion?)
- Difficult for people to accept
- Law of Averages (the gambler's fallacy): "I just got 10 heads in a row; must be due for some tails."
- Myth of short-run regularity: "I made 10 shots in a row; must have the touch tonight."
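A minimal Python sketch of probability as long-run relative frequency (fair-coin tosses):

```python
import random

random.seed(6)
# Short-run results are erratic; the relative frequency of heads
# settles toward 0.5 only over many tosses.
for n in (10, 100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)
```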
Terms to Know: random, probability (empirical vs. intuitive; see Section 4.2, p. 297, for the latter)
Section 4.2
- Probability Model (Note the connection to
sampling distribution)
- List of all possible outcomes (sample space)
- Assignment of a probability to each outcome
- 0 ≤ P(A) ≤ 1 for every event A
(Note how probabilities are assigned under the assumption that the finitely-many outcomes are equally likely.)
- If S denotes the sample space, then P(S) = 1
- P(A^c) = 1 - P(A) for every event A
- Sum rule: P(A or B) = P(A) + P(B)
- Requires events A and B
to be disjoint (use of Venn diagram
to determine this)
- P(A) = Σ P(Ai), where the Ai are the (finitely-many) individual outcomes in event A
- See Section 4.5 for more general sum rule
(one that applies even for non-disjoint events)
- Computing probabilities using multiplication rule:
P(A and B ) = P(A) P(B)
- Events A and B must be independent
- Different concept from disjointness
- Independence cannot be determined from Venn
diagram (unlike disjointness)
- See Section 4.5 for more general multiplication rule (one that applies even for non-independent events; a sketch of these rules follows this list)
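A minimal Python sketch of a finite probability model with the sum, complement, and multiplication rules (one roll of a fair die; exact fractions avoid rounding):

```python
from fractions import Fraction

# Equally likely outcomes for one roll of a fair die
P = {i: Fraction(1, 6) for i in range(1, 7)}
prob = lambda event: sum(P[o] for o in event)

A, B = {1, 2}, {5, 6}                   # disjoint events
print(prob(A | B), prob(A) + prob(B))   # sum rule: both equal 2/3
print(1 - prob(A))                      # complement rule: P(A^c) = 2/3

# Multiplication rule for independent trials (two separate rolls):
# P(first even and second even) = P(even) * P(even) = 1/4
print(prob({2, 4, 6}) * prob({2, 4, 6}))
```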
Terms to Know: probability model, outcome,
sample space, event, disjoint events, independent events,
complement of an event, addition (sum)/multiplication rule
Section 4.3
- Discrete random variable
- A certain kind of probability model
- Lends itself well to display via a probability
histogram. Some have special names:
- Figure 4.6(a) (probability of obtaining
a certain digit from a table of random
digits) is example of a uniform
probability distribution
- Figure 4.8 (Example 4.16) is an example
of a binomial probability
distribution
- Continuous random variable
- Probability model is specified by a density curve
- Probability of an event corresponds to an area under the curve. Note the implications of P(S) = 1. (A sketch follows this list.)
- Individual outcomes have probability 0.
Thus P(X < v) and P(X ≤ v) are equal.
- Important class of examples are the
normally-distributed continuous random
variables
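A minimal Python sketch contrasting a discrete uniform model with areas under a density curve (the standard normal is chosen only for illustration):

```python
from statistics import NormalDist

# Discrete: uniform probabilities on the digits 0-9 (cf. Figure 4.6(a))
pmf = {d: 0.1 for d in range(10)}
print(sum(pmf.values()))          # probabilities sum to 1

# Continuous: P(X <= v) is an area; individual outcomes have
# probability 0, so P(X < v) and P(X <= v) coincide.
print(NormalDist().cdf(1.0))      # P(X <= 1) for standard normal X
```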
Terms to Know: (discrete/continuous)
random variable, density curve, probability histogram
Section 4.4
- Summarizing probability distributions
- Mean (expected value) μX of a random variable X
- Know how to calculate for discrete random variables (use weighted average; a sketch follows this list)
- Law of Large Numbers and estimation of μ
- Variance/standard deviation for a discrete random
variable
- Rules for means, variances
- under linear transformation: a + bX
- of the sum of two random variables (Note the
independence requirement for variances)
- Law of Large Numbers
- Allows stable prediction of random outcomes
- Does not tell how large
- Contrast to the various "laws of small numbers" on p. 332; randomness is generally misunderstood by the public
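A minimal Python sketch of the mean and variance of a discrete random variable as weighted averages (hypothetical distribution):

```python
# value -> probability for a hypothetical discrete random variable X
dist = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

mu = sum(x * p for x, p in dist.items())                 # weighted average
var = sum((x - mu) ** 2 * p for x, p in dist.items())    # weighted average
print(mu, var, var ** 0.5)      # mean 1.9, variance 0.89, SD ~0.943

# Rule for means under a linear transformation a + bX:
a, b = 10, 2
print(a + b * mu)               # mean of a + bX
```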
Terms to Know: weighted average,
mean/variance/standard deviation of
a probability distribution (mean = expected value),
Law of Large Numbers
Section 4.5
- Conditional probability
- Events need not be independent
- Q: What does P(B | A ) equal
when A and B are independent?
- Expanded rules for computing probabilities
- Addition Rule (Inclusion-Exclusion Principle):
P(A or B ) = P(A) + P(B) - P(A
and B )
- Multiplication Rule: P(A and B) = P(A)
P(B | A )
- Combinations of these rules and use of
tree diagrams
- Bayes' Rule (a sketch follows this list)
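A minimal Python sketch of Bayes' rule via a two-branch tree (the diagnostic-test numbers are hypothetical):

```python
p_d = 0.01            # P(D): prevalence (hypothetical)
p_pos_d = 0.95        # P(+ | D)
p_pos_not_d = 0.05    # P(+ | not D)

# Multiplication rule down each branch of the tree, then total probability
p_pos = p_d * p_pos_d + (1 - p_d) * p_pos_not_d

# Bayes' rule: P(D | +) = P(D) P(+ | D) / P(+)
print(round(p_d * p_pos_d / p_pos, 3))    # about 0.161
```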
Terms to Know: union, intersection,
conditional probability, personal probability
Chapter 5
Section 5.1
- Binomial Distributions
- A discrete probability distribution
- Applicable situations (i.e., binomial
settings)
- n (a fixed number) independent
observations (rule-of-thumb:
population-to-sample size ratio
at least ten is sufficient for
approximate independence; see
Example 5.3 and following)
- Each observation falls into one of
two categories (call them successes
and failures)
- Probability of success is p (fixed)
for each observation
- A different binomial distribution B(n, p)
for each pair of values n and p
- Skewed for small values of n,
p
- Approximately normal when np ≥ 10 and n(1 - p) ≥ 10 (that is, when the expected number of successes and failures are both at least 10)
- Computing probabilities for binomially-distributed
random variables
- Table C
- Using calculator/Minitab
- Formulas for mean, standard deviation of
B(n, p) (bottom, p. 380)
- Count vs sample proportion
- Both are natural random variables for
categorical outcomes
- proportion = (count)/(number of observations)
- Mean and standard deviation formulas for proportion
- Valid if count is binomially distributed;
approx. valid in SRS where population-to-sample
size ratio is at least 10
- Mean formula shows sample proportion for
an SRS is unbiased estimator of parameter
p
- S.D. formula quantifies how spread goes
down as sample size goes up
- Probability distribution of sample proportion
- Not binomially distributed even when count is
- Like count, it is approximately normal when np ≥ 10 and n(1 - p) ≥ 10
- Approximating binomially-distributed random variable
with normal distribution
- As a rule, do only if np ≥ 10 and n(1 - p) ≥ 10
- Expect better results if p ≈ 1/2
- Use continuity correction when n
is not large
- Binomial probabilities P(X = k)
- Found in Table C for certain values of n, k and p
- Formula from which these table values come (see p. 388; a sketch follows this list)
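A minimal Python sketch of the binomial formula behind Table C and the normal approximation with continuity correction (hypothetical n and p, chosen so np ≥ 10 and n(1 - p) ≥ 10):

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.3              # hypothetical B(n, p); np = 15, n(1 - p) = 35

# Exact: P(X <= 12) from the binomial formula behind Table C
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(13))

# Normal approximation with continuity correction: P(X <= 12) ~ P(Y <= 12.5)
mu, sigma = n * p, sqrt(n * p * (1 - p))     # mean and SD of B(n, p)
approx = NormalDist(mu, sigma).cdf(12.5)
print(round(exact, 4), round(approx, 4))     # the two should be close
```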
Terms to Know: population vs.
sampling distribution (related to parameter
vs. statistic ), sample proportion,
binomial distribution, count, continuity correction,
success/failure, factorial, unbiased estimator (p. 382)
Section 5.2: Sampling Distribution of Sample Mean
- Sample mean in an SRS of size n
- A random variable x̄
- Individual observations Xi also random variables
- Each Xi distributed as the population, if population is large compared to size n of sample
- Mean and S.D. for each Xi is population mean and S.D.: μ and σ
- Definition of x̄: x̄ = (1/n) Σ Xi
- Mean and standard deviation of x̄: see p. 399
- Distribution of mean x̄ compared to population distribution
- Distribution of x̄ is
- normal if population (individual Xi) is normal
- increasingly normal (as sample size n increases) even if population is not (Central Limit Theorem; a simulation sketch follows this list)
- Spread for x̄ not as great as for population; decreases as n increases (reflected in formula for S.D.)
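A minimal Python simulation of the CLT using a skewed (exponential) population with mean 1 and SD 1; the sketch checks that x̄ centers near μ with SD near σ/√n:

```python
import random
from statistics import mean, stdev

random.seed(7)
n, reps = 40, 2000

# Each rep: an SRS-like sample of size n from an exponential population
xbars = [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

# Center near 1 (the population mean); spread near 1/sqrt(40) ~ 0.158
print(round(mean(xbars), 3), round(stdev(xbars), 3))
```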
Terms to Know: sampling distribution, sample
mean, unbiased estimator
Section 5.3
Chapter 6: Inference on the mean
Section 6.1: Confidence intervals
- Nature of confidence intervals
- Range from (estimate - margin of error)
to (estimate + margin of error)
- Have an associated confidence level C
- C percent of the time the confidence interval for the estimated statistic contains the parameter (see Fig. 6.2, p. 438)
- Desirable to have C as high as possible (usual values: 90%, 95%, 99%)
- Margin of error
- Desirable to have as small as possible
(at odds with desire for large
confidence interval)
- Can decrease in one of three ways
(see bullets on p. 442)
- Assume unbiased estimator of parameter; account only for chance variation
- Confidence interval for a population mean (a sketch follows this list)
- Underlying assumptions
- Population is normally distributed
- Without this, confidence won't be as great
as advertised
- With n ≥ 15 (# of observations) and no extreme outliers or skewness, confidence isn't severely compromised
- Data is unbiased, subject only to
random sampling error
- Data is an SRS of the population, or can be
considered as one (not a
multistage/stratified sample)
- Extends from (sample mean - margin of error)
to (sample mean + margin of error)
- The fine print
- If possible, explain and/or correct
outliers (nonresistance of sample mean)
- In practice, won't know σ; might substitute sample standard deviation s if large sample size
- Mustn't interpret
- confidence level as a probability that true
mean lies in interval; rather, as how often
the method gives correct answers
- confidence interval/level as a
prediction that C% of observations
lie inside this interval
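A minimal Python sketch of a level C confidence interval for a population mean when σ is known (the summary numbers are hypothetical):

```python
from statistics import NormalDist
from math import sqrt

xbar, sigma, n, C = 272.0, 60.0, 840, 0.95    # hypothetical summary values

z_star = NormalDist().inv_cdf((1 + C) / 2)    # critical value (1.96 here)
m = z_star * sigma / sqrt(n)                  # margin of error
print(round(xbar - m, 1), round(xbar + m, 1)) # estimate +/- margin of error
```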
Terms to Know: margin of error,
inference, confidence level
Section 6.2: Significance Tests
- Hypotheses
- Null hypothesis (H0 )
- A supposition about a population parameter: p = p0 (in this section, μ = μ0)
- Will test compatibility of H0 with sample statistic
- Alternative hypothesis (Ha )
- Statement of an alternative to H0
we suspect to be true
- One-sided (Ha: p > p0 or Ha: p < p0) vs. two-sided (Ha: p ≠ p0)
- Test for mean
- Underlying assumption: sample mean is normally-distributed as N(μ, σ/√n)
- True if population is normally distributed
- Approximately true if sample size n is large
- Compute test statistic (a z-score) for sample
mean assuming hypothesized population mean
- Get associated P-value (probability associated with
test statistic and Ha ; see box, p. 461)
- Compare P-value to predetermined significance level α (a sketch follows this list)
- α is a proportion (i.e., it is between 0 and 1)
- Common levels of significance: 0.1, 0.05, 0.01
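A minimal Python sketch of the z test of this section (σ known; hypothetical numbers, two-sided alternative):

```python
from statistics import NormalDist
from math import sqrt

mu0, sigma, n, xbar = 100.0, 15.0, 36, 105.2   # hypothetical values

z = (xbar - mu0) / (sigma / sqrt(n))           # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided P-value
print(round(z, 2), round(p_value, 4))
# Statistically significant at alpha = 0.05 if p_value < 0.05
```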
Terms to Know: null hypothesis,
alternative hypothesis (1 or 2-sided), test
statistic, P-value, statistical significance
Section 6.3: Use/Abuse of Tests
- Significance tests are not appropriate for all data sets
- Outliers can exaggerate/de-emphasize significance
- Confounding is not removed by such tests
- Statistical significance establishes results are
unlikely due to random chance
- Does not provide reason for significance (could
be some suspected effect of a treatment, could be
poor study design)
- Significance level α
- Importance of choosing a level ahead of time when
decisions will be made based upon results
- Choosing a level: Consider
- how believable is the null hypothesis
- consequences of rejecting null hypothesis
- Avoid thinking of results as insignificant if α is not reached, significant if it is
- Misinterpreting statistical significance
- Statistical significance vs. practical importance
- Significance may lead to rejecting null hypothesis in
favor of alternative; lack of significance only means
results are consistent with null hypothesis
- Danger of searching for significance
Section 6.4
Chapter 7
Section 7.1: Inference for the Mean of
a Population
- t distributions
- Correct distribution for sample mean when σ (for underlying population) is not known and s (the sample standard deviation) is used in its place
- Description
- Standardized so centered about 0
- Symmetric and bell-shaped
- Larger spread than normal distribution
- Degrees of freedom
- df = n - 1
- More like N(0, 1) as df increases
- One-sample t confidence intervals
- Used in place of confidence interval for population mean (as learned in Section 6.1) when σ (for population) is unknown
- Determination of margin of error
- One-sample t statistic (determined for a confidence
level C from Table D) used in place of
z statistic
- Use standard error of sample mean in place of standard
deviation for sample mean
- One-sample t test (a sketch follows this list)
- Used in place of z test (see Section 6.2, p. 461) when σ unknown
- Formulate null/alternative hypotheses just as usual
- Determine t statistic as you would z
statistic, but using SE for sample mean rather than
SD
- Determine P-value from appropriate t
distribution (Table D)
- Note method of reporting conclusion (as at end of
Example 7.5, p. 511)
- Matched pairs t procedures (comparative
inference)
- Procedures are just like above, but performed on
the difference
- Usually have H0: μ = 0 and one-sided alternative hypothesis
- When are t procedures valid
- Exactly correct when population is normal
- Approximately correct when n ≥ 15 except in case of outliers or strong skewness
- Clear skewness (no outliers) OK if n ≥ 40
- Power of the t test
This is optional reading. To fully understand the discussion,
you ought to study Section 6.4 as well.
- Inference for non-normal populations
- Use a known distribution that is not normal but fits well
- Make a transformation that brings about normality
- Use distribution-free procedures (Example: the sign test)
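A minimal Python sketch of a one-sample (matched pairs) t test; the data are made up, and SciPy is assumed available for the t distribution:

```python
from statistics import mean, stdev
from math import sqrt
from scipy import stats           # assumed available

diffs = [2.3, -0.5, 1.8, 3.1, 0.9, 1.2, 2.0, -0.2]  # hypothetical differences
mu0 = 0.0                         # H0: mean difference is 0

n = len(diffs)
se = stdev(diffs) / sqrt(n)       # standard error of the sample mean
t = (mean(diffs) - mu0) / se      # one-sample t statistic
p = stats.t.sf(t, df=n - 1)       # one-sided P-value (Ha: mean > 0)
print(round(t, 2), round(p, 4))
```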
Terms to Know: standard error, one-sample
t, degrees of freedom, matched pairs test, robust
Section 7.2: Comparison of Two Means
- Context of two-sample problems
- Want to compare responses in two groups
- Often (but not exclusively) used in comparative experiments
- Usually comparisons made on groups' mean responses
- Groups can be considered as samples from distinct
populations
- Responses of units in one group independent of those in
the other
- Two-sample statistics
- Two-sample z statistic
- How the formula follows from previous (one-sample)
z statistics
- Two random variables, one from each group (measuring same
thing, but possibly having different distributions)
- Looking at difference between these variables, so
sample/population mean is difference of ones for each
group, variance for difference computed from individual
variances via formula, p. 337
- Is distributed normally (or approximately so) as N(0, 1)
when underlying populations are normal (or approximately)
- Used when standard deviations for underlying populations
are known (somewhat unusual)
- Two-sample t statistic
- Used when
- Samples and population distributions of both groups
satisfy conditions mentioned in Section
7.1 for t procedure validity, and
- standard deviations of populations are not known
- Formula is one arising naturally from that for two-sample
z statistic
- Does not have t distribution
- Is approximately t for the correct df
- Best df comes from formula, p. 549 (but use software or the method below instead of memorizing this; a sketch follows this list)
- We get good (conservative) estimate taking df = min{n1 - 1, n2 - 1}
- Inference on the difference of two population means
- Two-sample t significance test
- Null hypothesis: the population means are equal
- Notation used in results
- Interpretation of results
- Two-sample t confidence interval
- Interpretation of such an interval
- Robustness
- Most robust against nonnormality if sample sizes equal
- If sample sizes equal and distributions of two populations
the same, can take sample sizes as low as 5
- Using t procedures with small samples
- Optional material
- Software approximation for degrees of freedom
- Pooled two-sample t procedures
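A minimal sketch of a two-sample t test on made-up data, assuming SciPy is available; equal_var=False gives the unpooled statistic with the software df approximation:

```python
from scipy import stats   # assumed available

group1 = [24.1, 27.3, 22.8, 25.5, 26.0, 23.9]   # hypothetical samples
group2 = [21.2, 22.5, 20.8, 23.0, 21.9, 22.2]

# Two-sample t statistic for H0: mu1 = mu2 (two-sided)
t, p = stats.ttest_ind(group1, group2, equal_var=False)
print(round(t, 2), round(p, 4))
# The conservative by-hand df would instead be min(n1 - 1, n2 - 1) = 5
```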
Terms to Know: difference of sample means,
two-sample z and t statistics, conservative
estimates
Section 7.3: Optional Topics in
Comparing Distributions
Chapter 8
Section 8.1: Inference for a Single Proportion
- Large-sample confidence interval for population
proportion
- Basic assumptions
- Population-to-sample size ratio
is at least 10 (so count is
approximately binomially distributed)
- Sample size is large enough that expected value of
successes and failures are both at least 10 (so
binomial dist. well-approximated by normal dist.)
- Interval is (p̂ - m, p̂ + m)
- Margin of error determined differently than for inference on a population mean (see Section 6.1)
- Standard error (SE) of sample proportion: like standard deviation with sample proportion in place of true proportion (unknown parameter)
- Desired level of confidence (percentage) → z* (from the ∞ row of Table D)
- Large-sample significance test for population proportion
(H0 : p = p0 )
- Comparison to confidence interval
- Significance test good if specific (ideal)
p0 is suspected
- Confidence interval provides range of
compatible p
- Basic assumptions: as for confidence intervals but np0 ≥ 10 and n(1 - p0) ≥ 10
- P-values determined from appropriate choice of P(Z ≤ z), P(Z ≥ z) or 2P(Z ≥ |z|)
- Determination of sample size (a sketch follows this list)
Must first specify:
- a desired margin of error m
- a desired level (percentage) of confidence → z*
- a guessed value p* for the true proportion (can take worst-case guess p* = 0.5)
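A minimal Python sketch of the large-sample interval and the sample-size formula n = (z*/m)² p*(1 - p*), using hypothetical counts:

```python
from statistics import NormalDist
from math import sqrt, ceil

n, count = 500, 210                      # hypothetical sample
phat = count / n

z_star = NormalDist().inv_cdf(0.975)     # 95% confidence
se = sqrt(phat * (1 - phat) / n)         # standard error of phat
print(round(phat - z_star * se, 3), round(phat + z_star * se, 3))

# Sample size for margin of error m = 0.03, worst case p* = 0.5
m = 0.03
print(ceil((z_star / m) ** 2 * 0.5 * 0.5))   # about 1068
```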
Terms to Know: standard error,
approximate level C confidence interval,
null/alternative hypothesis, P -value,
z-statistic (test statistic), sample proportion
Section 8.2: Comparison of
Two Proportions
- Setting
- Have categorical data (one variable, 2 options)
for samples from two groups (populations)
- Want to compare proportions between populations
- Inference procedures on the difference of two population
proportions (Interpretation and notation)
- Confidence intervals
- Tests of significance
- Standard error arrived at in a somewhat different way than all previous standard errors (a sketch follows this list)
- pooled estimate (combining of two sample proportions)
- Null hypothesis: the two population proportions are equal
- Optional material
- Relative risk
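A minimal Python sketch of the pooled z test for H0: p1 = p2 (hypothetical counts):

```python
from statistics import NormalDist
from math import sqrt

n1, count1 = 200, 90      # hypothetical sample 1
n2, count2 = 250, 85      # hypothetical sample 2

p1, p2 = count1 / n1, count2 / n2
p_pool = (count1 + count2) / (n1 + n2)    # pooled estimate under H0

se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                        # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_value, 4))
```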
Terms to Know: difference of sample proportions,
pooled estimate
Chapter 9: Inference for Two-Way Tables
- Two-way tables
- Give counts for two categorical variables
- Can be used for categorical information (S or F) for
samples from 2 populations (like material studied
in Section 8.2)
- Variables may have more than two options, resulting
in more rows and/or columns
- Constructing them
- Columns for explanatory variable, rows for response variables
- Additional row/column for totals
- Grand total
- Row/column/marginal percents
- Test of significance
- Results in 2 × 2 case same as if you use procedures of Section 8.2
- Hypotheses
- Null hypothesis: No association between variables
- Leads to expected cell counts
- Rejected for small P-value (taken from Table
F; see below) in favor of alternative hypothesis
- Alternative hypothesis: association exists
- Always two-sided
- Exact nature of association ascertained by
looking at data and should be included in answer
(see, for example, the 1st paragraph on p. 632;
the last full paragraph on p. 636)
- chi-square statistic (a sketch follows this list)
- df = (#rows - 1)(#cols - 1)
- Distributed as χ²(df) (Table F) if
- table is 2 × 2 and each expected cell count is at least 5
- table is bigger than 2 × 2, each expected cell count is at least 1, average expected cell count is at least 5
- Two models for two-way tables (neither of which would
be open to including the same unit in counts appearing
in different cells)
- Explanatory variable is the population (i.e., each
column represents a different population as
in male vs. female; GM cars vs. Ford vs. Chrysler)
- Columns represent subdivisions within a single
population (as in categorizing Americans by their
age as in Table 4.1, p. 350; cats by their source
as in Exercise 9.3, p. 644; etc.)
- Optional material
- Meta-analysis
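A minimal sketch of the chi-square test on a hypothetical 2 × 3 two-way table, assuming SciPy is available:

```python
from scipy import stats   # assumed available

# Hypothetical counts: rows = response, columns = explanatory groups
table = [[30, 45, 25],
         [70, 55, 75]]

chi2, p, df, expected = stats.chi2_contingency(table)
print(round(chi2, 2), round(p, 4), df)   # df = (2 - 1)(3 - 1) = 2
# 'expected' holds the expected cell counts under H0 (no association);
# check they are large enough before trusting the chi-square P-value.
```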
Terms to Know: two-way table, cell,
row/column percentages, expected cell counts,
chi-square statistic, joint/conditional distributions
Chapter 12: One-Way Analysis of Variance
- Setting for its usage
- Two variables: one categorical, one quantitative
- Categorical variable usually is population (group)
to which unit belongs
- Extension of idea, called two-way
ANOVA, can deal with two categorical, one
quantitative variable
- Extension of 2-sample t test
- Comparison of means (of quantitative variable)
between groups
- Gives same results as 2-sample t test
when just two groups
- The model assumptions
- There are I populations
- A sample is drawn from each population
- Sample size from 1st population is N1, from 2nd population is N2, etc.
- xij represents the jth observation from the ith group
- x̄i represents the sample mean (statistic) within the ith group
- x̄ represents the sample mean (statistic) for all observations in all groups
- Each population is normally distributed about a mean μi with standard deviation σi (parameters)
Assumption should be checked when possible by
looking at histograms/normal quantile plots within
each group
- Each σi is the same (i.e., σi = σ for each i)
- If not, the problem can often (but not always)
be overcome with a transform of the data
- Not usually worth formal test to see if S.D.s are the same; consider OK if rule in box on middle of p. 752 is satisfied
- Estimate σ using pooled (sample) standard deviation sp (sp² = MSE in Minitab output; see formula on bottom of p. 752)
- One-way ANOVA test
- A test of significance
- H0: "no difference in mean between
groups"
- Ha: mean in at least one group
differs from other groups
- Test statistic
- F statistic
- F = MSG/MSE (a sketch follows this list)
- Use of dfs in computing MSG/MSE from
SSG/SSE
- Degrees of freedom in numerator:
DFG = I - 1
- Degrees of freedom in denominator: DFE =
N - I (N is total number
of units across all groups)
- Gives ratio of variation among group means
to variation within groups
- New distribution, called an F distribution
- Table E in back of text
- Requires knowledge of F statistic, df
for numerator (DFG), df for denominator (DFE)
- Coefficient of determination R² = SSG/SST (as in regression, it indicates the fraction of the total variation in the samples that is explained by differences among the groups)
- If test demonstrates significance, further analysis
must be done to determine how means vary
between groups; some alternatives:
- Graphical displays (side-by-side boxplots,
histograms, etc.)
- Contrasts
- Preferable when investigator has predisposed
opinion about how means will compare in
various groups
- We will not study these
- Multiple comparisons
- Inspect difference of means between any two
groups (idea is like, though not the same as,
using a 2-sample t test on each possible
pairing of groups)
- Tests of significance are possible on differences of these means; we will not do these
- Confidence intervals on pairs of differences of
means: Be able to
- understand/interpret Minitab output
for Tukey's/Fisher's Pairwise Comparisons
- understand why the individual and overall (or "family") error rates differ
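A minimal sketch of the one-way ANOVA F test on made-up samples from I = 3 groups, assuming SciPy is available:

```python
from scipy import stats   # assumed available

g1 = [24.1, 27.3, 22.8, 25.5, 26.0]   # hypothetical group samples
g2 = [21.2, 22.5, 20.8, 23.0, 21.9]
g3 = [25.8, 26.4, 24.9, 27.0, 25.2]

F, p = stats.f_oneway(g1, g2, g3)     # F = MSG/MSE
print(round(F, 2), round(p, 4))
# df: numerator DFG = I - 1 = 2; denominator DFE = N - I = 15 - 3 = 12
```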
Terms to Know: one-way ANOVA, group,
variation among/within groups, ANOVA table, degrees
of freedom (DFG, DFE, DFT), sum of squares (SSG, SSE,
SST), mean squares (MSG, MSE), F statistic,
multiple comparisons, coefficient of determination (R²), pooled standard deviation (sp)
Chapter 13: Two-Way ANOVA
Terms to Know: