## Statistics 204: Elementary Statistics

- John Rice, UCB (course page, lecture notes)
- Jürgen Symanzik, Utah State, Logan (quizzes and exams)
- Michael Wichura, U of Chicago (course page, handouts)

- Stat 204 MWF Fall 2019 schedule

1. Controlled Experiments

- James Lind (1716–1794), Wikipedia
- one of the first clinical trials, scurvy
- Sir Austin Bradford Hill (1897–1991), Wikipedia
- streptomycin and tuberculosis, smoking and lung cancer
- Sir Richard Doll (1912–2005), Wikipedia
- smoking and lung cancer
- Jonas Salk (1914–1995), Wikipedia
- Salk polio vaccine

- James Lind, scurvy
- Sir Austin Bradford Hill, streptomycin and tuberculosis
- Jonas Salk, polio

*control* - controlled vs. randomized controlled

*placebo* - treatment vs the idea of treatment

*blind* - participants
- physicians

*modern clinical trial* - randomized
- controlled
- double blind

- clinical trial (zip html)
- polio (zip html)

2. Observational Studies

- Sir Austin Bradford Hill and Sir Richard Doll, lung cancer
- British Doctors Study
- tobacco warning messages
- smoking and cancer, National Cancer Institute
- smoking (zip html)
- minimum-wage hikes in New Jersey and Pennsylvania, Card and Krueger

3. The Histogram

- dotplots
- histograms
- barcharts
- mosaic plots
- reading: OpenIntro Statistics, dot plots and the mean, pp.28-30
- reading: OpenIntro Statistics, histograms and shape, pp.30-32
- plots (zip html)
- working on my histograms (zip html)
- education statistics (zip html)

- web app: boxplots and histograms
- web app: Describing and Exploring Categorical Data
- web app: Describing and Exploring Quantitative Variables
- random variables
- *X ~ U(0, 1)* - uniform distribution (zip html)
- *Z ~ N(0, 1)* - standard normal distribution (zip html)

4. The Average and the Standard Deviation

- RA Fisher (1890–1962), Wikipedia
- "a genius who almost single-handedly created the foundations for modern statistical science"
- "the single most important figure in 20th century statistics"

*E[nums]* = mean or average or expected value of a set of numbers

*standard deviation*
- *RMS[nums]* = root-mean-square of a set of numbers
- RMS (zip html)
- differences from the mean of a set of numbers = from each number subtract the mean of the numbers
- *SD[nums]* = RMS[differences from the mean of a set of numbers]
- SD (zip html)
- mean, standard deviation, not robust
- reading: OpenIntro Statistics, variance and standard deviation, pp.32-34
- median, quartiles, IQR, more robust
- quartiles (zip html)
- web app: Mean versus Median
- five-number summary, box and whiskers
- Shakespeare (zip html)
- the 1.5 * IQR rule for outliers
- outliers (zip html)
- Old Faithful (zip html)
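
The mean/RMS/SD recipes and the 1.5 * IQR rule can be checked in a short sketch (Python here for convenience; the course handouts use R). The data values are made up:

```python
import math
import statistics

def rms(nums):
    """Root-mean-square: square the numbers, take the mean, take the root."""
    return math.sqrt(sum(x * x for x in nums) / len(nums))

def sd(nums):
    """SD[nums] = RMS of the deviations from the mean."""
    mu = sum(nums) / len(nums)
    return rms([x - mu for x in nums])

nums = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up data
print(sum(nums) / len(nums), sd(nums))   # mean 5.0, SD 2.0

# quartiles, IQR, and the 1.5 * IQR rule for outliers
q1, q2, q3 = statistics.quantiles(nums, n=4, method="inclusive")
iqr = q3 - q1
outliers = [x for x in nums if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(outliers)   # [9]
```

Note the SD here divides by *n* (the RMS definition used in the course), not by *n - 1*.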

5. The Normal Approximation for Data

- Abraham de Moivre (1667–1754), Wikipedia
- discovered the normal curve
- Adolphe Quetelet (1796–1874), Wikipedia
- compared histograms to normal curves

*z*-scores

*the Normal approximation for data*
- draw a number line and shade the interval (p.86)
- convert to standard units and shade the standard-units interval
- sketch the normal curve and shade the area under the curve and above the shaded standard-units interval
- the proportion is approximately equal to the shaded area under the normal curve

- scores (zip html)
- test scores (zip html)
- UGA (zip html)
- UGA solutions (zip html)
- the standard normal distribution, *N(0, 1)*
- *pnorm, qnorm*
- empirical rule, 68-95-99.7% (zip html)
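
A sketch of the normal-approximation recipe and of R's *pnorm*/*qnorm*, using Python's standard library as an illustrative stand-in for the course's R examples:

```python
from statistics import NormalDist

Z = NormalDist()   # the standard normal distribution, N(0, 1)

def approx_proportion(a, b, mean, sd):
    """Normal approximation: proportion of the data between a and b."""
    za, zb = (a - mean) / sd, (b - mean) / sd   # convert to standard units
    return Z.cdf(zb) - Z.cdf(za)                # area under the normal curve

# empirical rule: about 68% of the data lie within 1 SD of the mean
print(round(approx_proportion(-1, 1, 0, 1), 4))   # 0.6827

# Python analogs of R's pnorm and qnorm:
print(Z.cdf(1.96))        # pnorm(1.96), about 0.975
print(Z.inv_cdf(0.975))   # qnorm(0.975), about 1.96
```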

- web app: distribution calculator (normal, binomial, t, F, chi-squared)

6. Measurement Error

- measure the same thing a number of times
- the collection of such measurements is often approximately bell-shaped
- mean is the average measurement
- standard deviation is the likely size of chance measurement error

*rounding* - rules for rounding numbers
- significant digits
- never round numbers in the middle of an analysis
- appropriately round results at the end of an analysis

7. Plotting Points and Lines

- coordinates of a point, *(x, y)*
- a = y-intercept
- b = slope
- y = a + bx

8. Correlation

- Karl Pearson (1857–1936), Wikipedia
- founded mathematical statistics and the first university statistics department
- Pearson's correlation coefficient

- scatterplots
- Karl Pearson's correlation coefficient, *r* (zip html)
- gallery of correlation coefficients (zip html)

- web app: guess the correlation
- web app: Association Between Two Quantitative Variables
- five-number summary: *x.bar, SDx, y.bar, SDy, r*
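
The correlation coefficient is the average of the products of the two variables in standard units; a minimal sketch with made-up data (Python here, though the course uses R):

```python
import math

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return math.sqrt(sum((t - m) ** 2 for t in v) / len(v))

def corr(x, y):
    """r = average of the products of x and y in standard units."""
    mx, my, sx, sy = mean(x), mean(y), sd(x), sd(y)
    return mean([(a - mx) / sx * (b - my) / sy for a, b in zip(x, y)])

x = [1, 2, 3, 4, 5]        # made-up data
y = [2, 1, 4, 3, 5]
print(round(corr(x, y), 3))   # 0.8
```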

9. More about Correlation

- *r(x, y) = r(y, x)*
- *r(a + x, y) = r(x, y)*
- *r(a * x, y) = r(x, y)* for *a* > 0
- *r* measures linear association, not association in general
- ecological correlations
- association is not causation
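
The invariance properties are easy to check numerically; a quick sketch with simulated data:

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return math.sqrt(sum((t - m) ** 2 for t in v) / len(v))

def corr(x, y):
    """r = average of the products of x and y in standard units."""
    mx, my, sx, sy = mean(x), mean(y), sd(x), sd(y)
    return mean([(a - mx) / sx * (b - my) / sy for a, b in zip(x, y)])

random.seed(0)
x = [random.random() for _ in range(100)]
y = [xi + random.random() for xi in x]   # made-up linear-plus-noise data

r = corr(x, y)
assert abs(corr(y, x) - r) < 1e-9                     # r(x, y) = r(y, x)
assert abs(corr([3 + xi for xi in x], y) - r) < 1e-9  # adding a constant
assert abs(corr([2 * xi for xi in x], y) - r) < 1e-9  # positive scaling
print(round(r, 3))
```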

10. Regression

- the graph of averages
- smoothers, smoothing local regression
- the regression line
- when *x* increases by *SDx*, the predicted *y* increases by *r * SDy*
- regression toward the mean
- regression fallacy
- two regression lines: *y* on *x* and *x* on *y*

- web app: simple linear regression

11. The R.M.S. Error for Regression

- residuals
- *residual: error = y - predicted y*
- *residual: e = y - y.hat*
- RMS error = *RMS[residuals]*
- residual plot
- vertical strips
- normal distributions inside vertical strips
- *RMS error for regression line = sqrt(1 - r^2) * SD of y*
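
The identity RMS error = sqrt(1 - r^2) * SD of y can be verified directly; a sketch with made-up data:

```python
import math

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return math.sqrt(sum((t - m) ** 2 for t in v) / len(v))

def corr(x, y):
    mx, my, sx, sy = mean(x), mean(y), sd(x), sd(y)
    return mean([(a - mx) / sx * (b - my) / sy for a, b in zip(x, y)])

x = [1, 2, 3, 4, 5]        # made-up data
y = [2, 3, 5, 4, 6]

r = corr(x, y)
slope = r * sd(y) / sd(x)                # regression-line slope
intercept = mean(y) - slope * mean(x)    # line through the point of averages
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
rms_error = math.sqrt(mean([e * e for e in residuals]))   # RMS of the residuals
print(round(rms_error, 4))
print(abs(rms_error - math.sqrt(1 - r ** 2) * sd(y)) < 1e-9)   # True
```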

- web app: diagnostics for simple linear regression

12. The Regression Line

- Alice Lee (1858–1939), Wikipedia
- one of the first women to graduate from London University
- worked with Karl Pearson from 1892

- slope and intercept
- slope = *r * SDy / SDx*
- the regression line goes through the point of averages, *(x.bar, y.bar)*
- for an observational study, the regression line only describes the data; it is not predictive
- least squares and "the best-fitting line"
- the best-fitting line minimizes RMS error
- R's procedure *lm*
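
A sketch of the slope/intercept recipe with made-up data; slope = *r * SDy / SDx* reduces to cov(x, y)/var(x), and R's `lm(y ~ x)` would return the same coefficients:

```python
def mean(v):
    return sum(v) / len(v)

x = [1, 2, 3, 4, 5]        # made-up data
y = [2, 3, 5, 4, 6]

mx, my = mean(x), mean(y)  # the point of averages
# least-squares slope: r * SDy/SDx = cov(x, y)/var(x)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx   # the line goes through (x.bar, y.bar)

def predict(t):
    return intercept + slope * t

print(slope, intercept)                    # 0.9, 1.3
print(abs(predict(mx) - my) < 1e-9)        # True: passes through the point of averages
```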

*Pearson and Lee, mothers and daughters* - mothers and daughters (zip html)

13. What Are the Chances?

- Venn diagrams
- probability of a complement
- probability of a union
- probability of an intersection

*conditional probability* - probability of A given B
- 2x2 contingency tables
- mosaic plots
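
Conditional probability from a 2x2 contingency table, sketched with hypothetical counts (the labels and numbers are made up):

```python
# hypothetical 2x2 contingency table of counts
table = {("smoker", "disease"): 20, ("smoker", "no disease"): 80,
         ("nonsmoker", "disease"): 10, ("nonsmoker", "no disease"): 190}
total = sum(table.values())   # 300

def p(event):
    """P(A): total count of cells where the event holds, over the grand total."""
    return sum(v for k, v in table.items() if event(k)) / total

def p_given(a, b):
    """Conditional probability: P(A | B) = P(A and B) / P(B)."""
    return p(lambda k: a(k) and b(k)) / p(b)

disease = lambda k: k[1] == "disease"
smoker = lambda k: k[0] == "smoker"

print(p(disease))                  # 30/300 = 0.1
print(1 - p(disease))              # complement: 0.9
print(p_given(disease, smoker))    # 20/100 = 0.2
```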

14. More about Chance

- mutual exclusion
- independence

15. The Binomial Formula

- Blaise Pascal (1623–1662), Wikipedia
- built a mechanical calculator while still a teenager
- co-founded probability theory through correspondence with Pierre de Fermat
- Pierre de Fermat (1607–1665), Wikipedia
- a French judge who was very interested in mathematics
- co-founded probability theory through correspondence with Blaise Pascal

*n* choose *k* - binomial coefficient
- Pascal's triangle

*binomial distribution* - probability of *k* heads in *n* flips of a coin
- what is the shape of the binomial distribution ...
- if the probability of a head is near 0.5?
- if the probability of a head approaches 0 or 1?
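
The binomial formula and the shape questions, sketched with Python's `math.comb` (the course itself uses R):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips: C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binom_pmf(2, 4, 0.5))   # 6/16 = 0.375

# shape: near p = 0.5 the histogram is symmetric; near 0 or 1 it is skewed
print(max(range(11), key=lambda k: binom_pmf(k, 10, 0.5)))   # mode at 5
print(max(range(11), key=lambda k: binom_pmf(k, 10, 0.1)))   # mode at 1
```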

- web app: the binomial distribution

- chapters 16-18 are about the sampling distribution of a statistic
- we are especially interested in the distribution of the *sum* of the draws from a box model

16. The Law of Averages

- John Kerrich (1903–1985), Wikipedia
- mathematician noted for a series of experiments in probability
- which he conducted while interned in Nazi-occupied Denmark in the 1940s

- John Kerrich
- coin flipping

*the law of averages* - there is chance error in the number of heads
- number of heads = (number of tosses)/2 + chance error
- the error is likely to be large in absolute terms
- but small relative to the number of tosses
- the probability of heads on a single toss is always the same
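
A Kerrich-style simulation illustrates the law of averages: the chance error grows in absolute size but shrinks relative to the number of tosses (a sketch; Kerrich flipped his coin by hand):

```python
import random

random.seed(1)

# number of heads = (number of tosses)/2 + chance error
for n in [100, 10_000, 1_000_000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    chance_error = heads - n / 2
    # absolute error tends to grow with n; relative error shrinks
    print(n, chance_error, chance_error / n)
```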

*box models* - draw from a box
- specify the numbers in the box, and how many of each kind
- specify the number of draws, *n*
- we are often interested in *X* = sum of the draws

17. The Expected Value and Standard Error

- sum of the draws from a box
- expected value for the sum of the draws = number of draws * average of the box
- *E[sum] = n * mu*

*standard error for the sum of the draws, SE[sum]*
- SE[sum] = sqrt(n) * SD of the box
- SD of the box = RMS of the deviations from the average of the numbers in the box
- deviation from the average for a number in the box = x - mu
- there is a short-cut for calculating the SD of the box if the box contains only two numbers
- if the only numbers in the box are 0s and 1s, then *SD of the box = sqrt(p * (1 - p))*, where *p* = proportion of 1s in the box
- in this special case, *SE[sum] = sqrt(n) * SD of the box = sqrt(n) * sqrt(p * (1 - p))*

*interpretation* - the sum of the draws is likely to be around *E[sum]*, give or take *SE[sum]* or so

*simulation* - draw lots of tickets from the box
- display the results in a histogram
- simulated sampling distribution
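
The EV/SE recipe, the 0-1 shortcut, and a small simulation, sketched with a die-roll box model:

```python
import math
import random

box = [1, 2, 3, 4, 5, 6]   # box model for the sum of 25 die rolls
n = 25

mu = sum(box) / len(box)                                        # average of the box
sd_box = math.sqrt(sum((t - mu) ** 2 for t in box) / len(box))  # SD of the box
ev_sum = n * mu                  # E[sum] = n * mu = 87.5
se_sum = math.sqrt(n) * sd_box   # SE[sum] = sqrt(n) * SD of the box, about 8.54

# simulation: draw lots of sums; their histogram is the simulated sampling distribution
random.seed(0)
sums = [sum(random.choice(box) for _ in range(n)) for _ in range(10_000)]
sim_mean = sum(sums) / len(sums)
print(ev_sum, round(se_sum, 2), round(sim_mean, 1))   # sim_mean is close to 87.5

# 0-1 box shortcut: for the count of 2s, SD of the box = sqrt(p * (1 - p))
p = 1 / 6
print(round(math.sqrt(p * (1 - p)), 4))   # 0.3727
```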

*normal distribution* - sum of the draws from a box
- what is the chance that the sum falls in a given interval, *(a, b)*?
- convert the interval to standard units using *E[sum]* and *SE[sum]*
- *(a.prime, b.prime) = ((a - E[sum])/SE[sum], (b - E[sum])/SE[sum])*
- then measure the area of the region under the standard normal curve and above the transformed interval
- that area is the desired probability

*box models*

*box model for a sum* - roll a die 10 times
- how many spots?
- numbers on tickets
- answer is the distribution of the *sum* of the draws

*box model for a count* - how many 2s?
- 0s and 1s on tickets
- answer is the distribution of the *sum* of the draws

- gambling and unfair bets

- web app: sampling and standard error

18. The Normal Approximation for Probability Histograms

- simulation
- probability histograms
- normal approximations
- sampling distribution of a sum

- web app: Central Limit Theorem for Proportions
- web app: Central Limit Theorem for Means

- chapters 19-24 are about estimating parameters from data
- and calculating margins of error (p.486)

19. Sample Surveys

- George Gallup (1901–1984), Wikipedia
- American pioneer of survey sampling techniques

*famous polls* - Literary Digest
- Roosevelt and Landon, 1936
- the poll that changed polling
- Truman and Dewey, 1948
- bias
- polling (zip html)
- simple random sampling
- stratified sampling
- cluster sampling
- multistage sampling
- reading: OpenIntro Statistics, four sampling methods, pp.20-23
- random selection and random assignment (zip html)

20. Chance Errors in Sampling

- assume simple random sampling from a binary population (0s and 1s)
- box model for a simple random sample
- contents of the box are known
- *pi* = proportion of 1s in the box
- what can be said of a simple random sample from the box?
- expected value of the sample proportion: *E[sample proportion] = pi*
- *SD[box] = sqrt(pi * (1 - pi))*
- *SE[count] = sqrt(n) * SD[box]*
- *SE[proportion] = SD[box]/sqrt(n)*
- the proportion of 1s in the sample is likely to be *E[proportion]*, give or take *SE[proportion]* or so

*probabilities* - what is the probability that the sample proportion falls in the interval *(a, b)*?
- convert *(a, b)* to standardized coordinates using *E[proportion]* and *SE[proportion]*
- the probability is the area under the standard normal curve and above the standardized interval
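
A sketch of the chance calculation for a sample proportion, using made-up values of *pi* and *n*:

```python
import math
from statistics import NormalDist

pi = 0.6   # proportion of 1s in the box (known here, made up for illustration)
n = 400    # number of draws

e_prop = pi                                         # E[proportion] = pi
se_prop = math.sqrt(pi * (1 - pi)) / math.sqrt(n)   # SE[proportion] = SD[box]/sqrt(n)

# chance that the sample proportion falls in (0.55, 0.65):
Z = NormalDist()
a, b = 0.55, 0.65
za, zb = (a - e_prop) / se_prop, (b - e_prop) / se_prop   # standardized coordinates
prob = Z.cdf(zb) - Z.cdf(za)                              # area under the normal curve
print(round(se_prop, 4), round(prob, 2))   # SE about 0.0245, probability about 0.96
```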

21. The Accuracy of Percentages

- box model
- box with *N* tickets of 0s and 1s
- *pi* is the (unknown) population proportion of 1s in the box

*point estimate of pi* - draw a sample of size *n* from the box
- *p.hat* is the proportion of 1s in the sample
- *p.hat* is the point estimate of *pi*

*SE of the sample proportion* - SD of the box is unknown
- use the bootstrap to estimate the SD of the box
- substitute sample statistics for the unknown population parameters
- SD of the box is approximately *sqrt(p.hat * (1 - p.hat))*
- SE of the sample sum is *sqrt(n) * SD of the box*
- SE of the sample proportion is *SD of the box / sqrt(n)*

*confidence interval for the population proportion pi* - *CI = point estimate plus or minus 2 * SE*
- for a confidence level of about 95%
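
The bootstrap SE and the ~95% confidence interval, sketched with made-up sample counts:

```python
import math

# made-up sample: 620 ones among n = 1000 draws from a 0-1 box
n, ones = 1000, 620
p_hat = ones / n                          # point estimate of pi
sd_box = math.sqrt(p_hat * (1 - p_hat))   # bootstrap estimate of the SD of the box
se = sd_box / math.sqrt(n)                # SE of the sample proportion
ci = (p_hat - 2 * se, p_hat + 2 * se)     # confidence level of about 95%
print(round(se, 4), round(ci[0], 3), round(ci[1], 3))   # 0.0154 0.589 0.651
```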

- web app: Explore Coverage of Confidence Intervals
- the randomness is in the sampling procedure, not in the population parameter

*Gallup poll* - compare Gallup errors in presidential election polling to the SE of simple random samples
- Gallup errors are larger
- Gallup does not use simple random samples

23. The Accuracy of Averages

- *E[average of draws] = average of the box*
- the box is unknown; estimate the average of the box with the average of the draws
- *SE[average of draws] = SE[sum of the draws]/number of draws*
- *SE[sum of the draws] = sqrt(n) * SD[box]*, but the box is unknown
- approximate SD[box] with SD[draws]
- *SE[sum of the draws] = sqrt(n) * SD[draws] = sqrt(n) * s*
- *SE[average of draws] = SD[draws]/sqrt(n) = s/sqrt(n)*
- the average of the draws will be about *E[average of draws]*, give or take *SE[average of draws]* or so
- normal approximation for the average of the draws
- multiplying the number of draws *n* by a factor divides *SE[average of draws]* by *sqrt(factor)*
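
The SE of the average, sketched with a made-up sample of draws:

```python
import math

draws = [4, 7, 5, 6, 8, 5, 6, 7, 4, 8]   # made-up draws from an unknown box
n = len(draws)

avg = sum(draws) / n                                    # estimates the average of the box
s = math.sqrt(sum((d - avg) ** 2 for d in draws) / n)   # SD[draws] approximates SD[box]
se_avg = s / math.sqrt(n)                               # SE[average] = s/sqrt(n)
print(avg, round(se_avg, 3))   # 6.0, about 0.447

# quadrupling the number of draws halves the SE of the average
print(round((s / math.sqrt(4 * n)) / se_avg, 2))   # 0.5
```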

*inference* - using the sample to infer properties of the box

*which SE?*

*quantitative data:*
- *SE[sum] = sqrt(n) * SD of the box*
- *SE[average] = SE[sum]/n = SD[box]/sqrt(n)*

*categorical data:*
- *SE[count] = SE[sum] from a 0-1 box*
- *SE[proportion] = SE[count]/n = SD[box]/sqrt(n)*

*SE[sum]* is basic, and the other formulas follow from it

*examples* - roll a die 10 times
*quantitative data:* - how many dots?
- average number of dots?
*categorical data:* - how many 2s?
- proportion of 2s?

24. A Model for Measurement Error

25. Chance Models in Genetics

26. Tests of Significance

- Jerzy Neyman (1894–1981), Wikipedia
- randomized experiments, stratified sampling, confidence interval, hypothesis testing
- influential professor of statistics at UCB for many years

- test whether a parameter is equal to a specific value, or not (p.486)
- null hypothesis
- alternative hypothesis
- test statistic: *z = (observed - expected)/SE*
- p-value
- p-value measures the strength of evidence against the null hypothesis
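
A sketch of the z-test; the numbers are made up for illustration:

```python
import math
from statistics import NormalDist

# null hypothesis: the box average is 3.5 (a fair die)
# made-up data: 100 rolls averaging 3.9, with SD of the draws 1.7
observed, expected = 3.9, 3.5
se = 1.7 / math.sqrt(100)                      # SE of the average
z = (observed - expected) / se                 # z = (observed - expected)/SE
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
print(round(z, 2), round(p_value, 3))          # z about 2.35, p about 0.019
```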

*misinterpretation* - the p-value is NOT the probability that the null hypothesis is wrong
- the null hypothesis is a statement about the box
- the null hypothesis is either right or wrong
- there is no probability involved

*one sample t-test*

- web app: interactive Student's t distribution

27. More Tests for Averages

- quantitative data
- z-test for the difference of two averages
- categorical data
- z-test for the difference of two proportions
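
For two independent simple random samples, the SE of the difference is sqrt(SE1^2 + SE2^2); a sketch of the two-proportion z-test with made-up samples:

```python
import math

# made-up data: two independent simple random samples
n1, p1 = 500, 0.46   # sample 1: size and sample proportion
n2, p2 = 400, 0.40   # sample 2

se1 = math.sqrt(p1 * (1 - p1)) / math.sqrt(n1)   # SE of each sample proportion
se2 = math.sqrt(p2 * (1 - p2)) / math.sqrt(n2)
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)         # SE of the difference (independent samples)
z = (p1 - p2) / se_diff                          # z-test for the difference
print(round(z, 2))   # about 1.81
```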

*randomized controlled experiments* - comparing outcomes in treatment and control groups

- web app: Inference for Comparing Two Population Proportions
- web app: Inference for Comparing Two Population Means

28. The Chi-Square Test

- test for goodness of fit
- test of independence of two categorical variables
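
The goodness-of-fit statistic is the sum of (observed - expected)^2 / expected over the categories; a sketch with made-up die-roll counts:

```python
# made-up counts from 120 rolls of a die
observed = [18, 22, 21, 19, 24, 16]
expected = [120 / 6] * 6   # 20 of each face if the die is fair

# chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_sq)   # 2.1, compared against a chi-square distribution with 6 - 1 = 5 df
```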

- web app: Analyzing the Association between Two Categorical Variables
- web app: Chi-Squared Test

29. A Closer Look at Tests of Significance