MATH 3339 – 01 (11911) – Statistics for the Sciences
Homework 2
Summer 2022
Directions: Due June 20 at 11:59 pm. Upload in Blackboard. Show your work, not just the
answers. You may type your work or write it out and scan it into Blackboard. The textbook
can be found in CASA (https://www.casa.uh.edu); you will need an access code from the
bookstore.
Problem 1. Textbook section 3.3.1
a) Problem 2
b) Problem 3
c) Problem 4
d) Problem 5
Problem 2. Textbook section 3.3.1
a) Problem 7
b) Problem 8
Problem 3. Textbook section 3.5.1, problem 2
Problem 4. Textbook section 3.5.1
a) Problem 4
b) Problem 5
Problem 5. Textbook section 3.5.1, problem 6
Problem 6. Textbook section 3.7.1, problem 1
Problem 7. Textbook section 3.7.1, problem 4. The variable names are different:
• AREA = Region
• INC = Income
• POL = Politics
Use Region, Income and Politics instead.
Problem 8. Given the following probabilities:
P(A) = 0.5, P(B) = 0.6, P(C) = 0.6,
P(A ∩ B) = 0.3, P(A ∩ C) = 0.2, P(B ∩ C) = 0.3,
P(A ∩ B ∩ C) = 0.1.
Determine the following probabilities.
a) P(A ∪ B)
b) P(A ∩ Bᶜ)
c) P(Aᶜ ∩ Bᶜ ∩ Cᶜ)
d) P(A | B)
e) P(B | C)
f) P(A | Bᶜ)
g) P(Bᶜ | A ∩ C)
Problem 9. Let A and B be two events. Suppose that P (A) = 0.4, P (B) = p, and P (A ∪ B) =
0.8.
a) For what value of p will A and B be mutually exclusive?
b) For what value of p will A and B be independent?
Problem 10. From Statistics and Data Analysis from Elementary to Intermediate by Tamhane
and Dunlop, pg 67. The accuracy of a medical diagnostic test, in which a positive result indicates
the presence of a disease, is often stated in terms of its sensitivity, the proportion of diseased people
that test positive, or P(+|Disease), and its specificity, the proportion of people without the disease
who test negative or P (−|No Disease). Suppose that 10% of the population has the disease (called
the prevalence rate). A diagnostic test for the disease has 99% sensitivity and 98% specificity.
Therefore,
P (+|Disease) = 0.99, P (−|No Disease) = 0.98,
P (−|Disease) = 0.01, P (+|No Disease) = 0.02
a) A person’s test result is positive. What is the probability that the person actually has the
disease?
b) A person’s test result is negative. What is the probability that the person actually does not
have the disease? Considering this result and the result from (a), would you say that this
diagnostic test is reliable? Why or why not?
c) Now suppose that the disease is rare, with a prevalence rate of 0.1%. Using the same diagnostic test, what is the probability that a person who tests positive actually has the disease?
d) The results from (a) and (c) are based on the same diagnostic test applied to populations with
very different prevalence rates. Does this suggest any reason why mass screening programs
should not be recommended for a rare disease? Justify your answer.
Statistics for the Sciences
Charles Peters
Contents
1 Background
   1.1 Overview and Basic Concepts
   1.2 Populations, Samples and Variables
   1.3 Types of Variables
   1.4 Distributions
   1.5 Random Experiments
   1.6 Sample Spaces
   1.7 Computing in Statistics
   1.8 Data Sources
   1.9 Exercises

2 Descriptive and Graphical Statistics
   2.1 Location Measures
      2.1.1 The Mean
      2.1.2 Repeated Values
      2.1.3 The Median
      2.1.4 Other Quantiles
      2.1.5 Trimmed Means
      2.1.6 Robustness
      2.1.7 The Five Number Summary
      2.1.8 Exercises
   2.2 Grouped Data, Histograms, and Cumulative Frequency Diagrams
      2.2.1 Frequency Tables
      2.2.2 Histograms
      2.2.3 Cumulative Frequency Diagrams
      2.2.4 Exercises
   2.3 Measures of Variability or Scale
      2.3.1 The Variance and Standard Deviation
      2.3.2 The Mean and Median Absolute Deviation
      2.3.3 The Interquartile Range
      2.3.4 Exercises
   2.4 Boxplots
      2.4.1 Exercises
   2.5 Factor Variables and Barplots
      2.5.1 Tabulated Factor Variables
      2.5.2 Exercises
   2.6 Jointly Distributed Variables
      2.6.1 Two Factor Variables
      2.6.2 One Factor and One Numeric Variable
      2.6.3 Two Numeric Variables
      2.6.4 Exercises

3 Probability
   3.1 Background
   3.2 Equally Likely Outcomes
   3.3 Combinations of Events
      3.3.1 Exercises
   3.4 Rules for Probability Measures
   3.5 Counting Outcomes. Sampling with and without Replacement
      3.5.1 Exercises
   3.6 Conditional Probability
      3.6.1 Relating Conditional and Unconditional Probabilities
      3.6.2 Bayes’ Rule
   3.7 Independent Events
      3.7.1 Exercises
   3.8 Replications of a Random Experiment

4 Discrete Distributions
   4.1 Random Variables
   4.2 Discrete Random Variables
   4.3 Expected Values of Discrete Variables
      4.3.1 Exercises
   4.4 Bernoulli Random Variables
      4.4.1 The Mean and Variance of a Bernoulli Variable
   4.5 Binomial Random Variables
      4.5.1 The Mean and Variance of a Binomial Distribution
   4.6 Cumulative Distributions
      4.6.1 Exercises
   4.7 Hypergeometric Distributions
      4.7.1 The Mean and Variance of a Hypergeometric Distribution
   4.8 Poisson Distributions
      4.8.1 The Mean and Variance of a Poisson Distribution
      4.8.2 The Poisson Approximation to the Binomial Distribution
      4.8.3 Exercises
   4.9 Jointly Distributed Variables
      4.9.1 Covariance and Correlation
   4.10 Multinomial Distributions
      4.10.1 Exercises

5 Continuous Distributions
   5.1 Density Functions
   5.2 Expected Values and Quantiles for Continuous Distributions
      5.2.1 Expected Values
      5.2.2 Quantiles
      5.2.3 Exercises
   5.3 Uniform Distributions
      5.3.1 The Mean, Variance and Quantile Function of a Uniform Distribution
      5.3.2 Exercises
   5.4 Exponential Distributions and Their Relatives
      5.4.1 Exponential Distributions
      5.4.2 Gamma Distributions
      5.4.3 Weibull Distributions
      5.4.4 Exercises
   5.5 Normal Distributions
      5.5.1 Tables of the Standard Normal Distribution
      5.5.2 Other Normal Distributions
      5.5.3 The Normal Approximation to the Binomial Distribution
      5.5.4 Exercises

6 Joint Distributions and Sampling Distributions
   6.1 Introduction
   6.2 Jointly Distributed Continuous Variables
      6.2.1 Covariance and Correlation
      6.2.2 Bivariate Normal Distributions
   6.3 Mixed Joint Distributions
   6.4 Independent Random Variables
      6.4.1 Exercises
   6.5 Sums of Random Variables
      6.5.1 Simulating Random Samples
   6.6 The Central Limit Theorem
      6.6.1 Exercises
   6.7 Other Distributions Associated with Normal Sampling
      6.7.1 Chi Square Distributions
      6.7.2 Student t Distributions
      6.7.3 The Joint Distribution of the Sample Mean and Variance
      6.7.4 Exercises

7 Statistical Inference for a Single Population
   7.1 Introduction
   7.2 Estimation of Parameters
      7.2.1 Estimators
      7.2.2 Desirable Properties of Estimators
   7.3 Estimating a Population Mean
      7.3.1 Finding the Required Sample Size
      7.3.2 Confidence Intervals
      7.3.3 Small Sample Confidence Intervals for a Normal Mean
      7.3.4 Exercises
   7.4 Estimating a Population Proportion
      7.4.1 Choosing the Sample Size
      7.4.2 Confidence Intervals for p
      7.4.3 Exercises
   7.5 Estimating Quantiles
      7.5.1 Exercises
   7.6 Estimating the Variance and Standard Deviation
      7.6.1 Exercises

8 Hypothesis Testing
   8.1 Introduction
   8.2 Test Statistics – Type 1 and Type 2 Errors
   8.3 Hypotheses About a Population Mean
      8.3.1 Large sample tests for the mean when the variance is unknown
      8.3.2 Student t Tests for Small Samples
   8.4 p-values
      8.4.1 Using R’s t.test function
      8.4.2 Exercises
   8.5 Hypotheses About a Population Proportion
      8.5.1 Exercises

9 Regression and Correlation
   9.1 Examples of Linear Regression Problems
   9.2 Least Squares Estimates
      9.2.1 The “lm” Function in R
      9.2.2 Exercises
   9.3 Distributions of the Least Squares Estimators
      9.3.1 Exercises
   9.4 Inference for the Regression Parameters
      9.4.1 Confidence Intervals for the Parameters
      9.4.2 Hypothesis Tests for the Parameters
      9.4.3 Exercises
   9.5 Correlation
      9.5.1 Confidence intervals for ρ
      9.5.2 Exercises

10 Inferences on Two Groups or Populations
   10.1 Large Sample Comparison of Means
   10.2 Large Sample Comparison of Proportions
   10.3 Testing Equality of Population Proportions
      10.3.1 Comparing Proportions with R
   10.4 Small Sample Comparison of Normal Means
      10.4.1 The Welch Test and Confidence Interval
      10.4.2 The t-test with Equal Variances
      10.4.3 Exercises
   10.5 Paired Observations
      10.5.1 Crossover Studies
      10.5.2 Estimating the Size of the Effect
      10.5.3 Exercises

11 Analysis of Variance
   11.1 Single Factor Analysis of Variance
      11.1.1 Mathematical Description of One Way Anova
      11.1.2 Anova Using R
      11.1.3 Multiple Comparisons
      11.1.4 Exercises
   11.2 Two-Way Analysis of Variance
      11.2.1 Two-way ANOVA with Replications
      11.2.2 The Additive Model
      11.2.3 Exercises

12 Analysis of Categorical Data
   12.1 Multinomial Distributions
      12.1.1 Estimators and Hypothesis Tests for the Parameters
      12.1.2 Multinomial Probabilities That Are Functions of Other Parameters
      12.1.3 Exercises
   12.2 Testing Equality of Multinomial Probabilities
   12.3 Independence of Attributes: Contingency Tables
      12.3.1 Exercises

13 Miscellaneous Topics
   13.1 Multiple Linear Regression
      13.1.1 Inferences Based on Normality
      13.1.2 Using R’s “lm” Function for Multiple Regression
      13.1.3 Factor Variables as Predictors
      13.1.4 Exercises
   13.2 Nonparametric Methods
      13.2.1 The Signed Rank Test
      13.2.2 The Mean and Variance of V and V+
      13.2.3 Confidence Intervals for the Location Parameter ∆
      13.2.4 Exercises
      13.2.5 The Wilcoxon Rank Sum Test
      13.2.6 Estimating the Shift Parameter
      13.2.7 Exercises
   13.3 Bootstrap Confidence Intervals
      13.3.1 Exercises
Chapter 1
Background
1.1 Overview and Basic Concepts
Statistics is the art of summarizing data, depicting data, extracting information from data,
and inferring properties of data sources. Statistics and the theory of probability are often
conflated in popular discussion. They are distinct subjects, although statistics depends on
probability to quantify the strength of its inferences. Only a small amount of probability is
required for this book. It will be developed in Chapter 3 and throughout the text as needed.
We begin by introducing some basic ideas and terminology.
1.2 Populations, Samples and Variables
A population is a set of individual elements whose collective properties are the subject of investigation. Usually, populations are large collections whose individual members cannot all
be examined in detail. In statistical inference a manageable subset of the population is selected according to certain sampling procedures and properties of the subset are generalized
to the entire population. These generalizations are accompanied by statements quantifying their accuracy and reliability. The selected subset is called a sample from the population.
Examples:
(a) the population of registered voters in a congressional district,
(b) the population of U.S. adult males,
(c) the population of currently enrolled students at a certain large urban university,
(d) the population of all transactions in the U.S. stock market for the past month,
(e) the population of all peak temperatures at points on the Earth’s surface over a given
time interval.
Some samples from these populations might be:
(a) the voters contacted in a pre-election telephone poll,
(b) adult males interviewed by a TV reporter,
(c) the dean’s list,
(d) transactions recorded on the books of Morgan Stanley,
(e) peak temperatures recorded at several weather stations.
Clearly, for these particular samples, some generalizations from sample to population would
be highly questionable. With the possible exception of (e), none of these subsets would
be accepted as scientifically valid samples allowing generalization to the entire population.
Researchers have devised a number of procedures for obtaining samples that allow generalization with a specified degree of confidence. The names of some common procedures are
simple random sampling, cluster sampling, and stratified sampling. We will focus for the
most part on procedures based on simple random sampling. Simple random sampling will
be described in Chapters 3 and 4.
A population variable is an attribute that has a value for each individual in the population. In other words, it is a function from the population to some set of possible values. It
may be helpful to imagine a population as a spreadsheet with one row or record for each
individual member. Along the ith row, the values of a number of attributes of the ith individual are recorded in different columns. The column headings of the spreadsheet can be
thought of as the population variables. For example, if the population is the set of currently
enrolled students at the urban university, some of the variables are academic classification,
number of hours currently enrolled, total hours taken, grade point average, gender, ethnic
classification, major, and so on. Variables, such as these, that are defined for the same
population are said to be jointly observed or jointly distributed. General results pertaining
to jointly distributed variables are presented in Chapter 6.
1.3 Types of Variables
Variables are classified according to the kinds of values they have. The three basic types
are numeric variables, factor variables, and ordered factor variables. Numeric variables are
those for which arithmetic operations such as addition and subtraction make sense. Numeric variables are often related to a scale of measurement and expressed in units, such
as meters, seconds, or dollars. Factor variables are those whose values are mere names, to
which arithmetic operations do not apply. Factors usually have a small number of possible
values. These values might be designated by numbers. If they are, the numbers that represent distinct values are chosen merely for convenience. The values of factors might also
be letters, words, or pictorial symbols. Factor variables are also called nominal variables
or categorical variables. Ordered factor variables are factors whose values are ordered in
some natural and important way. Some textbooks have a more elaborate classification of
variables, with various subtypes. The three types above are enough for our purposes.
Examples: Consider the population of students currently enrolled at a large university.
Each student has a residency status, either resident or nonresident. Residency status is an
unordered factor variable. Academic classification is an ordered factor with values “freshman”, “sophomore”, “junior”, “senior”, “post-baccalaureate” and “graduate student”. The
number of hours enrolled is a numeric variable with integer values. The distance a student
Go to TOC
CHAPTER 1. BACKGROUND
8
travels from home to campus is a numeric variable expressed in miles or kilometers. Home
area code is an unordered factor variable whose values happen to be designated by numbers.
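As a quick sketch of how these types are represented in R (the variable names here are invented for illustration), the factor( ) function creates both unordered and ordered factors:

> status=factor(c("resident","nonresident","resident"))   # unordered factor
> class.level=factor(c("freshman","junior","senior"),
+   levels=c("freshman","sophomore","junior","senior"),ordered=TRUE)
> class.level < "senior"   # order comparisons make sense for ordered factors
[1]  TRUE  TRUE FALSE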
1.4 Distributions
Let X be the name of a population variable such as the distance from home to campus for
students at the large urban university. The values assumed by X for the individual members of the population have a distribution. By this we mean that if a particular subset of
possible values of X is given, there is a certain proportion of the members of the population
whose X-values belong to that subset. This assignment of proportions to sets of possible
values is called the distribution of X.
For very large populations or for variables with many distinct values, statisticians may
employ mathematical models of distributions. One of the goals of statistical inference is to
find the best mathematical model from a given class of models. In Chapter 4 we describe
some of the most useful discrete model distributions, including the binomial, hypergeometric, Poisson, and multinomial distributions. These are mainly for numeric variables that
have integer values or for factor variables. In Chapter 5 we discuss some of the continuous
model distributions: exponential, normal, gamma and others. These are mainly for numeric
variables recorded with high precision.
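As a small sketch (not part of the original text), the distribution of a variable with few distinct values can be tabulated directly in R; here we borrow the airquality data set that appears in the exercises below:

> # proportion of records at each value of Month
> prop.table(table(airquality$Month))
        5         6         7         8         9 
0.2026144 0.1960784 0.2026144 0.2026144 0.1960784 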
1.5 Random Experiments
An experiment can be something as simple as flipping a coin or as complex as conducting a
public opinion poll. A random experiment is one with the following two characteristics:
(1) The experiment can be replicated an indefinite number of times under essentially the
same experimental conditions.
(2) There is a degree of uncertainty in the outcome of the experiment. The outcome may
vary from replication to replication even though experimental conditions are the same.
When we say that an experiment can be replicated under the same conditions, we mean
that controllable or observable conditions that we think might affect the outcome are the
same. There may be hidden conditions that affect the outcome, but we cannot account for
them. Implicit in (1) is the idea that replications of a random experiment are independent,
that is, the outcomes of some replications do not affect the outcomes of others. Obviously, a
random experiment is an idealization of a real experiment. Some simple experiments, such
as tossing a coin, approach this ideal closely while more complicated experiments may not.
There are two broad categories of random experiments. Designed experiments are those
in which the researcher exercises control over some of the conditions that affect the outcome. Observational studies are experiments in which control is not possible, although
variables associated with the primary outcome may be observed. Experimental design is an
extensively studied aspect of statistics. We do not treat it here because it is impossible to do
it justice in a relatively short introductory textbook and because the design of experiments
can be left to more discipline-specific applied statistics courses.
Examples
(a) An engineer compares several brands of 9 volt batteries by placing samples of each
brand all under the same load and observing the times it takes each battery to decrease to
several pre-assigned voltage levels. This is a designed experiment because the engineer
controls the voltages at which she observes the times. The times observed for each battery
are the outcomes of the experiment. Presumably if the experiment were to be replicated
with different samples of the same brands, the observed times would be different. Hence,
the outcomes of different replications are uncertain.
(b) Two astronomers investigate the spectral class (basically the surface temperature) and
the absolute luminosity of a large sample of stars in an effort to discover a relationship between these variables. This is an observational study. The astronomers have no control over
any stellar variables. Another astronomer observing a different sample of stars would observe different temperatures and luminosities. (N.B.This is an actual experiment carried out
by Ejnar Hertzsprung and Henry Norris Russell in 1910. It resulted in a famous scatterplot
known as the H-R diagram. We will encounter scatterplots in Chapter 2.)
1.6 Sample Spaces
The sample space of a random experiment is the set of all its possible outcomes. We use
the Greek capital letter Ω (omega) to denote the sample space. There may be some degree
of arbitrariness in the description of Ω depending on how the outcomes of the experiment
are represented symbolically.
Examples:
(a) Toss a coin. Ω = {H, T }, where “H” denotes a head and “T” a tail. Another way
of representing the outcome is to let the number 1 denote a head and 0 a tail (or vice-versa).
If we do this, then Ω = {0, 1}. In the latter representation the outcome of the experiment
is just the number of heads.
(b) Toss a coin 5 times, i.e., replicate the experiment in (a) 5 times. An outcome of this
experiment is a 5 term sequence of heads and tails. A typical outcome might be indicated
by (H,T,T,H,H), or by (1,0,0,1,1). Even for this little experiment it is cumbersome to list
all the outcomes, so we use a shorter notation
Ω = {(x1, x2, x3, x4, x5) | xi = 0 or xi = 1 for each i}.
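As an illustrative sketch, R's expand.grid( ) function can enumerate this sample space explicitly:

> omega=expand.grid(x1=0:1,x2=0:1,x3=0:1,x4=0:1,x5=0:1)
> nrow(omega)   # 2^5 = 32 possible outcomes
[1] 32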
(c) Select a student randomly from the population of all currently enrolled students. The
sample space is the same as the population. Here, the word “randomly” is vague. We will
give it a precise definition later.
(d) Repeat the Michelson-Morley experiment to measure the speed of the Earth relative to
the ether (which doesn’t exist, as we now know). The outcome of the experiment could conceivably be any nonnegative number, so we take Ω = [0, ∞) = {x | x is a real number and x ≥ 0}.
Uncertainty arises from the fact that this is a very delicate experiment with several sources
of unpredictable error.
(e) Randomly select a 5 card draw poker hand from the standard deck of 52 cards. The
sample space is the collection of all subsets of size 5 of the set of 52 cards in a standard deck.
(f) Replicate the Hertzsprung-Russell experiment with a sample of n stars. The sample space
is the set of all n-term sequences of pairs of positive numbers (xi , yi ), i = 1, · · · , n, where
xi is the surface temperature of the ith star and yi is its luminosity. xi and yi presumably
could be any positive numbers.
1.7 Computing in Statistics
Even moderately large data sets cannot be managed effectively without a computer and
computer software. Furthermore, much of applied statistics is exploratory in nature and
cannot be carried out by hand, even with a calculator. Spreadsheet programs, such as
Microsoft Excel, are designed to manipulate data in tabular form and have functions for
performing the common tasks of statistics. In addition, many add-ins are available, some of
them free, for enhancing the graphical and statistical capabilities of spreadsheet programs.
Because it is so common in the business world, it is very beneficial for students to have some
experience with Excel or a similar program.
The disadvantages of spreadsheet programs are their dependence on the spreadsheet data
format with cell ranges as input for statistical functions, their lack of flexibility, and their
relatively poor graphics. Many highly sophisticated packages for statistics and data analysis are available. Some of the best known commercial packages are Minitab, SAS, SPSS,
Splus, Stata, and Systat. The package used in this text is called R. It is an open source
implementation of the same language used in Splus and it may be downloaded free at
http://www.r-project.org .
Since its first release in 2000 R has grown explosively, both in capabilities and in world-wide
popularity. It is now one of the most widely used packages for science and engineering applications. For the foreseeable future it will be one of the top data analysis systems available.
After downloading and installing R we recommend that you download and install another
free package called Rstudio. It can be obtained from
http://www.rstudio.com .
Rstudio makes importing data into R much easier and makes it easier to integrate R output
with other programs and applications.
Our approach to teaching R is to teach by example. Detailed instructions on using R
and Rstudio for some of the exercises will be provided. R solutions for many of the exercises
may readily be obtained by emulating examples worked out in the text. There are many
instructional videos for R on Youtube. There are also abundant internet resources which you
will find if you Google ”R”. Two of the most useful are the manual provided by the R project
https://cran.r-project.org/doc/manuals/R-intro.html
and the simpleR documentation by John Verzani at
https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
1.8 Data Sources
Data files used in this course are from four sources. Some are local in origin and come from
student or course data at the University of Houston. Others are simulated but made to look
as realistic as possible. These and others are available at
http://www.math.uh.edu/~charles/data .
Many data sets are included with R in the datasets library and other contributed packages.
We will refer to them frequently. The main external sources of data are the data archives
maintained by the Journal of Statistics Education.
www.amstat.org/publications/jse
and the Statistical Science Web:
http://www.statsci.org/datasets.html.
1.9 Exercises
1. Go to http://www.math.uh.edu/~charles/data. Examine the data set “Air Pollution Filter Noise”. Identify the variables and give their types.
2. Highlight the data in Air Pollution Filter Noise. Include the column headings but not
the language preceding the column headings. Copy and paste the data into a plain text file,
for example with Notepad in Windows. Import the text file into Excel or another spreadsheet
program. Create a new folder or directory named “math3339” and save both files there.
3. Start R by double clicking on the big blue R icon on your desktop. Click on the file menu
at the top of the R Gui window. Select “change dir . . . ” . In the window that opens next,
find the name of the directory where you saved the text file and double click on the name
of that directory. Suppose that you named your file “apfilternoise”. (Name it anything you
like.) Import the file into R with the command
> apfilternoise=read.table("apfilternoise.txt",header=T)
and display it with the command
> apfilternoise
Click on the file menu at the top again and select “Exit”. At the prompt to save your
workspace, click “Yes”. If you open the folder where your work was saved you will see another big blue R icon. If you double click on it, R will start again and your previously saved
workspace will be restored.
If you use Rstudio for this exercise you can import apfilternoise into R by clicking on the
”Import Dataset” tab. This will open a window on your file system and allow you to select
the file you saved in Exercise 2. The dialog box allows you to rename the data and make
other minor changes before importing the data as a data frame in R.
4. If you are using Rstudio, click on the ”Packages” tab and then the word ”datasets”. Find
the data set ”airquality” and click on it. Read about it. If you are using R alone, type
> help(airquality)
at the command prompt > in the Console window.
Then type
> airquality
to view the data. Could ”Month” and ”Day” be considered ordered factors rather than numeric variables?
5. A random experiment consists of throwing a standard 6-sided die and noting the number
of spots on the upper face. Describe the sample space of this experiment.
6. An experiment consists of replicating the experiment in exercise 5 four times. Describe
the sample space of this experiment. How many possible outcomes does this experiment
have?
7. A random experiment consists of tossing a coin 4 times. Describe the sample space of
this experiment. In what proportion of all outcomes of the experiment will there be exactly
2 heads?
8. The airquality data set has 153 rows, one for each day in May through September of
1973. One of the variables is named ”Wind”, for wind speed. We will calculate some values
of the distribution of Wind using R. Suppose we are interested in the proportion of days
for which the wind speed was greater than 12. Attach the airquality data frame to your R
workspace with the command
> attach(airquality)
This allows you to address the variable Wind in R without going through bothersome intermediate steps. The number of days for which Wind was greater than 12 is given by
> sum(Wind > 12)
and the proportion is
> sum(Wind > 12)/153
Find the proportion of days for which Wind is less than or equal to 10. (Hint: Less than or
equal in R is denoted by <=.)

Chapter 2

Descriptive and Graphical Statistics

2.1 Location Measures

2.1.1 The Mean

Let x be a numeric variable with values x1, x2, . . . , xn. The mean, or average, of x, written x̄ or µ(x), is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{2.1}$$

The mean is a location measure: if a and b > 0 are constants and y = a + bx, then ȳ = a + bx̄. Other location measures
introduced below behave in the same way.
2.1.2 Repeated Values

When there are repeated values of x, there is an equivalent formula for the mean. Let the m distinct values of x be denoted by v1, . . . , vm. Let ni be the number of times vi is repeated and let fi = ni/n. Note that n1 + · · · + nm = n and f1 + · · · + fm = 1. Then the average is given by

$$\bar{x} = \sum_{i=1}^{m} f_i v_i = \frac{1}{n}\sum_{i=1}^{m} n_i v_i. \tag{2.2}$$

The number ni is the frequency of the value vi and fi is its relative frequency.
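A short sketch of formula (2.2) in R, with a few made-up values; the table( ) function counts the frequencies ni:

> x=c(2,2,3,5,5,5)
> tb=table(x)
> v=as.numeric(names(tb))   # distinct values vi
> ni=as.numeric(tb)         # frequencies ni
> sum(ni*v)/length(x)       # agrees with mean(x)
[1] 3.666667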
2.1.3 The Median

Let x be a numeric variable with values x1, x2, . . . , xn. Arrange the values in increasing order x(1) ≤ x(2) ≤ · · · ≤ x(n). The median of x is a number median(x) such that at least half the values of x are ≤ median(x) and at least half the values of x are ≥ median(x). This is the essential idea, but unfortunately there may be an interval of numbers that satisfy this definition rather than a single number. The ambiguity is usually resolved by taking the median to be the midpoint of that interval. Thus, if n is odd, n = 2k + 1, where k is a positive integer, and

$$\operatorname{median}(x) = x_{(k+1)}, \tag{2.3}$$

while if n is even, n = 2k, and

$$\operatorname{median}(x) = \frac{x_{(k)} + x_{(k+1)}}{2}. \tag{2.4}$$

2.1.4 Other Quantiles
Let p ∈ (0, 1) be a number between 0 and 1. The pth quantile of x is more commonly known
as the 100pth percentile; e.g., the 0.8 quantile is the same as the 80th percentile. We define
it as a number q(x, p) such that the fraction of values of x that are ≤ q(x, p) is at least p
and the fraction of values of x that are ≥ q(x, p) is at least 1 − p. For example, at least 80
percent of the values of x are ≤ the 80th percentile of x and at least 20 percent of the values
of x are ≥ its 80th percentile. Again, this may not define a unique number q(x, p). Software
packages such as R have rules for resolving the ambiguity, but the details are usually not
important. If you are doing an exercise by hand, any number that satisfies the criteria for
q(x, p) is an acceptable answer. If you are using R you may accept the answer returned by R.
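For instance, R's quantile( ) function chooses among nine such rules through an optional type argument; the default, type 7, interpolates between order statistics, while type 1 always returns one of the data values and meets the counting criteria above. A sketch with made-up values:

> x=c(1,1,2,3,4,5,6,9)
> quantile(x,0.80)          # default interpolating rule (type 7)
80% 
5.6 
> quantile(x,0.80,type=1)   # an order statistic satisfying the definition
80% 
  6 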
The median is the 50th percentile, i.e., the 0.5 quantile q(x, 0.50). The 25th and 75th percentiles q(x, 0.25) and q(x, 0.75) are called the first and third quartiles. The 10th , 20th , 30th ,
etc. percentiles are called the deciles. All quantiles including the median are location measures as defined above: if y = a + bx, where a and b > 0 are constants, then q(y, p) =
a + bq(x, p).
2.1.5 Trimmed Means
Trimmed means of a variable x are obtained by finding the mean of the values of x excluding
a given percentage of the largest and smallest values. For example, the 5% trimmed mean
is the mean of the values of x excluding the largest 5% of the values and the smallest 5%
of the values. In other words, it is the mean of all the values between the 5th and 95th
percentiles of x. A trimmed mean is a location measure.
2.1.6 Robustness
A robust measure of location is one that is not affected by a few extremely large or extremely
small values. Values of a numeric variable that lie a great distance from most of the other
values are called outliers. Outliers might be the result of mistakes in measuring or recording
data, perhaps from misplacing a decimal point. The mean is not a robust location measure.
It can be affected significantly by a single extreme outlier if that outlying value is extreme
enough. Thus, if there is any doubt about the quality of the data, the median or a trimmed
mean might be preferred to the mean as a reliable location measure. The median is very
insensitive to outliers. A 5% trimmed mean is insensitive to outliers that make up no more
than 5% of the data values.
Example 2.1. “mydata” is a numeric vector of 21 made-up values.

> mydata
 [1]  1  5  5  6  7  7  8 12 12 15 15 18 22 22 23 24 28 29 35 36 53
You can enter the data into your own R workspace with R’s ”scan” function, as follows.
> mydata=scan( )
1: 1 5 5 6 7 7 8 12 12 15
11: 15 18 22 22 23 24 28 29 35 36 53
22:
Read 21 items
When you call ”scan( )”, R will respond with the prompt 1:, meaning that it expects you
to enter the first data value. After entering the first data value, enter as many as you like,
separated by blank spaces. You can hit the enter key at any time to start a new line of
input. This is convenient if there are a lot of data values to be entered. After entering the
last data value hit the enter key twice to signal the end of the input. To check for errors,
simply type the name of the data object
> mydata
 [1]  1  5  5  6  7  7  8 12 12 15 15 18 22 22 23 24 28 29 35 36 53
To find the mean of mydata, add the values together and divide by 21. Do this with your
calculator or by hand. Compare the answer to that given by R.
> mean(mydata)
[1] 18.2381
Now let us find the median of mydata. Since n = 21 = 2 × 10 + 1 is odd, we take the 11th
value in increasing order as the median.
median(mydata) = mydata(11) = 15.
This is confirmed by R:
> median(mydata)
[1] 15
Next, we will find the 25th percentile q(mydata, 0.25). According to the definition, we are
looking for a number q such that at least 25% of the data values are ≤ q and at least 75%
are ≥ q. At least 25% means at least 5.25, but since we must have a whole number, at least
6 values must be ≤ q. Likewise, at least 15.75 or at least 16 values must be ≥ q. The unique
number satisfying these criteria is q = 7. Thus,
q(mydata, 0.25) = 7
This agrees with R.
> quantile(mydata,0.25)
25%
7
Notice that there are repeated values of mydata. The distinct values and their frequencies
can be obtained with R using the ”table” function.
> table(mydata)
mydata
1 5 6 7 8 12 15 18 22 23 24 28 29 35 36 53
1 2 1 2 1 2 2 1 2 1 1 1 1 1 1 1
The first row of this table shows the distinct values vi and the second row shows their frequencies ni. We can use formula (2.2) for the mean when values are repeated:

$$\overline{\text{mydata}} = \frac{1}{21}(1 \times 1 + 2 \times 5 + 1 \times 6 + 2 \times 7 + \cdots) = 18.2381.$$
Finally, let us calculate the 5% trimmed mean of mydata. Similar to the ambiguity in the
definition of the quantiles, there is some ambiguity in the definition of the trimmed mean.
Five percent of 21 is not a whole number, so we must round up or down and eliminate
that number of the largest and the smallest values to calculate the trimmed mean. R’s
convention is to round down. Thus, we eliminate the largest and also the smallest value of
mydata before calculating the mean. The 5% trimmed mean is
$$\frac{1}{19}(5 + 5 + 6 + 7 + 7 + \cdots + 29 + 35 + 36) = 17.31579.$$
The ”mean” function in R is also good for trimmed means. You simply have to tell R the
degree of trimming.
> mean(mydata,trim=.05)
[1] 17.31579
2.1.7 The Five Number Summary
The five number summary is a convenient way of summarizing numeric data. The five
numbers are the minimum value, the first quartile (25th percentile), the median, the third
quartile (75th percentile), and the maximum value. The R function is ”fivenum”.
> fivenum(mydata)
[1] 1 7 15 24 53
R has another, more useful function ”summary” which returns the five numbers and, in
addition, the mean.
> summary(mydata)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.00   15.00   18.24   24.00   53.00 
Example 2.2. The gap between the largest value of mydata, 53, and the next largest,
36 is much greater than the gap between any other two consecutive values. The largest
value might be considered an outlier. Suppose that experimenters decide to discard that
observation as a possible error. R has a neat trick for discarding one or more values. Since
53 is the 21st component of mydata, you can discard it as follows.
> mydata
 [1]  1  5  5  6  7  7  8 12 12 15 15 18 22 22 23 24 28 29 35 36 53
> mydata[-21]
 [1]  1  5  5  6  7  7  8 12 12 15 15 18 22 22 23 24 28 29 35 36
Now compare the summaries.
> summary(mydata)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.00   15.00   18.24   24.00   53.00 
> summary(mydata[-21])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.00   15.00   16.50   23.25   36.00 
Notice that the median has not changed with the elimination of the outlier, but the mean
has. This illustrates the greater robustness of the median as a location measure.
2.1.8 Exercises
The reacttimes data set has 50 observations of human reaction times to a physical stimulus.
The reaction times are named Times and arranged in increasing order below.
0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80 0.88 1.02 1.08 1.12 1.13 1.17 1.21 1.23 1.35
1.41 1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72 1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20
2.29 2.32 2.39 2.47 2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73
1. Find the mean and median of Times without using R. You may use your calculator.
2. Import reacttimes into your workspace as a data frame with a single variable Times.
Attach the reacttimes data frame to your workspace with
> attach(reacttimes)
Then calculate the mean of Times by using R’s mean( ) function and the median with R’s
median( ) function.
3. Find the 60th percentile of Times without using R. There may be more than one acceptable answer.
4. Find the 60th percentile of Times using R’s quantile( ) function.
5. Find the 40th percentiles of mydata and mydata[-21] by hand and also by using R.
6. Find the 5% trimmed mean of Times.
7. Find the five number summary of Times.
8. The 40th value Times(40) of the reaction time data is 2.32. Change it to 232.0 and
recalculate the mean and median. You can make the change in R by
> Times[40]=232.0
Change it back after you are finished with this exercise.
2.2 Grouped Data, Histograms, and Cumulative Frequency Diagrams

2.2.1 Frequency Tables
Large data sets are often summarized by grouping values. Let x be a numeric variable with
values x1 , x2 , . . . , xn . Choose numbers c0 < c1 < . . . < cm such that all the values of x are
between c0 and cm . For each i, let ni be the number of values of x (including repetitions)
that are in the interval (ci−1 , ci ], i.e., the number of indices j such that ci−1 < xj ≤ ci .
A frequency table of x is a table showing the class intervals (ci−1, ci] along with the frequencies ni with which the data values fall into each interval. Sometimes additional columns are included showing the relative frequencies fi = ni/n, the cumulative relative frequencies $F_i = \sum_{j \leq i} f_j$, and the midpoints of the class intervals.
Example 2.3. The reaction time data is repeated below.
0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80 0.88 1.02 1.08 1.12 1.13 1.17 1.21 1.23 1.35
1.41 1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72 1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20
2.29 2.32 2.39 2.47 2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73
We choose 5 class intervals of equal length 1 unit, beginning with c0 = 0 and ending with
c5 = 5.
Interval    Midpoint    ni    fi      Fi
(0,1]       0.5         11    0.22    0.22
(1,2]       1.5         22    0.44    0.66
(2,3]       2.5         11    0.22    0.88
(3,4]       3.5          4    0.08    0.96
(4,5]       4.5          2    0.04    1.00
With only a frequency table like the one above, the mean and median of the original data
cannot be calculated exactly. However, they can be estimated. If we take the midpoint of
an interval as a stand-in for all the values in that interval, then we can use the formula in
the preceding section for calculating a mean with repeated values. Thus, in the example
above, we would estimate the mean as
0.22(0.5) + 0.44(1.5) + 0.22(2.5) + 0.08(3.5) + 0.04(4.5) = 1.78.
Estimating the median is a bit more difficult. By examining the cumulative frequencies Fi ,
we see that 22% of the data is less than or equal to 1 and 66% of the data is less than or
equal to 2. Therefore, the median lies between 1 and 2. That is, it is 1 + a certain fraction of
the distance from 1 to 2. A reasonable guess at that fraction is given by linear interpolation
between the cumulative frequencies at 1 and 2. In other words, we estimate the median as
$$1 + \frac{0.50 - 0.22}{0.66 - 0.22}(2 - 1) = 1.636.$$
A cruder estimate of the median is just the midpoint of the interval that contains the
median, in this case 1.5. We leave it as an exercise to calculate the mean and median of the
reaction time data and to compare them to these approximations.
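As a sketch, the grouped-data estimates above are easy to reproduce in R from the midpoints and relative frequencies in the table:

> mid=c(0.5,1.5,2.5,3.5,4.5)
> f=c(0.22,0.44,0.22,0.08,0.04)
> sum(f*mid)                  # estimated mean
[1] 1.78
> 1+(0.50-0.22)/(0.66-0.22)   # interpolated median
[1] 1.636364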
2.2.2 Histograms
The figure below is a histogram of the reaction times.
> reacttimes=read.table("reacttimes.txt",header=T)
> hist(reacttimes$Times,breaks=0:5,xlab="Reaction Times")
[Figure: histogram titled “Histogram of reacttimes$Times”, with Frequency on the vertical axis and Reaction Times from 0 to 5 on the horizontal axis.]
The histogram is a graphical depiction of the grouped data. The end points ci of the class
intervals are shown on the horizontal axis. This is an absolute frequency histogram because
the heights of the vertical bars above the class intervals are the absolute frequencies ni . A
relative frequency histogram would show the relative frequencies fi . A density histogram
has bars whose heights are the relative frequencies divided by the lengths of the corresponding class intervals. Thus, in a density histogram the area of the bar is equal to the relative
frequency. If all class intervals have the same length, these types of histograms all have the
same shape and convey the same visual information.
No doubt you have noticed that the description of the class intervals in a frequency table
and a histogram was very vague. The number of intervals can affect the appearance of the
histogram significantly. Too many class intervals result in a ”spiky” histogram that may emphasize spurious, accidental groupings of data values too much. Too few class intervals may
obscure real features of the data distribution. The number of intervals is usually decided
after some experimentation. A number of guidelines have been proposed. The default suggestion in R is Sturges’ rule: m = 1 + log2 (n), rounded up to the nearest integer. However,
Sturges’ rule is only a suggestion that probably will be overridden by the need to produce
a histogram that is easily interpreted and pleasing to the eye. R routinely and intelligently
violates Sturges’ rule for just such reasons.
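R exposes Sturges' rule directly as the function nclass.Sturges( ); a brief sketch with simulated data:

> x=rnorm(50)         # 50 simulated values
> nclass.Sturges(x)   # 1 + log2(50) = 6.64..., rounded up
[1] 7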
In the example just above, the endpoints c0 = 0, c1 = 1, · · · , c5 = 5 of the class intervals were
given by the optional ”breaks” argument to the histogram function hist( ). If the ”breaks”
argument is omitted, R will choose its own class intervals.
> hist(reacttimes$Times,xlab="Reaction Times")
[Figure: histogram of reacttimes$Times with R’s default class intervals; Frequency vs. Reaction Times.]
2.2.3 Cumulative Frequency Diagrams
To construct a cumulative frequency diagram start with a frequency table of the grouped
data as in Example 2.3. Let us suppose there are m class intervals with endpoints c0 <
c1 < · · · < cm . Let F1 , F2 , · · · , Fm be the cumulative relative frequencies for the class
intervals. Plot the points (c0 , 0), (c1 , F1 ), · · · , (cm , Fm ) on a rectangular coordinate system
and connect adjacent points with straight line segments. The result will look similar to the
figure below.
> Fs=c(0, 0.22, 0.66, 0.88, 0.96, 1)
> plot(0:5,Fs,type="l",xlab=" ",ylab="Cumulative Relative Frequency")
> points(0:5,Fs)

[Figure: cumulative frequency diagram of the reaction times; Cumulative Relative Frequency plotted against the class endpoints 0 through 5.]
At any position a on the horizontal axis, the height of the curve at that point is the approximate proportion of data values that are less than or equal to a. The height is the exact
proportion at the end points c0 , · · · , cm of the class intervals. In the diagram above, the
height of the curve at a = 2.5 is about 0.77. Therefore, about 77% of the data is ≤ 2.5.
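The reading can also be done numerically. As a sketch, R's approxfun( ) builds the piecewise linear curve from the class endpoints and the cumulative relative frequencies:

> Fs=c(0, 0.22, 0.66, 0.88, 0.96, 1)
> cdf=approxfun(0:5,Fs)   # the piecewise linear curve
> cdf(2.5)                # proportion of data <= 2.5
[1] 0.77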
[Figure: the same cumulative frequency diagram, showing that the height of the curve at 2.5 is about 0.77.]
The diagram can be used in an inverse manner to find the approximate values of quantiles.
To approximate the pth quantile of the data, find the intersection of the curve with the horizontal line y = p. The horizontal coordinate of that point of intersection is the approximate
pth quantile. In our example, the horizontal line y = 0.60 intersects the curve at a point
with horizontal coordinate about 1.86. Therefore, the 60th percentile of the data is about
1.86.
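Continuing the sketch above, interpolating with the axes swapped gives approximate quantiles:

> quant=approxfun(Fs,0:5)   # the inverse reading of the diagram
> quant(0.60)               # approximate 60th percentile
[1] 1.863636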
[Figure: the cumulative frequency diagram again, showing that the horizontal line y = 0.60 crosses the curve at a point with horizontal coordinate about 1.86.]

2.2.4 Exercises
It is sometimes advantageous to transform data in some way, i.e., to define a new variable
y as a function of the old variable x. We might want to do this so that we can more easily
apply certain statistical inference procedures you will learn about later. A common transformation is the logarithmic transformation. The natural logarithms of the reaction times
are, to two places:
-2.12 -1.20 -1.05 -0.99 -0.82 -0.56 -0.49 -0.48 -0.34 -0.22 -0.13 0.02 0.08 0.11 0.12 0.16 0.19
0.21 0.30 0.34 0.35 0.35 0.38 0.40 0.42 0.43 0.47 0.48 0.52 0.54 0.62 0.64 0.65 0.73 0.74 0.77
0.78 0.79 0.83 0.84 0.87 0.90 0.96 1.05 1.23 1.23 1.33 1.38 1.51 1.55
1. Attach “reacttimes” to your R workspace and verify the data above with
> attach(reacttimes)
> log(Times)
Summarize the new variable.
> summary(log(Times))
2. Use 9 class intervals of equal length beginning with c0 = −2.5 and ending with c9 = 2.
Make a frequency table of log(Times) like the one in Example 2.3.
3. By hand, without using R, make a histogram of log(Times).
4. Use R to make a histogram of log(Times).
5. Estimate the mean and median of log(Times) from the grouped data. Compare to the
answers given in the summary.
6. By hand, make a cumulative frequency diagram of log(Times). With it, estimate the 40th
percentile of log(Times). Compare your answer to that returned by the quantile( ) function.
7. Import the data set www.math.uh.edu/~charles/data/FEV.txt into R as a data frame
named ”FEV”. The variable ”fev” is a set of 654 values of forced expiratory volume (a measure of lung capacity) for human subjects. With R make a histogram of fev. Allow R to
choose its own class intervals. You can suggest a number of class intervals to R by using
the ”breaks” argument to the histogram function hist( ), e.g.,
> hist(FEV$fev,breaks=5)
R does not always accept your suggestion. Experiment with at least four different choices
for the number of class intervals and comment on the results.
2.3 Measures of Variability or Scale

2.3.1 The Variance and Standard Deviation
Let x be a population variable with values x1, x2, . . . , xn. Some of the values might be repeated. The variance of x is

$$\operatorname{var}(x) = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu(x)\right)^2. \tag{2.5}$$
The standard deviation of x is

$$\operatorname{sd}(x) = \sigma = \sqrt{\operatorname{var}(x)}. \tag{2.6}$$
When x1 , x2 , . . . , xn are values of x from a sample rather than the entire population, we
modify the definition of the variance slightly, use a different notation, and call these objects
the sample variance and standard deviation.
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,    (2.7)

s = \sqrt{s^2}.    (2.8)
The reason for modifying the definition for the sample variance has to do with its properties as an estimate of the population variance: dividing by n − 1 rather than n makes the sample variance an unbiased estimate of the population variance.
Alternate algebraically equivalent formulas for the variance and sample variance are

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu(x)^2,

s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).
These are sometimes easier to use for hand computation.
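As a quick sanity check (the data vector here is made up for illustration), both sample-variance formulas give the same number, and both agree with R's built-in var( ):

> x = c(1, 3, 7)                                        # any small made-up sample
> sum((x - mean(x))^2) / (length(x) - 1)                # definition (2.7)
> (sum(x^2) - length(x) * mean(x)^2) / (length(x) - 1)  # shortcut formula, same value
> var(x)                                                # R's built-in sample variance agrees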
The standard deviation σ is called a measure of scale because of the way it behaves under linear transformations of the data. If a new variable y is defined by y = a + bx, where
a and b are constants, sd(y) = |b|sd(x). For example, the standard deviation of Fahrenheit
temperatures is 1.8 times the standard deviation of Celsius temperatures. The transformation y = a + bx can be thought of as a rescaling operation, or a choice of a different system
of measurement units, and the standard deviation takes account of it in a natural way.
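A one-line check in R illustrates the rescaling rule, using made-up Celsius values and the Fahrenheit conversion mentioned above:

> celsius = c(-5, 0, 12, 25, 31)        # made-up temperatures
> fahrenheit = 32 + 1.8 * celsius
> sd(fahrenheit); 1.8 * sd(celsius)     # the two printed values agree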
2.3.2 The Mean and Median Absolute Deviation
Suppose that you must choose a single number c to represent all the values of a variable
x as accurately as possible. One measure of the overall error with which c represents the
values of x is
g(c) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - c)^2}.    (2.9)
In the exercises, you are asked to show that this expression is minimized when c = x̄. In
other words, the single number which most accurately represents all the values is, by this
criterion, the mean of the variable. Furthermore, the minimum possible overall error, by
this criterion, is the standard deviation. However, this is not the only reasonable criterion.
Another is
h(c) = \frac{1}{n}\sum_{i=1}^{n}|x_i - c|.
It can be shown that this criterion is minimized when c = median(x). The minimum value of h(c) is called the mean absolute deviation from the median. It is a scale measure which is somewhat more robust (less affected by outliers) than the standard deviation, but still not very robust. A related, very robust measure of scale is the median absolute deviation from the median, or mad:

mad(x) = median(|x − median(x)|).    (2.10)
In R, the mad is adjusted by a constant factor of 1.4826. For data with a normal distribution
the adjusted mad is equal to the standard deviation. Normal distributions are discussed in
Chapter 5.
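A quick numeric illustration (the data vector is made up here) checks the minimizing properties just described and shows the effect of R's adjustment factor, using the constant argument of mad( ):

> x = c(1, 2, 4, 7, 20)                  # made-up data with a mild outlier
> g = function(c) sqrt(mean((x - c)^2))  # the criterion in (2.9)
> h = function(c) mean(abs(x - c))       # the absolute-error criterion
> optimize(g, range(x))$minimum          # close to mean(x) = 6.8
> optimize(h, range(x))$minimum          # close to median(x) = 4
> mad(x, constant = 1)                   # unadjusted mad: 3
> mad(x)                                 # R's adjusted mad, 1.4826 times larger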
2.3.3 The Interquartile Range
The interquartile range of a variable x is the difference between its 75th and 25th percentiles.
IQR(x) = q(x, .75) − q(x, .25)    (2.11)
It is a robust measure of scale which is important in the construction and interpretation of
boxplots, discussed below.
All of these measures of scale are valid for comparison of the ”spread” or variability of numeric
variables about a central value. In general, the greater their values, the more spread out the
values of the variable are. Of course, the standard deviation, median absolute deviation, and
interquartile range of a variable are different quantities and one must be careful to compare
like measures. The standard deviation, mad, IQR and other measures of scale obey the
basic formula for changes in measurement scale. If a and b are constants and y = a + bx,
then
sd(y) = |b| sd(x)    (2.12)

mad(y) = |b| mad(x)    (2.13)

IQR(y) = |b| IQR(x)    (2.14)
Example 2.4. The ”xdata” data set is a simulated sample of 100 numeric observations.
The pictures below show the histograms arising from multiplying the data by 1, 2, 0.5, and
3. All the measures of scale – standard deviation, mad, IQR and so on, will be multiplied
by the same factors.
[Figure: histograms of xdata, 2 * xdata, 0.5 * xdata, and 3 * xdata (Frequency versus value; horizontal axes from −10 to 10).]
> var(xdata); sd(xdata)
[1] 1.144041
[1] 1.069598
> var(2*xdata); sd(2*xdata)
[1] 4.576163
[1] 2.139197
> mad(xdata); IQR(xdata)
[1] 0.9334394
[1] 1.24708
> mad(2*xdata); IQR(2*xdata)
[1] 1.866879
[1] 2.494159
2.3.4 Exercises
1. By hand, find the sample variance and standard deviation of mydata. Repeat with R.
2. By hand, find the median absolute deviation of mydata. Repeat with R and observe that
R’s answer is 1.4826 times the answer you got by hand.
3. Find the variance and standard deviation of the response time data. Treat it as a sample
from a larger population.
4. Find the interquartile range and the median absolute deviation for the response time data.
5. In the response time data, replace the value x40 = 2.32 by 232.0. Recalculate the standard deviation, the interquartile range and the median absolute deviation and compare with
the answers from problems 3 and 4.
6. Show that the function g(c) in equation (2.9) is minimized when c = µ(x). Hint: Minimize g(c)^2.
2.4 Boxplots
Boxplots are also called box and whisker diagrams. Essentially, a boxplot is a graphical
representation of the five number summary. The boxplot below depicts the sensory response
data of the preceding section without the log transformation.
> boxplot(reacttimes$Times,horizontal=T,xlab="Reaction Times")
[Figure: horizontal boxplot of Reaction Times on a scale from 0 to 4.]
> summary(reacttimes)
     Times
 Min.   :0.120
 1st Qu.:1.090
 Median :1.530
 Mean   :1.742
 3rd Qu.:2.192
 Max.   :4.730
The central box in the diagram encloses the middle 50% of the numeric data. Its left and right boundaries mark the first and third quartiles. The boldface middle line in
the box marks the median of the data. Thus, the interquartile range is the distance between
the left and right boundaries of the central box. For construction of a boxplot, an outlier
is defined as a data value whose distance from the nearest quartile is more than 1.5 times
the interquartile range. Outliers are indicated by isolated points (tiny circles in this boxplot). The dashed lines extending outward from the quartiles are called the whiskers. They
extend from the quartiles to the most extreme values in either direction that are not outliers.
This boxplot shows a number of interesting things about the response time data.
(a) The median is about 1.5. The interquartile range is slightly more than 1.
(b) The three largest values are outliers. They lie a long way from most of the data. They
might call for special investigation or explanation.
(c) The distribution of values is not symmetric about the median. The values in the lower
half of the data are more crowded together than those in the upper half. This is shown by
comparing the distances from the median to the two quartiles, by the lengths of the whiskers
and by the presence of outliers at the upper end.
The asymmetry of the distribution of values is also evident in the histogram of the preceding
section.
2.4.1 Exercises
1. Make a boxplot of the log-transformed reaction time data. Is the transformed data more
symmetrically distributed than the original data?
2. The average public school teacher salaries in thousands of dollars for the 50 states and Washington D.C. (51 observations in all) are in the data set teacher salaries. The salary data in the Pay variable
are listed below in increasing order.
18.1 18.4 19.5 19.6 20.3 20.3 20.5 20.6 20.8 20.9 20.9
21.0 21.4 21.6 21.7 21.8 22.0 22.1 22.3 22.3 22.5 22.6
22.8 22.9 23.4 24.3 24.5 24.6 25.2 25.6 25.8 25.8 25.9
25.9 26.0 26.5 26.6 26.6 26.8 27.2 27.2 27.2 27.2 27.4
27.6 29.1 29.5 30.2 30.7 34.0 41.5
By hand, make a boxplot of the data above.
3. Use R to make a boxplot of Pay in teacher salaries.
4. By hand, make a boxplot of mydata[-21]. Show any outliers.
5. Make a boxplot of mydata with R.
6. The data set airquality is one of R’s included data sets. It shows daily measurements of
ozone concentration (Ozone), solar radiation (Solar.R), wind speed (Wind), and temperature (Temp) for five months (May through September) of 1973 in New York City. Some of the observations are
missing and are recorded as NA, meaning not available. View an overall summary of the
variables in airquality with the command
> summary(airquality)
Ignore the summaries for Month and Day since those variables should be factors, not numeric variables, and their summaries are meaningless. Attach airquality to your workspace
> attach(airquality)
and make boxplots of Ozone, Solar.R, Wind, and Temp. Comment on any noteworthy
features.
2.5 Factor Variables and Barplots
2.5.1 Tabulated Factor Variables
The location and scale measures discussed up to this point apply to numeric variables but
not to factor variables. The best way to summarize the values of factor variables is to
tabulate them and display the frequencies with a bar chart or barplot.
Example 2.5. The Montana outlook poll was a study conducted by the Bureau of Business and Economic Research, University of Montana in 1993. A sample of 209 Montana residents was classified according to age group, sex, income group, political affiliation, the area of the state they lived in, whether they expected their personal finances to improve, and whether they expected the state's financial situation to improve. The data is at Montana.txt. The data in numeric form with description is at http://lib.stat.cmu.edu/DASL. Here is a summary of all the variables in the data frame.
> summary(Montana)
[Table of level counts for each factor, including: AGE 35-54:66, NA's:1; INC <20K:47, 20-35K:83, >35K:60, NA's:19; POL Dem:84, Ind:40, Rep:78, NA's:7; AREA NE:58, SE:78, W:73; FIN better:71, same:76, worse:61, NA's:1; STAT better:118, no better:63, NA's:28.]
All of these variables are factor variables. Their values are tabulated in the summary above,
but for illustration we will tabulate the income variable INC separately using R’s table( )
function.
> attach(Montana)
> table(INC)
INC
  <20K   >35K 20-35K
    47     60     83
Notice that the 19 missing values are simply omitted from the table. A bar chart or barplot
can be constructed from the tabulated values.
> attach(Montana)
> barplot(table(INC))
[Figure: barplot of table(INC); bars labeled <20K, >35K, 20-35K with heights 47, 60, 83.]
The category labels below this plot are not in their natural order. To correct it we will tell
R to rearrange the categories and put the third category (20-35K) second and the second
category (>35K) third.
> barplot(table(INC)[c(1,3,2)])
[Figure: barplot of table(INC)[c(1,3,2)]; the bars now appear in the order <20K, 20-35K, >35K.]
A bar plot and a histogram are superficially similar but they are different. A histogram is
for numeric data after it has been grouped, so it is a type of bar plot. However, bar plots are
also useful for non-numeric categories or factors. Notice that in our examples histograms
have a measurement scale on the horizontal axis. Barplots for factor variables do not.
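For instance, reusing the data sets from earlier in this chapter, the two kinds of plot are produced as follows:

> hist(Payroll$payroll)         # numeric variable: histogram with a measurement scale
> barplot(table(Montana$POL))   # factor variable: one bar per category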
2.5.2 Exercises
1. Make bar plots of the other variables in Montana.
2. The teacher salaries data set has a variable called Region which indicates which region
of the U.S. a state is in. Tabulate Region and make a bar plot of it.
3. WorldPhones is a dataset included with R. First, read about it.
> help(WorldPhones)
Then display it with
> WorldPhones
WorldPhones is a matrix, not a data frame. The row names of the matrix are the years
”1951”, ”1956”, etc. The column names are the geographical regions ”N.Amer”, ”Europe”,
etc. You can extract a single column or a single row by, for example,
> WorldPhones[,"Europe"]
> WorldPhones["1961",]
You can make barplots of any column or row simply by embedding these commands as
arguments of the barplot( ) function. Make barplots of all the rows. Does it seem that
telephone usage became more evenly distributed across the regions for the years 1951-1961?
Bear in mind that the vertical axis scales are different for different years.
2.6 Jointly Distributed Variables
When two or more variables are jointly distributed, or jointly observed, it is important to
understand how they are related and how closely they are related. We will consider just two
variables, generically named x and y.
2.6.1 Two Factor Variables
When x and y are both factor variables the best way to reveal their relationship is to
cross tabulate them. If x has levels a1, a2, ..., ar, y has levels b1, b2, ..., bc, and there are n joint observations of x and y, then their cross tabulation is the r × c matrix with entries nij equal to the number of cases in which x = ai and y = bj. The cross tabulation is easy
to accomplish with the table( ) function of R.
Example 2.6. There are n = 209 cases in the Montana data. The cross tabulation of the
two variables x = AREA (region of the state) and y = POL (political party preference) is
> table(AREA,POL)
      POL
AREA   Dem Ind Rep
    NE  15  12  30
    SE  30  16  31
    W   39  12  17
From the table we see that there were 15 respondents to the survey in the northeastern
region who preferred the Democratic Party. There were 30 in the northeast who were Republicans.
It may be more revealing to show a table of relative frequencies rather than absolute frequencies. To do so, simply divide all the table entries nij by the total number of cases n.
In our example, the relative frequency table rounded to 3 places is
> table(AREA,POL)/209
      POL
AREA          Dem        Ind        Rep
    NE 0.07177033 0.05741627 0.14354067
    SE 0.14354067 0.07655502 0.14832536
    W  0.18660287 0.05741627 0.08133971
> round(.Last.value,3)
      POL
AREA     Dem   Ind   Rep
    NE 0.072 0.057 0.144
    SE 0.144 0.077 0.148
    W  0.187 0.057 0.081
The relative frequencies in a cross tabulation can be displayed with a mosaic plot.
> attach(Montana)
> plot(POL~AREA)

[Figure: mosaic plot with AREA (NE, SE, W) on the horizontal axis and POL (Dem, Ind, Rep) stacked vertically from 0.0 to 1.0.]
The formula POL ~ AREA tells R to treat AREA as the x variable, with values arrayed
horizontally, and POL as the y variable with values arrayed vertically. The widths of the
vertical bars vary slightly because they are proportional to the relative frequencies of the
levels of the x variable. From the plot you can see that the western region is predominantly
Democratic while the northeastern region is predominantly Republican, at least in the sample. The southeastern region has the greatest sample representation and it is about evenly
split between the two major parties.
2.6.2 One Factor and One Numeric Variable
We will next consider the case where x is a factor and y is numeric. The figure below
compares placement test scores for each of the letter grades in a sample of 179 students who
took a particular math course in the same semester under the same instructor. The two
jointly observed population variables are the letter grade received and the placement test
score. The figure separates test scores according to the letter grade and shows a boxplot
for each group of students. One would expect to see a decrease in the median test score as
the letter grade decreases and that is confirmed by the picture. However, the decrease in
median test scores from a letter grade of B to a grade of F is not very dramatic, especially
compared to the size of the IQRs. This suggests that the placement test is not especially
good at predicting a student’s final grade in the course. Notice the two outliers. The outlier
for the ”W” group is clearly a mistake in recording data because the scale of scores only
went to 100.
> test.vs.grade=read.csv("test.vs.grade.csv",header=T)
> attach(test.vs.grade)
> plot(Test~Grade,varwidth=T)
[Figure: side-by-side boxplots of Test (vertical axis, about 40 to 120) for each level of Grade: A, B, C, D, F, W.]
We used the same formula argument of the form y ∼ x here as in the previous example.
The plot function plot( ) knows to produce side by side boxplots when y is numeric and x
is a factor. The boxplot( ) function would work just as well. The argument ”varwidth=T”
tells R to allow the widths of the boxes to vary and reflect the number of observations in
each group.
2.6.3 Two Numeric Variables
Scatterplots
Next, we consider the case where both x and y are numeric variables, jointly observed, so
that we have the same number n of observations of each. Indeed, we have n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn). If we plot the n points in a Cartesian plane, we obtain
a scatterplot or a scatter diagram of the two variables.
Below are the first 10 rows of the ”Payroll” data set. The column labeled ”payroll” is the total
monthly payroll in thousands of dollars for each company listed. The column ”employees”
is the number of employees in each company and ”industry” indicates which of two related
industries the company is in. A scatterplot of all 50 values of the two variables ”payroll”
and ”employees” is also shown.
> Payroll=read.table("Payroll.txt",header=T)
> attach(Payroll)
> plot(payroll~employees,col=industry)

[Figure: scatterplot of payroll (vertical axis, about 150 to 350) against employees (horizontal axis, about 50 to 150), with points colored by industry.]
> Payroll[1:10,]
   payroll employees industry
1   190.67        85        A
2   233.58       109        A
3   244.04       130        B
4   351.41       166        A
5   298.60       154        B
6   241.43       124        B
7   143.93        38        B
8   242.33       116        A
9   216.88       103        A
10  195.97       101        A
The scatterplot shows that in general the more employees a company has, the higher its
monthly payroll. Of course this is expected. It also shows that the relationship between the
number of employees and the payroll is quite strong. For any given number of employees, the
variation in payrolls for that number is small compared to the overall variation in payrolls
for all employment levels. In this plot, the data from industry A is in black and that from
industry B is red. The plot shows that for employees ≥ 100, payrolls for industry A are
generally greater than those for industry B at the same level of employment.
Covariance and Correlation
If x and y are jointly distributed numeric variables, we define their covariance as

cov(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu(x))(y_i - \mu(y)).
If x and y come from samples of size n rather than the whole population, replace the
denominator n by n − 1 and the population means µ(x), µ(y) by the sample means x̄, ȳ
to obtain the sample covariance. The sign of the covariance reveals something about the
relationship between x and y. If the covariance is negative, values of x greater than µ(x)
tend to be accompanied by values of y less than µ(y). Values of x less than µ(x) tend to go
with values of y greater than µ(y), so x and y tend to deviate from their means in opposite
directions. If cov(x, y) > 0, they tend to deviate in the same direction. The strength of
these tendencies is not expressed by the covariance because its magnitude depends on the
variability of each of the variables about its mean. To correct this, we divide each deviation
in the sum by the standard deviation of the variable. The resulting quantity is called the
correlation between x and y:
cor(x, y) = \frac{cov(x, y)}{sd(x)\,sd(y)}.
The correlation between payroll and employees in the example above is 0.9782 (97.82 %).
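Assuming the Payroll data frame is still attached (as in the scatterplot example above), these quantities come straight from R's cov( ) and cor( ) functions:

> cov(payroll, employees)   # sample covariance
> cor(payroll, employees)   # sample correlation, approximately 0.9782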
Theorem 2.1. The correlation between x and y satisfies −1 ≤ cor(x, y) ≤ 1. cor(x, y) = 1
if and only if there are constants a and b > 0 such that y = a + bx. cor(x, y) = −1 if and
only if y = a + bx with b < 0.
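The boundary cases of the theorem are easy to see numerically; here is a small simulated check (the data are invented here):

> x = rnorm(100)        # any simulated data
> cor(x, 2 + 3 * x)     # b > 0: returns 1
> cor(x, 2 - 3 * x)     # b < 0: returns -1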
A correlation close to 1 indicates a strong positive relationship (tending to vary in the same
direction from their means) between x and y while a correlation close to −1 indicates a strong
negative relationship. A correlation close to 0 indicates that there is no linear relationship
between x and y. In this case, x and y are said to be (nearly) uncorrelated. There might
be a relationship between x and y but it would be nonlinear. The picture below shows a
scatterplot of two variables that are clearly related but very nearly uncorrelated.
> xs=runif(500,0,3*pi)
> ys=sin(xs)+rnorm(500,0,.15)
> cor(xs,ys)
[1] 0.005307598
> plot(xs,ys)
[Figure: scatterplot of ys against xs for xs between 0 and 3π; the points trace a sine curve even though the correlation is nearly 0.]
Some sample scatterplots of variables with different population correlations are shown below.
[Figure: four sample scatterplots of variables with population correlations cor(x,y) = 0, 0.3, 0.9, and −0.5.]

2.6.4 Exercises
1. With the Montana data, cross tabulate AREA and INC. Also make a mosaic plot of
these two variables. Do these suggest anything about the economics of Montana?
2. Do the same for AREA and POL. What, if anything, do you conclude about the politics
of Montana?
3. Do the same for AREA and AGE. Draw the appropriate conclusions.
4. With the Auto Pollution Filter Noise data, construct side by side boxplots of the variable
NOISE for the different levels of the factor SIZE. Comment. Do the same for NOISE and
TYPE.
5. With the Payroll data, construct side by side boxplots of "employees" versus "industry" and "payroll" versus "industry". Are these boxplots as informative as the color coded scatterplot in Section 2.6.3?
6. If you are using Rstudio click on the "Packages" tab, then the checkbox next to the library MASS. Click on the word MASS and then the data set "mammals" and read about it. If you are using R alone, in the Console window at the prompt > type

> data(mammals,package="MASS")
View the data with
> mammals
Make a scatterplot with the following commands and comment on the result.
> attach(mammals)
> plot(body,brain)
Also make a scatterplot of the log transformed body and brain weights.
> plot(log(body),log(brain))
A recently discovered hominid species, Homo floresiensis, had an estimated average body weight of 25 kg. Based on the scatterplots, what would you guess its brain weight to be?
7. Let x and y be jointly distributed numeric variables and let z = a + by, where a and b
are constants. Show that cov(x, z) = b ∗ cov(x, y). Show that if b > 0, cor(x, z) = cor(x, y).
What happens if b < 0?
8. Find the covariance and correlation between payroll and employees for the first 10 rows
only of the Payroll data.
Chapter 3

Probability

3.1 Background
In Chapter 1 we described a random experiment as one that can be replicated indefinitely
many times and whose outcome has a degree of uncertainty from replication to replication.
The uncertainty in a random experiment is subject to treatment with the tools of mathematical probability. The mathematical theory of probability is a huge subject which has
developed separately from the development of statistics. In this chapter we describe only
its most basic elements.
Recall from Chapter 1 that the set of all possible outcomes of a random experiment is called
its sample space and is denoted by the symbol Ω. An event is a set of outcomes, i.e., a
subset of Ω. A probability measure is a function which assigns numbers between 0 and 1
to events. The number assigned to an event is called its probability. If the sample space
Ω, the collection of events, and the probability measure are all specified, they constitute a
probability model of the random experiment.
Probability models do not come directly from nature. They are devised by researchers seeking to understand regularities in the phenomena they are studying. Possibly, the observed
result of an experiment cannot easily be reconciled with predictions based on the probability
model. In this case, the model is called into question or even refuted. The formalization of
this process constitutes most of the subject of statistical inference.
3.2 Equally Likely Outcomes
The simplest probability models have a finite sample space Ω. The collection of events is
the collection of all subsets of Ω and the probability of an event is simply the proportion
of all possible outcomes that correspond to that event. In such models, we say that the
experiment has equally likely outcomes. If the sample space has N elements and E is a
subset of Ω, then
Pr(E) = #(E)/N.
Each of the elementary events {ω} consisting of a single outcome has the same probability 1/N.
Here we introduce some notation that will be used throughout this text. The probability measure for a random experiment is denoted by the abbreviation Pr, sometimes with subscripts. Events will be denoted by upper case Latin letters near the beginning of the alphabet. The expression #(E) denotes the number of elements of the subset E.
Example 3.1. The Payroll data consists of 50 observations of 3 variables, ”payroll”, ”employees” and ”industry”. Suppose that a random experiment is to choose one record from
the Payroll data and suppose that the experiment has equally likely outcomes. Then, as the
summary below shows, the probability that industry A is selected is
Pr(industry = A) = 27/50 = 0.54.
> Payroll=read.table("Payroll.txt",header=T)
> summary(Payroll)
    payroll        employees      industry
 Min.   :129.1   Min.   : 26.00   A:27
 1st Qu.:167.8   1st Qu.: 71.25   B:23
 Median :216.1   Median :108.50
 Mean   :228.2   Mean   :106.42
 3rd Qu.:287.8   3rd Qu.:143.25
 Max.   :354.8   Max.   :172.00
In this example we use another common and convenient notational convention. The event whose probability we want is described in quasi-natural language as "industry=A" rather than with the formal but too cumbersome {ω ∈ Payroll | industry(ω) = A}. The description "industry=A" refers to the set of all possible outcomes of the experiment for which the variable "industry" has the value "A". This sort of informal description of an event will be used again and again.
The assumption of equally likely outcomes is an assumption about the selection procedure
for obtaining one record from the data. It is conceivable that a selection method is employed
for which this assumption is not valid. If so, we should be able to discover that it is invalid
by replicating the experiment sufficiently many times. This is a basic principle of classical
statistical inference. It relies on a famous result of mathematical probability theory called
the law of large numbers. One version of it is loosely stated as follows:
Law of Large Numbers: Let E be an event associated with a random experiment and let Pr be the probability measure of a true probability model of the experiment. Suppose the experiment is replicated n times and let

P̂r(E) = (1/n) × (the number of replications in which E occurs).

Then P̂r(E) → Pr(E) as n → ∞.

P̂r(E) is called the empirical probability of E.
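A simulated illustration in R, assuming the Payroll data frame from Example 3.1 is loaded: the empirical probability of the event "industry = A" settles near its true value 0.54 as the number of replications grows.

> picks = sample(Payroll$industry, 100000, replace = TRUE)  # 100,000 replications
> mean(picks == "A")                                        # close to 0.54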
3.3 Combinations of Events
Events are related to other events by familiar set operations. Let E1, E2, ... be a finite or infinite sequence of events. The union of E1 and E2 is the event

E1 ∪ E2 = {ω ∈ Ω | ω ∈ E1 or ω ∈ E2}.

More generally,

∪i Ei = E1 ∪ E2 ∪ ... = {ω ∈ Ω | ω ∈ Ei for some i}.

The intersection of E1 and E2 is the event

E1 ∩ E2 = {ω ∈ Ω | ω ∈ E1 and ω ∈ E2},

and, in general,

∩i Ei = E1 ∩ E2 ∩ ... = {ω ∈ Ω | ω ∈ Ei for all i}.
Sometimes we omit the intersection symbol ∩ and simply conjoin the symbols for the events in an intersection. In other words,

E1 E2 ... En = E1 ∩ E2 ∩ ... ∩ En.
The complement of the event E is the event

∼E = {ω ∈ Ω | ω ∉ E}.

∼E occurs if and only if E does not occur. The event E1 ∼E2 occurs if and only if E1 occurs and E2 does not occur.
Finally, the entire sample space Ω is an event with complement φ, the empty event. The
empty event never occurs. We need the empty event because it is possible to formulate a
perfectly sensible description of an event which happens never to be satisfied. For example,
if Ω = Payroll, the event ”employees < 25” is never satisfied, so it is the empty event.
We also have the subset relation between events. E1 ⊆ E2 means that if E1 occurs, then
E2 occurs, or in more familiar language, E1 is a subset of E2 . For any event E, it is true
that φ ⊆ E ⊆ Ω. E2 ⊇ E1 means the same as E1 ⊆ E2 .
3.3.1 Exercises
1. A random experiment consists of throwing a pair of dice, say a red die and a green
die, simultaneously. They are standard 6-sided dice with one to six dots on different faces.
Describe the sample space.
2. For the same experiment, let E be the event that the sum of the numbers of spots on the
two dice is an odd number. Write E as a subset of the sample space, i.e., list the outcomes
in E.
3. List the outcomes in the event F = ”the sum of the spots is a multiple of 3”.
Go to TOC
4. Find ∼ F , E ∪ F , EF = E ∩ F , and E ∼ F .
5. Assume that the outcomes of this experiment are equally likely. Find the probability of
each of the events in # 4.
6. Show that for any events E1 and E2, if E1 ⊆ E2 then ∼E2 ⊆ ∼E1.
7. The "mammals" data set in the "MASS" library contains the result of a study of sleep in mammal species.[1][2] Load the "mammals" data set into your R workspace. In Rstudio you can click on the "Packages" tab and then on the checkbox next to MASS. Without Rstudio, type

> data(mammals,package="MASS")
Attach the mammals data frame to your R search path with
> attach(mammals)
A random experiment is to choose one of the species listed in this data set. All outcomes
are equally likely. You can obtain a list of the species in the event ”body > 200” with the
command
> subset(mammals,body>200)
What is the probability of this event, i.e., what is the probability that you randomly select
a species with a body weight greater than 200 kg?
You can obtain a count of the species with body weights greater than 200 kg, by
> sum(body > 200)
[1] Weisberg, S. (1985) Applied Linear Regression. 2nd edition. Wiley, pp. 144-5.
[2] Allison, T. and Cicchetti, D. V. (1976) Sleep in mammals: ecological and constitutional correlates. Science 194, 732-734.
8. What are the species in the event that the ratio of brain weight to body weight is greater
than 0.02? Remember that brain weight is recorded in grams and body weight in kilograms,
so body weight must be multiplied by 1000 to make the two weights comparable. The species
belonging to this event can be obtained with the R command
> subset(mammals,brain/body/1000 > 0.02)
What is the probability of that event?
3.4 Rules for Probability Measures
The assumption of equally likely outcomes is often the starting point for the construction of a probability model. However, there are many random experiments for which this assumption is wrong. Regardless of how a probability measure for a model of a random experiment is chosen, there are certain rules that it must satisfy. They are:
1. 0 ≤ P r(E) ≤ 1 for each event E.
2. P r(Ω) = 1.
3. If E1, E2, ... is a finite or infinite sequence of events such that Ei Ej = φ for i ≠ j, then Pr(∪i Ei) = Σi Pr(Ei). If Ei Ej = φ for all i ≠ j we say that the events E1, E2, ... are pairwise disjoint. This means that no two of the events can both occur simultaneously, so to speak.
These are the basic rules. There are other properties that may be derived from them as
theorems.
4. Pr(E ∼ F) = Pr(E) − Pr(EF) for all events E and F. In particular, Pr(∼E) = 1 − Pr(E).

5. Pr(φ) = 0.

6. Pr(E ∪ F) = Pr(E) + Pr(F) − Pr(EF) for all events E and F.

7. If E ⊆ F, then Pr(E) ≤ Pr(F).

8. If E1 ⊆ E2 ⊆ ... is an infinite sequence of events, then Pr(∪i Ei) = lim_{i→∞} Pr(Ei).

9. If E1 ⊇ E2 ⊇ ... is an infinite sequence of events, then Pr(∩i Ei) = lim_{i→∞} Pr(Ei).
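As a sanity check of rule 6, a small simulation (invented here, not from the text) compares empirical frequencies for two events defined on a throw of two dice:

> red = sample(1:6, 100000, replace = TRUE)     # simulated red die
> green = sample(1:6, 100000, replace = TRUE)   # simulated green die
> A = (red + green) %% 2 == 1                   # sum of spots is odd
> B = (red + green) %% 3 == 0                   # sum is a multiple of 3
> mean(A | B)                                   # empirical Pr(A ∪ B)
> mean(A) + mean(B) - mean(A & B)               # rule 6 gives the same number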
3.5 Counting Outcomes. Sampling with and without Replacement
Suppose a random experiment with sample space Ω is replicated n times. The result is a
sequence (ω1 , ω2 , . . . , ωn ), where ωi ∈ Ω is the outcome of the ith replication. This sequence
is the outcome of a so-called compound experiment – the sequential replications of the basic
experiment. The sample space of this compound experiment is the n-fold cartesian product
Ωn = Ω × Ω × · · · × Ω. Now suppose that the basic experiment is to choose one member of a
finite population with N elements. We may identify the sample space Ω with the population.
Consider an outcome (ω1 , ω2 , . . . , ωn ) of the replicated experiment. There are N possibilities
for ω1 and for each of those there are N possibilities for ω2 and for each pair ω1 , ω2 there
are N possibilities for ω3 , and so on. In all, there are N × N × · · · × N = N n possibilities
for the entire sequence (ω1, ω2, ..., ωn). If all outcomes of the compound experiment are equally likely, then each has probability 1/N^n. Moreover, it can be shown that the compound experiment has equally likely outcomes if and only if the basic experiment has equally likely outcomes, each with probability 1/N.
Definition: An ordered random sample of size n with replacement from a population of size N is a randomly chosen sequence of length n of elements of the population, where repetitions are possible and each outcome (ω1, ω2, ..., ωn) has probability 1/N^n.
Now suppose that we sample one element ω1 from the population, with all N outcomes
equally likely. Next, we sample one element ω2 from the population excluding the one
already chosen. That is, we randomly select one element from Ω ∼ {ω1} with all the remaining N − 1 elements being equally likely. Next, we randomly select one element ω3 from the N − 2 elements of Ω ∼ {ω1, ω2}, and so on until at last we select ωn from the
remaining N − (n − 1) elements of the population. The result is a nonrepeating sequence
(ω1 , ω2 , · · · , ωn ) of length n from the population. A nonrepeating sequence of length n is
also called a permutation of length n from the N objects of the population. The total number of such permutations is

N × (N − 1) × · · · × (N − n + 1) = N!/(N − n)!.

Obviously, we must have n ≤ N for this to make sense. The number of permutations of length N from a set of N objects is N!. It can be shown that, with the sampling scheme described above, all permutations of length n are equally likely to result. Each has probability (N − n)!/N! of occurring.
Definition: An ordered random sample of size n without replacement from a population of size N is a randomly chosen nonrepeating sequence of length n from the population where each outcome (ω1, ω2, ..., ωn) has probability (N − n)!/N!.
Most of the time when sampling without replacement from a finite population, we do not
care about the order of appearance of the elements of the sample. Two nonrepeating sequences with the same elements in different order will be regarded as equivalent. In other
words, we are concerned only with the resulting subset of the population. Let us count the
number of subsets of size n from a set of N objects. Temporarily, let C denote that number.
Each subset of size n can be ordered in n! different ways to give a nonrepeating sequence.
Thus, the number of nonrepeating sequences of length n is C times n!. So N!/(N − n)! = C × n!, i.e.,

C = \frac{N!}{n!(N - n)!} = \binom{N}{n}.

This is the same binomial coefficient \binom{N}{n} that appears in the binomial theorem:

(a + b)^N = \sum_{n=0}^{N} \binom{N}{n} a^n b^{N-n}.
Definition: A simple random sample of size n from a population of size N is a randomly chosen subset of size n from the population, where each subset has the same probability of being chosen, namely 1/\binom{N}{n}.
A simple random sample may be obtained by choosing objects from the population sequentially, in the manner described above, and then ignoring the order of their selection.
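In R, the built-in sample( ) function implements both sampling schemes; a small sketch, with the population and sample sizes chosen arbitrarily:

> sample(1:365, 23, replace = TRUE)   # ordered random sample of size 23 with replacement
> sample(1:365, 23)                   # simple random sample of size 23 without replacement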
Example: The Birthday Problem
There are N = 365 days in a year. (Ignore leap years.) Suppose n = 23 people are
chosen randomly and their birthdays recorded. What is the probability that at least two of
them have the same birthday?
Solution: Arbitrarily numbering the people involved from 1 to n, their birthdays form an
ordered sample, with replacement, from the set of N = 365 birthdays. Therefore, each
sequence has probability 1/N^n of occurring. No two people have the same birthday if and
only if the sequence is actually nonrepeating. The number of nonrepeating sequences of
birthdays is N (N − 1) · · · (N − n + 1). Therefore, the event ”No two people have the same
birthday” has probability
N (N − 1) · · · (N − n + 1)
N (N − 1) · · · (N − n + 1)
=
n
N
N × N × ··· × N
1
2
n−1
)(1 − ) · · · (1 −
)
N
N
N
With n = 23 and N = 365 we can find this in R as follows:
= (1 −
> prod(1-(1:22)/365)
[1] 0.4927028
So, there is about a 49% probability that no two people in a random selection of 23 have the
same birthday. In other words, the probability that at least two share a birthday is about
51%.
An important, intuitively obvious principle in statistics is that if the sample size n is very
small in comparison to the population size N , a sample taken without replacement may
be regarded as one taken with replacement, if it is mathematically convenient to do so.
A sample of size 100 taken with replacement from a population of 100,000 has very little
chance of repeating itself. The probability of a repetition is about 5%.
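The 5% figure can be checked with the same product formula used for the birthday problem:

> 1 - prod(1 - (1:99)/100000)   # probability of at least one repetition, about 0.048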
3.5.1 Exercises
1. A red 6-sided die and a green 6-sided die are thrown simultaneously. The outcomes of
this experiment are equally likely. What is the probability that at least one of the dice lands
with a 6 on its upper face?
2. A hand of 5-card draw poker is a simple random sample from the standard deck of 52
cards. How many 5 draw poker hands are there? In 5-card stud poker, the cards are dealt
sequentially and the order of appearance is important. How many 5-stud poker hands are
there?
3. How many hands of 5-draw poker contain the ace of hearts? What is the probability that
a 5-card draw hand contains the ace of hearts?
4. Everybody in Ourtown is a fool or a knave or possibly both. 70% of the citizens are fools
and 85% are knaves. One citizen is randomly selected to be mayor. What is the probability
that the mayor is both a fool and a knave?
5. What is the probability that the mayor is a fool but not a knave?
6. A Martian year has 669 days. An R program for calculating the probability of no repetitions in a sample with replacement of n birthdays from a year of N days is given below.
> birthdays=function(n,N) prod(1-1:(n-1)/N)
To invoke this function with, for example, n=12 and N=400 simply type
> birthdays(12,400)
Check that the program gives the right answer for N=365 and n=23. Then use it to find
the number n of Martians that must be sampled in order for the probability of a repetition
to be at least 0.5.
7. A standard deck of 52 cards has four queens. Two cards are randomly drawn in succession, without replacement, from a standard deck. What is the probability that the first
card is a queen? What is the probability that the second card is a queen? If three cards are
drawn, what is the probability that the third is a queen? Make a general conjecture. Prove
it if you can. (Hint: Does the probability change if ”queen” is replaced by ”king” or ”seven”?)
3.6 Conditional Probability
Definition: Let A and B be events with Pr(B) > 0. The conditional probability of A, given B, is

Pr(A|B) = \frac{Pr(AB)}{Pr(B)}.    (3.1)
Pr(A) itself is called the unconditional probability of A.
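Before the example, a minimal sketch of how a conditional probability can be computed empirically in R, assuming the Payroll data frame is read in and attached as earlier; the conditioning event is invented here for illustration:

> # Pr(industry = "A" | employees > 100), with one record chosen at random
> sum(industry == "A" & employees > 100) / sum(employees > 100)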
Example 3.2. R includes a tabulation by various factors of the 2201 passengers and crew
on the fatal voyage of the Titanic. Re…