NOTE: This page has been revised for Winter 2021, but may undergo further edits.
The general idea that underlies statistical inference is the comparison of particular statistics from an observational data set (e.g. the mean, the standard deviation, the differences among the means of subsets of the data) with an appropriate reference distribution, in order to judge the significance of those statistics. When various assumptions are met, and specific hypotheses about the values of those statistics have been specified in advance, statistical inference is a powerful approach for drawing scientific conclusions that makes efficient use of existing data or of data collected for the specific purpose of testing those hypotheses. Even when a formal experimental design is not possible, or when the objective is simply to explore the data, significance evaluation can be useful.
As a consequence of the central limit theorem, the sample mean is (approximately) normally distributed, and so the normal distribution can be used to describe the uncertainty of a sample mean.
Once a sample has been obtained, and descriptive statistics calculated, attention may then turn to the significance (representativeness as opposed to unusualness) of the sample or of the statistics. This information may be gained by comparing the specific value of a statistic with an appropriate reference distribution, and by the calculation of additional statistics that describe the level of uncertainty a particular statistic may have.
In the case of the sample mean, the appropriate reference distribution is the normal distribution, which is implied by the Central Limit Theorem.
Uncertainty in the mean can be described by the standard error of the mean or by a confidence interval for the mean. The standard error of the mean can be thought of as the standard deviation of a set of mean values from repeated samples.
Definition of the standard error of the mean: SE = s/sqrt(n), where s is the sample standard deviation and n is the sample size.
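In R, this can be computed directly from a single sample (a minimal sketch using simulated values, not part of the original example):

```r
# standard error of the mean from a single sample
x <- rnorm(1000, mean = 5, sd = 2)
sd(x) / sqrt(length(x))  # should be close to 2/sqrt(1000), about 0.063
```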
Here is a demonstration using simulated data and repeated samples of different sizes.
# generate 1000 random numbers from the normal distribution
npts <- 1000
demo_mean <- 5; demo_sd <- 2
data_values <- rnorm(npts, demo_mean, demo_sd)
hist(data_values); mean(data_values); sd(data_values)
## [1] 5.034757
## [1] 1.966487
Set the number of replications (nreps) and the (maximum) sample size (max_sample_size).
nreps <- 1000 # number of replications (samples) for each sample size
max_sample_size <- 100 # number of example sample sizes
Create several matrices to hold the individual replication results.
# matrix to hold means of each of the nreps samples
mean_samp <- matrix(1:nreps)
# matrices to hold means, sd's and sample sizes for each n
average_means <- matrix(1:(max_sample_size-1))
sd_means <- matrix(1:(max_sample_size-1))
sample_size <- matrix(1:(max_sample_size-1))
Generate means for a range of sample sizes (2:max_sample_size)
for (n in seq(1,max_sample_size-1)) {
  # for each sample size generate nreps samples and get their mean
  for (i in seq(1,nreps)) {
    samp <- sample(data_values, n+1, replace=T)
    mean_samp[i] <- mean(samp)
  }
  # get the average and standard deviation of the nreps means
  average_means[n] <- apply(mean_samp,2,mean)
  sd_means[n] <- apply(mean_samp,2,sd)
  sample_size[n] <- n+1
}
Take a look at the means and the standard errors. Note that means remain essentially constant across the range of sample sizes, while the standard errors decrease rapidly (at first) with increasing sample size.
plot(sample_size, average_means, ylim=c(4.5, 5.5), pch=16)
plot(sample_size, sd_means, pch=16)
head(cbind(average_means,sd_means,sample_size))
## [,1] [,2] [,3]
## [1,] 5.002194 1.3991343 2
## [2,] 4.992911 1.1747384 3
## [3,] 5.059321 0.9902242 4
## [4,] 5.031235 0.8585874 5
## [5,] 5.031041 0.8160077 6
## [6,] 5.015314 0.7535537 7
tail(cbind(average_means,sd_means,sample_size))
## [,1] [,2] [,3]
## [94,] 5.036694 0.1899069 95
## [95,] 5.041825 0.1985795 96
## [96,] 5.044742 0.2041062 97
## [97,] 5.036726 0.2012360 98
## [98,] 5.040909 0.2003633 99
## [99,] 5.038318 0.2000125 100
Verify that the standard error of the mean is sigma/sqrt(n)
plot(demo_sd/sqrt((2:max_sample_size)), sd_means, pch=16)
Generate some data values, this time from a uniform distribution
# data_values from a uniform distribution
data_values <- runif(npts, 0, 1)
hist(data_values); mean(data_values); sd(data_values)
## [1] 0.5087177
## [1] 0.2835326
Rescale these values so that they have the same mean (demo_mean) and standard deviation (demo_sd) as in the previous example.
# rescale the data_values so they have a mean of demo_mean
# and a standard deviation of demo_sd (standardize, then rescale)
data_values <- (data_values-mean(data_values))/sd(data_values)
mean(data_values); sd(data_values)
## [1] 4.369281e-18
## [1] 1
data_values <- (data_values*demo_sd)+demo_mean
hist(data_values); mean(data_values); sd(data_values)
## [1] 5
## [1] 2
Repeat the demonstration
for (n in seq(1,max_sample_size-1)) {
  # for each sample size generate nreps samples and get their mean
  for (i in seq(1,nreps)) {
    samp <- sample(data_values, n+1, replace=T)
    mean_samp[i] <- mean(samp)
  }
  # get the average and standard deviation of the nreps means
  average_means[n] <- apply(mean_samp,2,mean)
  sd_means[n] <- apply(mean_samp,2,sd)
  sample_size[n] <- n+1
}
plot(sample_size, sd_means, pch=16)
head(cbind(average_means,sd_means,sample_size))
## [,1] [,2] [,3]
## [1,] 5.009442 1.4284970 2
## [2,] 5.012839 1.1621002 3
## [3,] 4.918761 0.9917057 4
## [4,] 5.004906 0.9018073 5
## [5,] 4.975955 0.8187180 6
## [6,] 5.022585 0.7554809 7
tail(cbind(average_means,sd_means,sample_size))
## [,1] [,2] [,3]
## [94,] 5.004614 0.2007629 95
## [95,] 5.015275 0.2047870 96
## [96,] 4.999232 0.2094253 97
## [97,] 4.996611 0.2028047 98
## [98,] 5.003051 0.1943208 99
## [99,] 5.003415 0.2052365 100
This demonstrates that the standard error of the mean is insensitive to the underlying distribution of the data.
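A further quick check (a sketch using strongly skewed exponential data, not part of the original example) shows the same behavior:

```r
set.seed(123)
x <- rexp(1000, rate = 1)  # strongly right-skewed data
n <- 25
# standard deviation of many sample means vs. the theoretical standard error
sim_se <- sd(replicate(1000, mean(sample(x, n, replace = TRUE))))
c(simulated = sim_se, theoretical = sd(x)/sqrt(n))
```

The two values should agree closely even though the data are far from normal.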
The confidence interval provides a verbal or graphical characterization, based on the information in a sample, of the likely range of values within which the “true” or population mean lies. This example uses an artificial data set [cidat.csv]. cidat is a data frame that can be generated as follows:
# generate 4000 random values from the Normal Distribution with mean=10, and standard deviation=1
NormDat <- rnorm(mean=10, sd=1, n=4000)
# generate a "grouping variable" that defines 40 groups, each with 100 observations
Group <- sort(rep(1:40,100))
cidat <- data.frame(cbind(NormDat, Group)) # make a data frame
Attach and summarize the data set.
attach(cidat)
## The following objects are masked _by_ .GlobalEnv:
##
## Group, NormDat
summary(cidat)
## NormDat Group
## Min. : 6.322 Min. : 1.00
## 1st Qu.: 9.357 1st Qu.:10.75
## Median :10.030 Median :20.50
## Mean :10.030 Mean :20.50
## 3rd Qu.:10.703 3rd Qu.:30.25
## Max. :13.627 Max. :40.00
The idea here is to imagine that each group of 100 observations represents one possible sample, of the sort that might occur in practice, of some underlying process or information set. These hypothetical samples (which are each equally likely) provide a mechanism for illustrating the range of values of the mean that could occur simply due to natural variability of the data, and the “confidence interval” is the range of values that encloses a specified proportion (here, roughly 95%) of the possible mean values.
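For a single sample, a rough 95% confidence interval can be sketched as the mean plus or minus twice the standard error (the interval used below); t.test() gives the exact t-based version:

```r
set.seed(42)
x <- rnorm(100, mean = 10, sd = 1)
se <- sd(x) / sqrt(length(x))
mean(x) + c(-2, 2) * se        # approximate 95% interval (mean +/- 2 SE)
t.test(x)$conf.int             # exact t-based 95% interval, very similar
```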
Get the means and standard errors of each group.
group_means <- tapply(NormDat, Group, mean)
group_sd <- tapply(NormDat, Group, sd)
group_npts <- tapply(NormDat, Group, length)
group_semean <- (group_sd/(sqrt(group_npts)))
mean(group_means)
## [1] 10.02969
sd(group_means)
## [1] 0.09699701
Plot the individual samples (top plot) and then the means, and their standard errors (bottom plot). Note the different scales on the plots.
# plot means and data
par(mfrow=c(2,1))
plot(Group, NormDat)
points(group_means, col="red", pch=16)
# plot means and standard errors of means
plot(group_means, ylim=c(9, 11), col="red", pch=16, xlab="Group")
points(group_means + 2.0*group_semean , pch="-")
points(group_means - 2.0*group_semean , pch="-")
abline(10,0)
The bottom plot shows that of the 40 mean values (red dots), 2 (0.05, or 5 percent) have intervals (defined to be twice the standard error either side of the mean, black tick marks) that do not enclose the “true” value of the mean (10.0).
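The 5-percent miss rate can be checked with a larger simulation (a sketch, not part of the original example):

```r
set.seed(101)
# fraction of mean +/- 2 SE intervals that fail to enclose the true mean (10)
miss <- replicate(2000, {
  x <- rnorm(100, mean = 10, sd = 1)
  se <- sd(x) / sqrt(length(x))
  10 < mean(x) - 2 * se || 10 > mean(x) + 2 * se
})
mean(miss)  # should be close to 0.05
```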
Set the graphics window back to normal and detach cidat.
par(mfrow=c(1,1))
detach(cidat)
The standard error of the mean, along with the knowledge that the sample mean is normally distributed, allows inferences about the mean to be made. For example, questions of the following kind can be answered:
Here’s a short discussion of simple inferential statistics:
The next step toward statistical inference is the more formal development and testing of specific hypotheses (as opposed to the rather informal inspection of descriptive plots, confidence intervals, etc.).
“Hypothesis” is a word used in several contexts in data analysis or statistics:
A null hypothesis is never “proven” by a statistical test. Tests may only reject, or fail to reject, a null hypothesis.
There are two general approaches toward setting up and testing specific hypotheses: the “classical approach” and the “p-value” approach.
The steps in the classical approach:
The steps in the “p-value” approach are:
An illustration of an hypothesis test that is frequently used in practice is provided by the t-test, one of several “difference-of-means” tests. The t-test (more particularly Student’s t-test, after the pseudonym of its author, W.S. Gosset) provides a mechanism for the simple task of testing whether there is a significant difference between two groups of observations, as reflected by differences in the means of the two groups. In the t-test, two sample mean values, or a sample mean and a theoretical mean value, are compared as follows:
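In the two-sample, unequal-variance (Welch) form that R's t.test() uses by default, the statistic is t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2). A sketch with simulated data, comparing a hand computation with t.test():

```r
set.seed(1)
x1 <- rnorm(30, mean = 8, sd = 1)
x2 <- rnorm(30, mean = 8, sd = 1)
# Welch t statistic computed by hand
t_manual <- (mean(x1) - mean(x2)) /
  sqrt(var(x1)/length(x1) + var(x2)/length(x2))
t_manual
t.test(x1, x2)$statistic  # should match t_manual
```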
The shape of the t distribution can be visualized as follows (here for df=3):
x <- seq(-3,3, by=.1)
pdf_t <- dt(x,3)
plot(pdf_t ~ x, type="l")
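For comparison, overlaying the standard normal density (a small addition, not in the original) shows the heavier tails of the t distribution at low degrees of freedom:

```r
x <- seq(-3, 3, by = .1)
plot(dt(x, 3) ~ x, type = "l")   # t distribution, df = 3
lines(dnorm(x) ~ x, lty = 2)     # standard normal, for comparison
```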
You can read about the origin of Gosset’s pseudonym (and his contributions to brewing) here.
There are two ways the t-test is implemented in practice, depending on the nature of the question being asked and hence on the nature of the null hypothesis:
Example data sets:
Attach the example data, and get a boxplot of the data by group:
# t-tests
attach(ttestdat)
boxplot(Set1 ~ Group1)
Two-tailed t-test (are the means different in a general way?)
# two-tailed tests
t.test(Set1 ~ Group1)
##
## Welch Two Sample t-test
##
## data: Set1 by Group1
## t = -0.2071, df = 55.818, p-value = 0.8367
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.11841233 0.09622446
## sample estimates:
## mean in group 0 mean in group 1
## 7.988305 7.999399
The t-statistic is -0.2071 and the p-value = 0.8367, which indicates that the t-statistic is not significant, i.e. that there is little support for rejecting the null hypothesis that there is no difference between the mean of group 0 and the mean of group 1.
Two one-tailed t-tests (each evaluates whether the means differ in a specific, directional way)
t.test(Set1 ~ Group1, alternative = "less") # i.e. mean of group 0 is less than the mean of group 1
##
## Welch Two Sample t-test
##
## data: Set1 by Group1
## t = -0.2071, df = 55.818, p-value = 0.4183
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.07850556
## sample estimates:
## mean in group 0 mean in group 1
## 7.988305 7.999399
t.test(Set1 ~ Group1, alternative = "greater") # i.e. mean of group 0 is greater than the mean of group 1
##
## Welch Two Sample t-test
##
## data: Set1 by Group1
## t = -0.2071, df = 55.818, p-value = 0.5817
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.1006934 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 7.988305 7.999399
Notice that for each example the statistics (t-statistic, means of each group) are identical, while the p-values and confidence intervals differ. The smallest p-value is obtained for the test of the hypothesis that the mean of group 0 is less than the mean of group 1 (which is the observed difference). But that difference is not significant (the p-value is greater than 0.05).
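The relationship between the two kinds of p-values can be sketched with simulated data (when the observed difference lies in the hypothesized direction, the one-tailed p-value is half the two-tailed one):

```r
set.seed(2)
x1 <- rnorm(30, mean = 8.0, sd = 1)
x2 <- rnorm(30, mean = 8.5, sd = 1)
t.test(x1, x2)$p.value                         # two-tailed
t.test(x1, x2, alternative = "less")$p.value   # one-tailed
```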
And a second example:
boxplot(Set2 ~ Group2)
t.test(Set2 ~ Group2)
##
## Welch Two Sample t-test
##
## data: Set2 by Group2
## t = 6.9733, df = 57.372, p-value = 3.419e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2772856 0.5006463
## sample estimates:
## mean in group 0 mean in group 1
## 7.988305 7.599339
detach(ttestdat)
Here the t-statistic is relatively large and the p-value very small, lending support for rejecting the null hypothesis of no significant difference in the means (and accepting the alternative hypothesis that the means do differ). Remember, we haven’t “proven” that they differ, we’ve only rejected the idea that they are identical.
One assumption that underlies the classical t-test is that the variances (or dispersions) of the two samples are equal. A modification of the basic test (the Welch version that R’s t.test() uses by default) allows cases when the variances are unequal to be handled, but large differences in variability between the two groups can still have an impact on the interpretability of the test results:
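The Welch modification replaces the pooled-variance degrees of freedom with the Welch–Satterthwaite approximation, which is what t.test() reports as "df" when var.equal = FALSE (its default). A sketch:

```r
set.seed(3)
x1 <- rnorm(50, mean = 10, sd = 1)
x2 <- rnorm(50, mean = 10, sd = 4)   # much larger variance
v1 <- var(x1) / length(x1)
v2 <- var(x2) / length(x2)
# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (length(x1) - 1) + v2^2 / (length(x2) - 1))
df_welch
t.test(x1, x2)$parameter  # should match df_welch
```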
Example data: [foursamples.csv]
t-tests among groups with different variances
attach(foursamples)
# nice histograms
cutpts <- seq(0.0, 20.0, by=1)
par(mfrow=c(2,2))
hist(Sample1, breaks=cutpts, xlim=c(0,20))
hist(Sample2, breaks=cutpts, xlim=c(0,20))
hist(Sample3, breaks=cutpts, xlim=c(0,20))
hist(Sample4, breaks=cutpts, xlim=c(0,20))
par(mfrow=c(1,1))
boxplot(Sample1, Sample2, Sample3, Sample4)
mean(Sample1)-mean(Sample2)
## [1] -0.2718703
t.test(Sample1, Sample2)
##
## Welch Two Sample t-test
##
## data: Sample1 and Sample2
## t = -1.7294, df = 997.9, p-value = 0.08404
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.58035326 0.03661273
## sample estimates:
## mean of x mean of y
## 10.72631 10.99818
mean(Sample3)-mean(Sample4)
## [1] -0.2676365
t.test(Sample3, Sample4)
##
## Welch Two Sample t-test
##
## data: Sample3 and Sample4
## t = -4.2308, df = 998, p-value = 2.543e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3917738 -0.1434991
## sample estimates:
## mean of x mean of y
## 10.73264 11.00027
mean(Sample1)-mean(Sample3)
## [1] -0.006325667
t.test(Sample1, Sample3)
##
## Welch Two Sample t-test
##
## data: Sample1 and Sample3
## t = -0.053011, df = 658.3, p-value = 0.9577
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2406330 0.2279816
## sample estimates:
## mean of x mean of y
## 10.72631 10.73264
mean(Sample2)-mean(Sample4)
## [1] -0.002091883
t.test(Sample2, Sample4)
##
## Welch Two Sample t-test
##
## data: Sample2 and Sample4
## t = -0.017387, df = 654.69, p-value = 0.9861
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2383369 0.2341532
## sample estimates:
## mean of x mean of y
## 10.99818 11.00027
detach(foursamples)
There is a formal test for equality of group variances that will be described with analysis of variance.