NOTE: This page has been revised for Winter 2021, but may undergo further edits.

Geography 4/595: Geographic Data Analysis
Winter 2021

Exercise 6: Reference Distributions and Significance Testing (Confidence Intervals, t-tests, and Analysis of Variance)
Finish by Friday February 19

1. Introduction

The objective of this exercise is to use R to make some statistical inferences about the significance of particular data values, statistics derived from data, or of the observed differences between the means of groups of observations.

2. Confidence Interval for the Mean

The first part of the exercise uses the Specmap data set. (It’s probably already in your workspace, but if not here’s a link: [specmap.csv]) These data consist of three columns: 1) the age (in ka (kiloanno =1000 yrs)) of 783 observations of δ18O values (oxygen-isotope values that indicate the size of the continental ice sheets, with positive values indicating large, and negative values small, ice sheets), and insolation (incoming solar radiation, expressed as differences from present values in July at 65oN in Wm-2) for about the past 800,000 years. The first row in the data set is the present (0 ka) and row 783 represents the values for 782,000 years before present (782 ka). The last glacial maximum occurred around 20 ka.

It’s always a good idea to take a look at data before starting any analyses. Here we want to look at δ18O and insolation, each as a function of age (time), and so conventionally, the variable Age should appear on the X-axis, and O18 or Insol on the Y-axis. Because both the oxygen-isotope and insolation data are time series, a 2-D Line Plot (or line plot of X-Y pairs) of both insolation and δ18O values will quickly display the data (Hint: Because the Y-axis variables have differing scales, putting both on the same plot would lead to a plot that is difficult to interpret, so it would be better to produce two separate plots.)

Q1: Describe the characteristic pattern of variation through time shared by both series (e.g. do they seem to vary in a random way, or do they vary more systematically with one another?).

Next, take a look at the distribution of δ18O, using a histogram or density plot, as in Exercise 2:

Q2: What does the distribution of δ18O look like? (Normal or not? If not, how far away from normal-looking does it appear? How easy is it to tell by simply “eyeballing” the histogram?)

How unusual is the present-day value of δ18O (-2.09)? Are current conditions representative of those that prevailed, on average, throughout the part of the Quaternary represented by these data, or are they relatively unusual? To get an idea of how representative or unusual the present is, compare its value to a 90% confidence interval for the mean of the whole series. If the present value falls outside of that interval, then that would suggest that the present value would not be a good estimate of the long-term average δ18O, or conversely, that the long-term average δ18O is quite different from the present value.

The following code will get a 0.95 (or 95%) confidence interval for the mean δ18O,

# statistics
O18_mean <- mean(O18)
O18_sd <- sd(O18)
O18_npts <- length(O18)
O18_semean <- O18_sd/sqrt(O18_npts)
print(c(O18_mean, O18_sd, O18_npts, O18_semean))

# confidence intervals
O18_upperCI <- O18_mean + 2*O18_semean
O18_lowerCI <- O18_mean - 2*O18_semean
print(c(O18_lowerCI, O18_mean, O18_upperCI))

Here’s a plot of the data (labeled by Age), along with the mean (solid red line), and the confidence interval (as dashed lines) that will help with the interpretation.

# plot the data labeled by Age, and the the mean and confidence interval
plot(Age, O18, type="o", pch=16, cex=0.5)
text(Age, O18, Age, cex=0.5, pos=4, offset=0.2)
abline(O18_mean, 0.0, col="red")
abline(O18_upperCI, 0.0, col="red", lty="dashed")
abline(O18_lowerCI, 0.0, col="red", lty="dashed")

Q3: Does the present (0 ka) value of δ18O fall inside or outside of the .95 or 95% confidence interval for the mean of δ18O? How about the value 1000 years ago (1 ka, row 2)? How far back in the record does one have to go to find the first δ18O value that would not be considered significantly different from the long-term mean? (Hint: just look at the numbers in a View() window. Also, you will have to mentally interpolate between values. For example, the δ18O at 2 ka is -1.91, and that at 3ka is -1.83, so at any time between 2ka and 3ka the value of δ18O is somewhere between -1.91 and -1.83.)

Getting the standard error and confidence limits for a number of variables would quickly become tedious. R allows (and encourages) the creation of functions that can be “reused” for different variables. The following is such a function (descr()) that describes a variable:

# function for describing a variable
# usage:  descr(x), where x is the name of a variable
descr <- function(x) {
    x_mean <- mean(x); cat(" Mean=", x_mean, "\n")
    x_sd <- sd(x); cat(" StdDev=", x_sd, "\n")
    x_npts <- length(x); cat(" n=", x_npts, "\n")
    x_semean <- x_sd/sqrt(x_npts); cat("SE Mean=", x_semean, "\n")
    x_upperCI <- x_mean + 2*x_semean; cat(" Upper=", x_upperCI, "\n")
    x_lowerCI <- x_mean - 2*x_semean; cat(" Lower=", x_lowerCI, "\n")
    cat(" ", "\n")

Copy and paste the function into RStudio (making sure you include the last curly bracket), select all of it, and run it. Nothing will happen, but the function will appeear in the Environment, and can now be used as follows:


3. Probability Plots

The question concerning how well an observed distribution matches the normal distribution (Q2 above) can be answered more directly using a graphical display called a “quantile plot” or “QQ” plot. The normal probability plot (what R calls a QQ Normal (qqnorm() plot) is a scatter diagram containing the values of a particular variable (plotted along the x-axis) and the values that would be returned by the cumulative density function (cdf) of a normal distribution with the same mean and standard deviation as the variable under consideration (plotted along the y-axis). The idea is that if the data really are normally distributed, then the normal probability plot should appear as a straight line, and as the distribution gets less-and-less normal, the normal probability plot will become more-and-more nonlinear.

Get a normal probability plot or QQ Normal plot for the δ18O data:

# qqnorm plot with a line

# histogram and density plot for comparison
hist(O18, breaks=40, probability=T)

Q4: What does the normal probability plot indicate about the distributions of δ18O values? Does the pattern of the points on the plot suggest anything about the way in which the observed distribution of δ18O departs from the normal distribution (if it indeed does)? Repeat the above analysis for insolation (Insol). Which variable appears to be closest to one that is normally distributed?

4. Two-Sample Difference-of-Means Tests (t-tests)

The question of whether two groups of observations are similar often arises. Similarity between groups of observations could be indicated by differences in the descriptive statistics of each group, in group-wise descriptive plots or in the distributions of variables. The most common index of “differentness” of groups of observations is provided by a comparison of the means of particular variables among the groups of observations. The test statistic that is appropriate for comparing two means is the t-statistic, which can be thought of as a “scaled” difference between two means, where the observed difference between two means is scaled by an estimate of the uncertainty in or variability of the the means.

Download and read the data set [orsim.csv] (and call it orsim in R). This data set includes climate values (January and July temperature and precipitation) simulated by a regional climate model for present, and for potential future conditions when the concentration of carbon dioxide in the atmosphere has doubled, for a set of model grid points that cover Oregon. The variables named TJan1x, TJul1x, PJan1x and PJul1x are the simulated present values of temperature (T) and precipitation (P), while the variables with “2x” in their names are those for the “doubled CO2 scenarios” that are intended to describe the climate of the next century.

Note that the grid pattern in this particular data set is not parallel to the graticule (lat/lon lines). Check to make sure the Oregon county outlines shape files (orotl.shp) are available; if not, see Part 2 of Exercise 4. Here’s some code:

# load the appropriate packages

Read the data file and shape file for mapping (if not already in the workspace).

# read the data file (if not already in the workspace)
csv_path <- "/Users/bartlein/Documents/geog495/data/csv/orsim.csv"
orsim <- read.csv(csv_path)
# read the shapefile (if not already in the workspace)
orotl_sf <- st_read(shapefile)

Attach the dataframe.

# attach the dataframe

Calculate the change in, for example, January temperature and precipitation:

# calculate Future - Present differences
orsim$TJanDiff <- orsim$TJan2x - orsim$TJan1x
orsim$PJanDiff <- orsim$PJan2x - orsim$PJan1x

Here is the code for a couple of ggplot2 maps:

## ggplot2 map of temperature
ggplot() +
    geom_sf(data = orotl_sf, fill="grey70", color="black") +
    geom_point(data = orsim, aes(x = Lon, y = Lat, color = TJanDiff), size = 5.0 ) +
    scale_fill_gradient2(low = "blue", mid= "white", high = "red", aesthetics = 'color',
            limits = c(-6, 6), breaks = seq(-6, 6, by = 2), guide = 'colorbar', name = "TJanDiff (C)") +
    labs(x = "Longitude", y = "Latitude", title = "Oregon Future Climate Simulations") 

## ggplot2 map of precipitation
ggplot() +
    geom_sf(data = orotl_sf, fill="grey70", color="black") +
    geom_point(data = orsim, aes(x = Lon, y = Lat, color = PJanDiff), size = 5.0 ) +
    scale_fill_gradient2(low = "bisque4", mid= "white", high = "skyblue3", aesthetics = 'color',
           limits = c(-150, 150), breaks = seq(-150, 150, by = 50), guide = 'colorbar', name = "PJanDiff (mm)") +
    labs(x = "Longitude", y = "Latitude", title = "Oregon Future Climate Simulations") 

Before going on, compare the two blocks of code to see what sort of things have to be changed to plot different variables (e.g. the range of values that are plotted, choice of colors).

How would a doubling of carbon dioxide in the atmosphere affect temperature over Oregon? From what we already know about the greenhouse effect, atmospheric change, and basic climatology, we can suspect that the temperatures will increase with an increase in carbon dioxide. The question is, are the simulated temperatures for the “enhanced greenhouse”-situation significantly higher than those for present?

A two-sample t-test can be performed as follows:

# t-tests among groups with different variances
boxplot(TJan2x, TJan1x)
print(mean(TJan2x)-mean(TJan1x)) # the temperature difference
tt_TJan <- t.test(TJan2x, TJan1x)

Q5: What is the value of the t-statistic for a comparison of the simulated present and potential future January mean temperature over Oregon? Is this value large enough to be significant? (Hint: examine the p-value.)

In performing a t-test, the “default” null hypothesis, Ho, is that the means of the two groups of observations are equal, and the alternative hypothesis (Ha) is that the are not. A more explicit alternative hypothesis can be stated when, as is the case for this climate data, we expect the mean of one group (the present in this case) to be smaller than the mean of the other group (the doubled carbon dioxide scenario). In such a situation, where the potential sign of the difference between group means is known, a more “decisive” test can be made, by adopting a different, “one-sided” alternative hypothesis:

For January precipitation (PJan1x and PJan2x) test the hypotheses that PJan1x and PJan2x are equal and that PJan1x is less than PJan2x.

# two-tailed test
boxplot(PJan2x, PJan1x)
tt_PJan_1 <- t.test(PJan2x, PJan1x)

# one-tailed test
boxplot(PJan2x, PJan1x)
tt_PJan_2 <- t.test(PJan2x, PJan1x, alternative="greater")

Q6: Is future winter precipitation (PJan2x) the same as present (PJan1x), or is it significantly greater than present?

5. Analysis of Variance

Many times, more than two groups of observations exist. Although it is possible to compare pairs of groups of observations, what often happens is that some comparisons appear significant, some don’t, and it may be difficult to reach an overall conclusion about whether the groups jointly are significantly different. The procedure known as “analysis of variance” (ANOVA) allows a single overall measure of the difference in the means among more than two groups to be computed. The statistic generated in an analysis of variance is the F-statistic, which is a measure of the variability among groups of data relative to the variability within groups – as the group means become more distinct, the F-statistic increases in size, while as the groups become more internally variable, the F-statistic decreases.

Download and read in the Scandinavian EU Preference Vote data [scanvote.csv] if it’s not already in your workspace (call it scanvote in R, and attach it). These data describe the percentages of votes in favor of membership in the European Union (Yes(%)), by administrative district in Finland, Norway, and Sweden, collected by Alec Murphy. As always, plot the data first. A boxplot or stripchart by country would be good (e.g. boxplot(Yes ~ Country)). Use the descr() function from section 1 above and tapply() to calculate standard errors of the mean for each country, e.g.:

tapply(Yes, Country, descr)

Examine the text output and the plots to see how distinct the groups (countries) appear, and compare the means and standard errors of the mean to check the similarity of the means among countries.

Now do an analysis of variance:

# anova
aov1 <- aov(Yes ~ Country)

The F-statistic has as its reference distribution, the F-distribution (no surprise there). The F-distribution varies considerably with the number of groups and the number of observations within groups, and is not as easily visualized as the normal or t-distributions.

Also do the homogeneity of variance test.

hov1 <- bartlett.test(Yes ~ Country)

Q7: What is the value of the F-statistic, and how unusual is this value for the analysis of variance of these data? (Hint, look at the p-value, right next to the F statistic, lablelled Pr(>F).) Are the variances similar among groups? What does this suggest about the difference among countries in preference for joining the EU?

(See the short guide to interpreting test statistics, p-values, and significance for information on interpreting p-values [link].

There is an interesting summary plot of an analysis of variance object that provides further visualization of the differences among groups


Finally, download and read in the Summit Cr. geomorphic data project [sumcr.csv] if you don’t already have it, and do an analysis of variance on water-surface width (WidthWS), with group membership determined by Reach. Compare the extent of overlap of the confidence intervals for the group means and the values of the F-statistics for this analysis, and the previous one on preferences. You should see how the size of the F-statistic reflects the distinctiveness of the groups.

Q8 Decribe the results. Do water-surface widths differ significantly among reaches?

6. What to hand in

Hand in the answers to the above questions, plus the plot you made of the Scandinavian vote data.