NOTE: This page has been revised for Winter 2021, but may undergo further edits.
Geography 4/595: Geographic Data Analysis
Winter 2021
Exercise 7: Regression Analysis
Finish by Friday, February 26
1. Introduction
The objective of this exercise is to illustrate regression analysis as well as multivariate plotting, which will be used to examine the data set that will be used in the regression analysis. The focus of the exercise is on the relationship between annual mean temperature at climate stations in Oregon and elevation and location as expressed by latitude and longitude. The specific goal will be to build a regression model that fits the response-variable data well, and does not suffer from assumption violations.
Here’s a link to a .csv file containing the data: [ortann.csv].
And here are links to the components of the Oregon county outline shape files: [orotl.shp] [orotl.dbf] [orotl.shx] [orotl.prj]
Read in the .csv file and the shapefile components if they are not already in the workspace.
2. Maps and scatter plots
The exercise will involve fitting a succession of models, the “goodness-of-fit” of which can be viewed by examining the patterns in maps of the residuals. These should show progressively less obvious patterns as the fit improves. Load the necessary packages:
# load the appropriate packages
library(sf)
library(RColorBrewer)
library(ggplot2)
Read the data file and shape file, if not already in the workspace.
# read the shapefile (if not already in the workspace)
="/Users/bartlein/Documents/geog495/data/shp/orotl.shp"
shapefile<- st_read(shapefile)
orotl_sf plot(orotl_sf)
# read the data file (if not already in the workspace)
<- "/Users/bartlein/Documents/geog495/data/csv/ortann.csv"
csv_path <- read.csv(csv_path) ortann
The first step in any regression analysis (or for that matter ANY analysis) is to look at the data. In the example below, the dependent or response variable will be annual temperature at the Oregon climate stations, and the predictors will be elevation and location (latitude and longitude). A quick way to get scatter plots is to produce a matrix plot
attach(ortann)
plot(ortann[,2:5])
The square bracket selection omits the first column of the ortann
data frame from the plot (which contains the station names).
As you proceed through the exercise, a set of maps will be generated that, except for the variables being plotted and the range of values of those variables, will be identical. That makes it efficient to construct a function for making the maps that will accommodate different variables and scales. Here is such a function, and to “install” into the workspace, simply run it at the command line.
<- function(title, varname, plotvar, limits) {
map_temp ggplot() +
geom_sf(data = orotl_sf, fill="grey70", color="black") +
geom_point(data = ortann, aes(x = longitude, y = latitude, color = plotvar), size = 5.0 ) +
scale_color_distiller(type = "div", palette = "RdBu", aesthetics = "color", direction = -1,
limits = limits, name = varname) +
labs(x = "Longitude", y = "Latitude", title = title) +
theme_bw()
}
There are three arguments that must be passed to the function: 1) the map title (a string), 2) a label for the variable (a string), 3) the name of the variable (a string), and 4) the variable-value limits (a pair of values, concatenated with c()
)
Plot the values of tann by setting the values for each of the arguments, and then calling the map_tmp()
function:
# map the annual tempeature values
<- "Oregon Climate Station Data -- Annual Temperature"
title <- "Annual Temp. (C)"
varname <- tann
plotvar <- c(-15, 15)
limits map_temp(title, varname, plotvar, limits)
3. An initial “null” or “naïve” model
The simplest of models is one in which a single value, usually the mean, is used as the “fitted value” or prediction for each observation. This model can be thought of as a “naïve” model—the only thing the response variables “knows about” is its own mean value, and nothing about the influence of the eventual predictor variables. It might be the case that the potential predictor variables (elevation, location) have no influence on the response variable, and in such a case, the mean might be the best predictor of the values of the response variable, but a rather boring one from the perspective of explanation.
The null or naive model can be fit using the lm() function that is used to fit more elaborate models.
# a null model, mean as the prediction
<- lm(tann ~ 1) tann_lm0
Note the “formula” in the function above; tann ~ 1
means “predict tann
as a function of a constant value” (namely the mean). The lm()
function is silent, meaning it does not produce any output itself, but it can always be examined by typing the name of the object it produces, e.g., tann_lm0
, or by applying the summary()
or attributes()
functions to that object
# plot the regression line
plot(tann ~ elevation)
abline(abline(mean(tann),0))
Here the regression line is simply a horizontal line at the mean value of the response variable. Another way of looking at this is that the slope of the regression line is 0, and the intercept is equal to the mean of the dependent or response variable.
# examine the model object
summary(tann_lm0)
The summary()
function here provides, in the case of this initial model, a somewhat overblown description of the mean and standard deviation of tann
, which can be verified using the mean()
and sd()
functions.
The plot()
function with a model object as an argument provides a sequence of four diagnostic plots. These include:
Before plotting, use the par()
function to set up a 2x2 arrangement of plots.
# standard regression diagnostics (4-up)
<- par(mfrow = c(2, 2))
oldpar plot(tann_lm0, which=c(1,2,4))
par(oldpar)
(Note that only three plots are produced here—the leverage plot is skipped for this model, because it doesn’t make any sense.))
The first use of the par()
function sets up the RGraphics window to contain four plots on a single page, and par(oldpar)
restores the original one-plot per page format. In the case of this “null” model, the normal QQ plot will provide information on the normality of the residuals, while the Cook’s distance plot indicates which observations have large influence on the mean.
The residuals (deviations from the mean) can be referenced as tann_lm0$residuals
. Map those values. Note that line limits <- max(abs(tann_lm0$residuals)) * c(-1, 1)
figures out a good range for the values, centered on zero. First the absolute values are found, then the maximum absolute value, and then the multiplication creates a pair of values ranging from -1 times the maximum absolute value to +1 time the maximum absolute value.
# map the residuals
<- "Residuals from tann_lm0 (C)"
title <- "tann_lm0$residuals"
varname <- tann_lm0$residuals
plotvar <- max(abs(tann_lm0$residuals)) * c(-1, 1)
limits map_temp(title, varname, plotvar, limits)
Q1: Compare the values and the patterns of the maps of
tann
andtann_lm0$residuals
. How are they alike, and how are they different?
4. Bivariate regression
From previous analyses, it’s apparent that tann
has a clear relationship with elevation
, and so the successive regressions will begin using elevation as a predictor. Before fitting the regression model, explore the relationship between tann and elevation
.
# first regression model -- tann ~ elev
<- lm(tann ~ elevation) tann_lm1
The lm()
function fits linear (straight-line) relationships between the response variable (tann
) and the predictor (elevation
). Note the formula used here: tann
varies as a function of elevation
.
# plot the regression line
plot(tann ~ elevation)
abline(tann_lm1, col="blue")
The plot()
and abline()
functions plot the data and draw regression line
# examine the model object
summary(tann_lm1)
Note that the summary table now includes the regression coefficients and several goodness-of-fit statistics.
# standard regression diagnostics (4-up)
<- par(mfrow = c(2, 2))
oldpar plot(tann_lm1, which=c(1,2,4,5))
par(oldpar)
Replot the residual scatter plot (residuals vs. fitted values), adding a lowess curve to emphasize the pattern.
# another view of the residual scatter diagram
plot(tann_lm1$fitted.values, tann_lm1$residuals)
lines(lowess(tann_lm1$fitted.values, tann_lm1$residuals, f=0.80), col="red")
Q2: Examine the regression output in the console window. Is there a significant relationship between tann and elevation? What are the F statistic, its p-value, the Multiple R-Squared value, and the residual standard error? Are the intercept and slope significantly different from zero (and how do you know)?
Q3: Describe the patterns on the residuals vs. fit plot and normal QQ plot. Do these plots suggest that there is no information in the residuals, or is there some kind of pattern? Are there any unusually influential observations? Is this first analysis ok, or should we consider modifying it?
(I’ll answer this one–there is a distinct arch-shaped pattern on the residuals vs fit plot, nicely summarized by the loess curve, and while the points are generally linear on the residual QQ plot, they begin to drift away from the 1:1 line for residual values greater than +1. The Crater Lake and Seneca values seem to have a large influence on the regression model. It looks like for this simple model, several of the underlying assumptions have been violated, and we should consider looking for a better model.)
Map the residual values from this simple linear regression model (tann_lm1$residuals
), and compare their pattern with those of from the null model.
# map the residuals
<- "Residuals from tann_lm1 (C)"
title <- "tann_lm1$residuals"
varname <- tann_lm1$residuals
plotvar <- max(abs(tann_lm0$residuals)) * c(-1, 1)
limits map_temp(title, varname, plotvar, limits)
Note that the map pattern of the residuals is less well organized than that for the naïve model.
5. Multiple regression
One of the assumptions that underlies regression analysis is that we have fit the right model, and one way of expressing the other assumptions that describe the properties of the residuals is that the residuals should appear free of any pattern. This is clearly not the case for the simple model, and the reason should be apparent–the relationship between tann and elevation is not linear. There are two possible solutions. One would be to fit a non-linear model, for example a parabola or quadratic curve to the relationship, but there really isn’t a physical justification for why the temperature decrease with increasing elevation should be non-linear in that particular fashion. An alternative approach is to look for additional predictors variables that might be able to account for the pattern in the residuals, and include these in a “multiple regression model.”
An obvious set of additional predictor variables for this exercise is provided by latitude and longitude. (In practice, additional variables might not immediately come to mind.) Check to see whether any of the “information” (or pattern) in the residuals might be explained by latitude and longitude by constructing scatter plots with loess curves for the residuals:
# plot residuals vs. other predictors
plot(tann_lm1$residuals ~ longitude)
lines(lowess(tann_lm1$residuals ~ longitude, f=0.80), col="red")
plot(tann_lm1$residuals ~ latitude)
lines(lowess(tann_lm1$residuals ~ latitude, f=0.80), col="red")
Q4: Describe the relationships between the residuals from the linear model and latitude and longitude. (Hint: It might be good to look at the maps of residuals too.)
The relationships you should have seen in the above question are important enough that we should consider including longitude and latitude as predictors in the regression model. Here’s a multiple regression model that includes location:
# second regression -- tann ~ elevation + latitude + longitude
<- lm(tann ~ elevation + latitude + longitude) tann_lm2
This code creates another model object tann_lm2
, which contains the results of the multiple regression with tann
as the response, and elevation
, latitude
, and longitude
as predictors. Examine the diagnostic plots and maps of the residuals from tann_lm2
. (Hint remember to change all instances of the model object and variable names in the blocks of code necessary to do this.)
Q5: Look at the summary for this model. Describe how the F-statistic, its p-value, the R2, and Residual Standard Error values have changed. Are all the predictors significant (Hint: examine their t-statistics.) Are any patterns in the residuals from this model less obvious than those from the previous model? Are they completely gone?
Several other multiple regression models with elevation, latitude, and longitude as predictors, (each allowing for interactions among the predictors), in a sense create additional predictors (latitude x longitude, elevation x latitude), which in the case of spatial data as here, allow the predicted “surface” more flexibility. Here are two such interaction models:
tann_lm3 <- lm(tann ~ elevation + latitude*longitude)
(allows for interactions between latitude and longitude, fitting a surface slightly more complex than a simple plane).
tann_lm4 <- lm(tann ~ elevation*latitude*longitude)
(simultaneous interactions among all three predictors)
Because the exercise continues below, you might infer that fitting these more elaborate models doesn’t quite take care of the model deficiencies noted above.
6. Nonparametric (Loess) regression
It could be the case that the individual and joint relationships between tann
and elevation
, latitude
and longitude
in the previous multiple regressions aren’t quite perfectly represented by linear relationships. The loess fitting procedure can be generalized to more than one predictor variable, which allows the flexibility of the loess curve to be extended to the multiple regression context. In essence, the method produces a number of local regression analyses, allowing the form of the relationship to vary across the space defined by the predictor variables, which in this example, turns out to be geographical space.
Fit a “local” (loess) regression model to the tann
data, using elevation
as a predictor:
# loess -- simplest model
<- loess(tann ~ elevation, span=0.80, degree=2)
tann_lo1
# examine the fit
summary(tann_lo1)
plot(tann ~ elevation)
<- predict(tann_lo1)
hat lines(elevation[order(elevation)], hat[order(elevation)], col="red")
cor(tann, hat)^2
The extraction of information from the model object (tann_lo1
) is a little different when fitting a loess regression model than when fitting a linear model using the lm()
function. Because loess regression is “nonparametric” there are no regression coefficients, and F- and t-statistics are not produced; however, the goodness-of-fit of the regression can still be determined using R2 and residual standard error values. For example, the cor()
function can be used to get the correlation between the response variable and the fitted values. Squaring this value gives a statistic comparable with the R2 values from earlier regressions.
# residual scatter diagram
plot(tann_lo1$fitted, tann_lo1$residuals)
lines(lowess(tann_lo1$fitted, tann_lo1$residuals, f=0.80), col="red")
# normal probablility plot
qqnorm(tann_lo1$residuals)
This block of code constructs the residual scatter diagram. Note that the fitted (predicted) values and the residuals are contained in the variables tann_lo1$fitted
, and tann_lo1$residuals
.
Map the residuals:
# map the residuals
<- "Residuals from tann_lo1 (C)"
title <- "tann_lo1$residuals"
varname <- tann_lo1$residuals
plotvar <- max(abs(tann_lo1$residuals)) * c(-1, 1)
limits map_temp(title, varname, plotvar, limits)
Q6: Compare the results of this regression with the first linear regression (tann_lm1), in particular compare the goodness of fit, residual scatter diagram, and map patterns of residual. Does fitting a flexible or local curve improve the fit relative to fitting a straight line?
Now fit an expanded model with elevation, latitude, and longitude as predictors (equivalent to tann_lm2
):
# loess -- elevation, latitude, and longitude
<- loess(tann ~ elevation + latitude + longitude, span=0.80, degree=2)
tann_lo2
# examine the fit
summary(tann_lo2)
<- predict(tann_lo2)
hat cor(tann, hat)^2
# residual scatter diagram
plot(tann_lo2$fitted, tann_lo2$residuals)
lines(lowess(tann_lo2$fitted, tann_lo2$residuals, f=0.80), col="red")
# normal probablility plot
qqnorm(tann_lo2$residuals)
And map the residuals again.
Q7: Does the loess regression fit the
tann
data better, worse, or about the same as the multiple regression? (Note: it might be easier to answer Q7 and Q8 simultaneously.) Examine the residual scatter diagrams and QQ Normal plots. Has there been any improvement (i.e. less pattern) in the residuals of thetann_lo2
model relative to those from the other regressions?
Q8: Which of the various models seem to be the optimal one, and why do you think it is?
7. What to turn in
Turn in answers to the above questions, and one or more of the plots you generated in section 2.
That’s all!