Exercise 3

NOTE: This page has been revised for Winter 2021, but may undergo further edits.

Geography 4/595: Geographic Data Analysis
Winter 2021

Exercise 3: Bivariate Plots & Descriptive Statistics
Finish by Friday, January 22

1. Introduction

The purpose of this exercise is to introduce some bivariate plots and basic descriptive statistics.

Read through the exercise before attempting to complete it.

Before starting the exercise, you may want to load the workspace contained in the geog495.RData file, which contains the individual .csv and shapefiles used in the exercises. See Section 6 of the “packages-and-data” instructions: [packages-and-data.html] Downloading this workspace eliminates having to read individual files, but the instructions for doing so are still included below.

Begin by starting RStudio. Check to make sure that you’re back in your working directory. You can verify that R is looking at the correct folder by clicking on File > Change dir… on the RGUI menu (Windows) or Misc > Change Working Directory… (Mac). If you’re in the folder you created in the first exercise, fine, otherwise you could browse to it.

2. Scatter Diagrams

Simple scatter diagrams

Check to make sure the Summit Creek “data frame” (the data set after it has been read into R’s workspace) is still in the workspace:

ls()

If not, re-read the sumcr.csv file as in Section 8 of Exercise 1. Also, use the names() function to get a reminder of the names of the variables in the data frame, and attach the data frame so you can refer to individual variables by their simple names (e.g. WidthWS as opposed to the complete full name sumcr$WidthWS):

names(sumcr)
attach(sumcr)

The scatter diagram (sometimes also referred to as a “scatter plot”, “scattergram” or “X-Y plot” is the most basic of multivariate (more that one variable) plots, and is the workhorse among bivariate plots. Produce a scatter diagram (with water-surface depth (DepthWS) as the Y-axis variable, and cumulative length (CumLen) as the X-axis variable:

plot(CumLen, DepthWS)   # X-axis variable first, Y-axis variable second

Produce a second scatter diagram using the above steps for water-surface width (WidthWS) versus bankfull width (WidthBF). On Windows, before creating this second plot, click on the R graphics window to give it the focus, and then use the menu command History > Recording to start saving copies of the plots so that you can use PgUp and PgDn to review previous plots.

Q1: What do the overall relationships between DepthWS and CumLen and between WidthWS and WidthBF look like? (Contrast them: Which is stronger, which is weaker? What is the sense of the relationships (positive or negative)?)

Detach the sumcr data frame before going on:

detach(sumcr)

Annotated scatter diagrams

Download the Oregon climate station data from this link: [orstationc.csv] and save it in your working directory. Read the data into R’s workspace using the following code:

orstationc <- read.csv("../data/orstationc.csv", as.is=T)

Note that you may have to modify the path to point to the particular folder where your data is. The path in the code above assumes that the folder structure that you set up in exercise 1 looks like

../geog495
      /data

… and that you downloaded the orstationc.csv dataset to ../geog495/data/.

(The “as.is=T” argument in the read.csv() function tells R not to convert text information (station and Name) into group-membership factors, which is the default behavior.)

If you’ve accidentally downloaded the file to some different place than your working directory, you can search for it while loading the file using the file.choose() function:

orstationc <- read.csv(file.choose(), as.is=T)

(but don’t do this if you were successful the first time.)

Attach the data frame, and create a crude map by plotting longitude on the X-axis and latitude on the Y-axis:

attach(orstationc)
plot(lon, lat)

Label the points with the station-name abbreviations using the text() function

text(lon, lat, labels=station)

(Note that text() won’t work unless there is a plot with the same X-axis and Y-axis variables already open.)

Now plot annual precipitation (pann) as a function of elevation (elev) (i.e. elev on the X-axis, pann on the Y-axis):

plot(elev, pann)

Although it might seem more intuitive to plot elevation on the vertical axis, this particular arrangement of variables is a strong convention in data analysis–elevation is the control and is plotted along the x-axis while precipitation is the response and plotted along the y-axis.

Q2: Describe the relationship. Are the data “well behaved” in the sense that most points fall in a distinct cloud of points, or are there outliers or unusual points?

Label individual points using the identify() function:

identify(elev, pann, labels=station)

To stop identifying points, hit Esc. The behavior of indentify() in RStudio is a little different than in the individual Windows or MacOS GUIs-—individual points are indicated with a “drop pin”-type marker that appears as they are selected, but the labels don’t appear until Esc is hit. The labels that appear then are “sticky” in that they remain identified if the identify() function is used again.

Q3: Use the crude map of station names as well as simple inspection of the data to describe local variations in the basic relationship between elevation and annual precipitation. It may be useful to create other scatter plots, say, of elevation and longitude, or latitude and longitude and annual precipitation to get further insight. (A quick way of seeing lots of plots is to use the plot() function with a list of numeric variables, e.g., plot(orstationc[,2:10]) – the “construct” [,2:10] means all rows, and columns 2 through 10 of the dataframe).) Unfortunately identify() only works on simple scatter plots.

Detach the orstationc data frame as above.

3. Descriptive Statistics

Simple descriptive statistics

This part of the exercise involves producing some descriptive statistics for the Summit Cr. data. The easiest way to get a sense of what information is provided by a particular statistic (say, the standard deviation of water-surface width within a particular reach) is to compare descriptive plots for the variable with the values of a descriptive statistic.

Produce some descriptive statistics for the variable WidthWS (This is almost too easy!) :

attach(sumcr)
summary(WidthWS)

By themselves, the statistics don’t mean much. However, you can contrast the amount of information provided by the numerical and graphical descriptions, and the efficiency with which it is presented. Get the summary statistics for bankfull width (WidthBF) as well.

Q4: Compare the two variables using the summary statistics and whatever descriptive plots that may seem useful. Which variable has typically larger values, which varies more from one observation to another?

Descriptive statistics by observation-group membership

The calculation of descriptive statistics for subsets of a data set provides one means for investigating the influence of another variable on the variations within a data set of a variable of interest.

Remember that the Summit Cr. study area is divided into three sections: an upstream reach (reach A, represented by an A in column 2) in which cattle are permitted to graze, a middle reach (reach B, represented by a B) from which cattle have been excluded, and a downstream reach (reach C, represented by a C), in which cattle were again permitted to graze.

Use the tapply() function to get descriptive statistics for WidthWS, stratified by reach.

tapply(WidthWS, Reach, summary)

The arguments of the function are 1) the name of the variable (WidthWS), the name of the grouping variable used to stratify the data or assign to groups (Reach), and the function to apply to the stratified data (in this case, summary()).

You should see three blocks of summary statistics, one for each reach. Before attempting to answer the following, also produce a univariate scatter diagram (see Section 1 of Exercise 2) that shows the values of WidthWS plotted versus observation number.

Q5: Describe the variations in the location (e.g. mean) and dispersion (i.e. variance or standard deviation) of water-surface width across the reaches (i.e. from the upstream grazed reach, the middle ungrazed reach, and the downstream grazed reach). Note: This is a purposefully vague question. Think about what kind of information you will need to present to adequately describe each reach, and to compare the values of WidthWS from one reach to another. You might think about a one-page display that would contain a couple of plots and an abbreviated table.

You can leave the sumcr data frame attached, because there is no overlap among variable names in sumcr and orstationc.

4. Correlations

Inferring the strength of a relationship (and sometimes even the sign) from a scatter plot is somewhat of a judgment call—e.g., what kind of pattern on a scatter plot indicates a “strong” relationship? Consequently, it would be convenient to have some kind of more objective measure of the strength of a relationship than simple visual interpretation. This is provided by the correlation coefficient—a single number description of the extent to which the relationship between two variables can be described by a straight line.

Attach the orstationc data frame again, and also plot pann “as a function of” elev (i.e. plot pann along the y-axis and elev on the x-axis. Then get the correlation coefficient between annual precipitation and elevation:

cor(elev, pann)

By itself, a single correlation coefficient is difficult to interpret, so you may wish to calculate correlations for other pairs of variables, and compare these to their scatter plots. Any easy way to do this to create a “correlation matrix” equivalent to the scatter plot matrix you created earlier:

plot(orstationc[,2:10])
cor(orstationc[,2:10])

Detach the orstationsc dataframe:

detach(orstationc)

Q6: Which pairs of variables are most highly correlated, and which less so? Are the correlations consistent with your subjective interpretation of the scatter plots or not?

Next, attach the sumcr dataframe again, and produce a scatter plot for CumLen vs WidthWS, using the alternative “equation” form of specifying the variables to be plotted.

attach(sumcr)
plot(WidthWS ~ CumLen)

In this form of specifying variables, the tilde (“~”) can be read as “as it varies with” or “as a function of”.

Also get the correlation between these two variables.

Q7: Describe the relationship evident in the scatter plot. (Is is simply a jumble of points, or is there some pattern in the plot?) Is relationship strong or weak? Now look at the correlation coefficient. Is its value consistent with your inference of the strength of the relationship? If so, does that make sense, if not, why not?

Now construct a smooth curve to summarize the relationship apparent on the scatter diagram using the loess() function: (note–if for some reason the plot of CumLen vs WidthWS disappeared, regenerate it using the above example.)

loess.model <- loess(WidthWS ~ CumLen)
loess.model
hat <- predict(loess.model)
lines(CumLen[order(CumLen)], hat[order(CumLen)], col="red")

The loess() function finds the smooth curve and saves the details (goodness-of-fit-statistics and fitted values) in the object loess.model, typing loess.model prints out the summary statistics, the predict() function puts the fitted values into the variable hat, and the lines() function draws the line on the current scatter plot.

Construct a scatter diagram of WidthWS plotted as a function of hat, and get the correlation between those variables.

Q8: What does the relationship between WidthWS and the fitted values (hat) from the loess curve look like? (Compare it with the scatter diagram of WidthWS and CumLen.) What does this WidthWS vs hat scatter diagram (and companion correlation coefficient) suggest now about the strength of the relationship between WidthWS and CumLen.?

5. What to hand in.

Answers to the questions, and the scatter plots for Q7 and Q8.