NOTE: This page has been revised
for the 2024 version of the course, but there may be some additional
edits.
Geography 4/590: R for Earth-System Science
Winter 2024
Exercise Using R
Finish by Monday, Jan 29
1. Introduction
The aim of this exercise is to use R for doing some simple analyses of “typical” Earth-system science data sets (with “typical” in quotes because the data sets aren’t as big as many). The exercise parallels the lecture presentation, and generally involves the simple modification of existing code (to, say, plot a different variable) rather than the creation of new code. (But modification of existing code is generally what one does.)
Note that the code below can be copied using the little clipboard links, and pasted into RStudio. For this first exercise, opening a new R script file (*.R) will probably be sufficient, but you may wish to also experiment with an RMarkdown “Notebook” file, or a full RMarkdown file. In RStudio, a new file can be created with the File > New File menu, or from the “new” icon on the toolbar. The first thing to do is to save the file with a sensible name, and in a sensible place (where you can find it again).
2. Read data sets
The first task is usually to read a data set, and this first part of the exercise simply repeats the lecture material. There are two basic sources for data to be read into R: 1) data (e.g. a .csv file or netCDF) on disk (or from the web), and 2) an existing R workspace (i.e. a *.RData file) that might contain, in addition to the input data, functions or intermediate results, for example.
Read a .csv file
Although it is possible to browse to a particular file in order to open it (i.e. using the file.choose() function), for reproducibility, it’s better to explicitly specify the source (i.e. path) of the data. The following code reads a .csv file (IPCC-RF.csv) that contains radiative forcing values for various controls of global climate since 1750 CE, expressed in terms of Wm-2 contributions to the Earth’s preindustrial energy balance. The data can be downloaded from https://pages.uoregon.edu/bartlein//RESS/csv_files/IPCC_RF.csv by right-clicking on the link, and saving to a suitable directory.
Here, the file is assumed to have been downloaded to the folder /Users/bartlein/projects/geog490/data/csv_files/. (Note that this folder is not necessarily the default working directory of R, which can be found using the function getwd(), or set using setwd("directory_name"), where "directory_name" is the name of the directory you wish to set the working directory to.)
This is an example of reading the radiative forcing .csv file. Don’t actually run this; the data will be read in along with a lot of other files in the next step.
# read a .csv file
csv_path <- "/Users/bartlein/projects/geog490/data/csv_files/"
csv_name <- "IPCC-RF.csv"
csv_file <- paste(csv_path, csv_name, sep="")
IPCC_RF <- read.csv(csv_file)
Load a saved .RData file (from the web)
An alternative way of reading data is to read in an entire workspace (*.RData file) at once. The following code reads those data from the web, by opening a “connection” to a URL, and then using the load() function (which also can be used to load a workspace file from a local folder).
# load data from a saved .RData file
con <- url("https://pjbartlein.github.io/REarthSysSci/data/RData/geog490.RData")
load(file=con)
Note how the “Environment” tab in RStudio has become populated with individual objects.
3. Data set summaries
A quick look at the data can be gotten using the str() (“structure”) function, and a “five-number” (plus the mean) summary of the data can be gotten using the summary() function. Here the data are the “SPECMAP” oxygen-isotope data that (more-or-less) represent the size of the ice sheets over the past 800,000 yrs, along with the values of insolation (incoming solar radiation) at 65N.
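For example, a minimal sketch (assuming the data frame is named specmap, as in the plotting code below):
# quick looks at the SPECMAP data
str(specmap)
summary(specmap)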
Other summary functions provide other basic information on a data set:
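For example (a sketch using standard base-R functions):
names(specmap)   # variable names
dim(specmap)     # number of rows and columns
head(specmap)    # first few rows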
4. Simple visualization
Univariate enumerative plots – Index plots
Here is the code for generating the index plot of oxygen-isotope values from the specmap data set.
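A minimal sketch (assuming the oxygen-isotope variable is named O18, as in the questions and code below; attaching the data frame makes the bare variable names used in later chunks available):
# attach the specmap data frame, then make an index plot of O18
attach(specmap)
plot(O18)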
And here’s the first task/question:
Q1: Modify the above code to plot the variable specmap$Insol instead of specmap$O18, and describe the pattern of variability in that plot relative to the plot of O18: Is it similar? More regularly periodic? Completely random?
The answers don’t have to be elaborate, but should be full sentences, with capitalization, punctuation, etc.
Univariate summary plots – stripcharts, histograms and density plots
Here is the code for producing stripcharts, histograms and density
plots of the O18
data.
# stacked stripchart
stripchart(O18, method="stack", pch=16, cex=0.5) # stack points to reduce overlap
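And here is a sketch of a histogram (the breaks argument controls the number of bins, which Q2 below asks you to experiment with):
# histogram of O18
hist(O18, breaks = 20)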
Q2: Modify the code above to produce a stripchart or histogram for Insol, being sure to experiment with the number of breaks. What is the distribution of Insol like relative to that of O18?
Here’s the code for a density plot:
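A minimal sketch:
# density plot of O18
O18_density <- density(O18)
plot(O18_density)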
Q3: Once again, modify the code to get a density plot for Insol. Is the density plot consistent with the information provided by the histogram?
After finishing with an attached data set, it’s good practice to detach it using detach():
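# detach the specmap data frame
detach(specmap)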
Univariate summary plots – boxplots
Before working on the cirques dataframe, attach it, and get the names of the variables:
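A sketch of those two steps:
# attach the cirques data frame and list its variable names
attach(cirques)
names(cirques)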
This will produce a quick-and-dirty map of the data:
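One simple way to do that (a sketch, assuming the longitude and latitude variables are named Lon and Lat, as in the scatter-plot code in the next section):
# quick-and-dirty map: plot the cirque locations
plot(Lon, Lat, pch=16, cex=0.8)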
Here is the code for producing boxplots of elevation, as a function of glaciation at present:
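A minimal sketch (assuming the glaciated/not-glaciated indicator is the Glacier variable used in the scatter-plot code below):
# boxplots of elevation by glaciation
boxplot(Elev ~ Glacier)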
Q4: Modify the code above to produce paired (not-glaciated/glaciated) boxplots for longitude and latitude. Keeping in mind that longitude increases (i.e. becomes less negative) to the east, describe the spatial distribution of glaciated cirques relative to unglaciated ones.
5. Bivariate plots
Scatter diagrams
Here is the code for plotting cirque elevation as a function of longitude (i.e. longitude on the x-axis, elevation on the y-axis).
# cirques: Elev vs. Lon
plot(Elev ~ Lon, type="n")
points(Elev ~ Lon, pch=16, col=as.factor(Glacier))
Q5: Examine the relationship between cirque elevation and latitude. Is there any kind of systematic relationship, or is the relationship completely random?
6. Enhanced (ggplot2) graphics
First, load some packages:
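For example (the three packages described below are sf, RColorBrewer, and classInt; ggplot2 is included here on the assumption that it has not already been loaded, since it is used for the plots that follow):
# load packages
library(sf)
library(RColorBrewer)
library(classInt)
library(ggplot2)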
The first package is the sf package, which handles spatial data, RColorBrewer implements Cindy Brewer’s color scales, and the third, classInt, is a utility package that facilitates recoding continuous numerical data into factors (or class intervals) for mapping.
Before continuing, it would be efficient to calculate a new variable, the July:January precipitation ratio for the Oregon climate-station data set. This can be conveniently done as follows, where the names() function is used to verify that the variable has been added. (Note that if the workspace is not saved at the end of the current R session, this new variable will go away.)
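A sketch of that calculation, assuming the July and January precipitation variables in orstationc are named pjul and pjan:
# July:January precipitation ratio
orstationc$pjulpjan <- orstationc$pjul / orstationc$pjan
names(orstationc)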
Here is some code for plotting the new variable:
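One possibility is a histogram (a sketch; other univariate plots would also work):
# histogram of the July:January precipitation ratio
hist(orstationc$pjulpjan, breaks = 20)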
Q6: Describe the distribution of orstationc$pjulpjan. Is the distribution symmetric or not? Does the distribution of the precipitation ratio explain the particular scale cutpoints that have been used to map the precipitation-ratio variable?
Plot the precipitation-ratio data again, this time transformed by taking the logarithms (base 10) of the data:
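For example:
# histogram of the log10-transformed precipitation ratio
hist(log10(orstationc$pjulpjan), breaks = 20)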
Q7: What does the log transformation do to the distribution of the precipitation-ratio data?
Next, take a look at the relationship between the precipitation-ratio variable and elevation.
## ggplot2 scatter diagram of Oregon climate-station data
ggplot(orstationc, aes(elev, pjulpjan)) + geom_point() + geom_smooth(color = "blue") +
  xlab("Elevation (m)") + ylab("July:January Precipitation Ratio") +
  ggtitle("Oregon Climate Station Data")
Q8: Describe the relationship. One issue that might be influencing this relationship is that there is a strong relationship between longitude and elevation across the state. How might that be addressed by simple plots?
7. Maps with ggplot2
The next code chunk plots the locations of the Oregon climate stations:
# ggplot2 map of Oregon climate-station locations
ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat), size = 5.0, color = "gray50", pch=16) +
  labs(x = "Longitude", y = "Latitude") +
  theme_bw()
There are two components to this map. The first is the basemap outlines of Oregon counties, which are contained in the orotl_sf “simple features” object. The contents of orotl_sf can be viewed by simply typing orotl_sf at the command line in the Console window, or mapped using the following:
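A minimal sketch:
# map the county outlines
ggplot() + geom_sf(data = orotl_sf, fill = NA) + theme_bw()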
The second component is the orstations_sf object, which contains the climate data itself:
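For example, list the first few rows (a sketch):
# look at the climate-station data
head(orstations_sf)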
Here’s a map of annual precipitation (orstations_sf$pann):
# recode the annual precipitation data to a factor
cutpts <- c(0,200,500,1000,2000,9999)
plot_factor <- factor(findInterval(orstations_sf$pann, cutpts))
nclr <- 5
plotclr <- brewer.pal(nclr+1,"BuPu")[2:(nclr+1)]
# get the map
ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat, color = plot_factor), size = 5.0, pch=16) +
  scale_colour_manual(values=plotclr, aesthetics = "colour",
    labels = c("0 - 200", "200 - 500", "500 - 1000", "1000 - 2000", "> 2000"),
    name = "Ann. Precip.", drop=TRUE) +
  labs(x = "Longitude", y = "Latitude") +
  theme_bw()
The following code will plot the July:January precipitation-ratio data using ggplot2. First, recode the data to a factor variable indicating the particular class interval each data value falls in. (And note the (quasi-)logarithmic scale used.)
# recode pjulpjan to a factor (note: cutpoints are general for Western N.A.)
cutpts <- c(0.0, .100, .200, .500, .800, 1.0, 1.25, 2.0, 5.0, 10.0)
pjulpjan_factor <- factor(findInterval(orstationc$pjulpjan, cutpts))
head(cbind(orstationc$pjulpjan, pjulpjan_factor, cutpts[pjulpjan_factor]))
Here is code for the map:
# ggplot2 map of pjulpjan
ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat, color = pjulpjan_factor), size = 5.0, pch=16) +
  scale_color_brewer(type = "div", palette = "PRGn", aesthetics = "color", direction = 1,
    labels = c("0.0 - 0.1", "0.1 - 0.2", "0.2 - 0.5", "0.5 - 0.8", "0.8 - 1.0",
               "1.0 - 1.25", "1.25 - 2.0", "2.0 - 5.0", "5.0 - 10.0", "> 10.0"),
    name = "Jul:Jan Ppt. Ratio") +
  labs(x = "Longitude", y = "Latitude") +
  theme_bw()
Note that the Purple to Green color scale spans the range of precipitation-ratio data for the state, as opposed to the whole of the western U.S.
Q9: Describe the pattern of July:January precipitation ratios across the state.
8. Univariate statistics
There are actually two versions of the Oregon climate-station data in the geog490.RData database: orstations_sf is a “simple features” spatial object (and data.frame), while orstationc is a simple data frame, read in from a .csv file. The difference can be noted using the class() and head() functions:
# classes of orstations_sf and orstationc
class(orstations_sf)
head(orstations_sf)
class(orstationc)
head(orstationc)
Generally, if there is any mapping to be done, the sf (simple features) spatial data object is the preferred way to store the data, but if the analysis focuses only on simple analyses, the data.frame/spreadsheet format is fine.
Here is code for summarizing the data.frame version of the Oregon climate-station data, which has spatial information in it, but is not a spatial object like orstations_sf.
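A minimal sketch:
# summary of the Oregon climate-station data frame
summary(orstationc)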
9. Bivariate descriptive statistics (correlations)
The following code produces a correlation matrix and a corrplot() graphical depiction of the matrix. Note the use of the round() function to reduce the number of digits in the correlations. “Unwrap” the cor() function to see what the correlation matrix looks like unrounded.
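For example, the rounded correlation matrix itself might be produced like this (a sketch, assuming columns 2 through 10 hold the numeric climate variables, as in the corrplot() call below):
# correlation matrix, rounded
round(cor(orstationc[2:10]), 3)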
# corrplot of correlation matrix
library(corrplot)
corrplot(cor(orstationc[2:10]), method = "color")
10. Parallel-coordinate plots
Next, take a look at some parallel-coordinate plots for the Oregon
climate-station data. First, create a new datafram, eliminating the
character and x- and y-coordinate variables from the data frame. The the
use of the “square-bracket” selection method to grab specific columns
from the orstationc
dataframe.
# create new data frame without text
orstationc2 <- data.frame(orstationc[2:10],orstationc[15])
names(orstationc2)
Parallel-coordinate plots provide an elegant way to look for structure in multiple-variable data sets, especially spatial data. The plots are constructed by rescaling (0 to 1) the values of the individual variables, and then connecting the values for individual cases or observations with a line, with each variable represented by a “column” in the plot. If there is any pattern (spatial or otherwise) in the data, groups of observations (individual lines) will show up, but if the data are simply random, the parallel-coordinate plot will look like a structureless jumble of lines.
Here is the code for a parallel-coordinate plot of the Oregon climate-station data:
# load library
library(GGally)
library(gridExtra)
# parallel coordinate plot
ggparcoord(data = orstationc2,
  scale = "uniminmax", alphaLines = 0.1) + ylab("") +
  theme(axis.text.x = element_text(angle=315, vjust=1.0, hjust=0.0, size=10),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(), axis.ticks.y = element_blank() )
The plot is a little messy, but a prominent “M” pattern emerges, shown by high January and annual temperatures and low July temperatures.
Next, explore the spatial patterns of the Oregon climate-station data (including the July:January precipitation-ratio variable). Here’s the code for a selection of points by latitude/longitude ranges (i.e. identifying stations in western Oregon):
# lon/lat window
lonmin <- -125.0; lonmax <- -122.0; latmin <- 42.0; latmax <- 49.0
lon <- orstationc2$lon; lat <- orstationc2$lat
orstationc2$select_pts <- factor(ifelse(lat >= latmin & lat <= latmax & lon >= lonmin & lon <= lonmax, 1, 0))
# pcp
a <- ggparcoord(data = orstationc2[order(orstationc2$select_pts),],
  columns = c(1:10), groupColumn = "select_pts",
  scale = "uniminmax", alphaLines=0.3) + ylab("") +
  theme(axis.text.x = element_text(angle=315, vjust=1.0, hjust=0.0, size=8),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(), axis.ticks.y = element_blank(),
    legend.position = "none") +
  scale_color_manual(values = c(rgb(0, 0, 0, 0.3), "red"))
# map
b <- ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat, color = orstationc2$select_pts), size = 4.0, pch=16) +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("gray", "red"))
grid.arrange(a, b, nrow = 1)
And here is the code for a variable-value selection, where the station (point) selection identifies stations with high July to January precipitation ratios:
# variable-value selection
cutpoint <- 0.2
v <- orstationc2$pjulpjan
v <- (v-min(v))/(max(v)-min(v))  # rescale to 0-1, as in the parallel-coordinate plot
orstationc2$select_pts <- factor(ifelse(v >= cutpoint, 1, 0))  # replace the previous selection
# pcp
a <- ggparcoord(data = orstationc2[order(orstationc2$select_pts),],
  columns = c(1:10), groupColumn = "select_pts",
  scale = "uniminmax", alphaLines=0.3) + ylab("") +
  theme(axis.text.x = element_text(angle=315, vjust=1.0, hjust=0.0, size=8),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(), axis.ticks.y = element_blank(),
    legend.position = "none") +
  scale_color_manual(values = c(rgb(0, 0, 0, 0.3), "red"))
# map
b <- ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat, color = orstationc2$select_pts), size = 4.0, pch=16) +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("gray", "red"))
grid.arrange(a, b, nrow = 1, ncol = 2)
Experiment with the longitudinal cutoff, as well as different variables for selecting points.
Q10: Describe the seasonality of precipitation and temperature across the state. Don’t worry if that’s hard to put into words; data-analytical visualizations are possibly the best instance of the “one picture is worth a thousand words” maxim.