Geography 4/590: R for Earth-System Science
Spring 2019

Exercise Using R
Finish by Friday, April 19

1. Introduction

The aim of this exercise is to use R for doing some simple analyses of “typical” Earth-system science data sets (with “typical” in quotes because the data sets aren’t as big as many). The excersize parallels the lecture presentation, and genrally involves the simple modification of existing code (to, say, plot a different variable) than the creating of new code. (But modification of existing code is generally what one does.)

Note that the code below can be copied using the little clipboard links, and pasted into RStudio. For this first exercise, opening a new R script file (*.R) will probably be sufficient, but you may wish to also experiment with an RMarkdown “Notebook” file, or a full RMarkdown file. In RStudio, a new file can be created with the File > New File menu, or from the “new” icon on the toolbar. The first thing to do is to save the file with a sensible name, and in a sensible place (where you can find it again).

2. Read data sets

The first task is usually to read a data set, and this first part of the exercise simply repeats the lecture material. There are two basic sources for data to be read into R: 1) data (e.g. a .csv file or netCDF) on disk (or from the web), and 2) an existing R workspace (i.e. a *.RData) file that might contain, in addition to the input data, function or intermediate results, for example.

Read a .csv file

Although it is possible to browse to a particular file in order to open it (i.e. using the file.choose() function), for reproducibility, it’s better to explicitly specify the source (i.e. path) of the data. The following code reads a .csv file IPCC-RF.csv that contains radiative forcing values for various controls of global climate since 1750 CE, expressed in terms of Wm-2 contributions to the Earth’s preindustrial energy balance. The data can be downloaded from [https://pjbartlein.github.io/REarthSysSci/data/csv/IPCC-RF.csv] by right-clicking on the link, and saving to a suitable directory.

Here, the file is assumed to have been dowloaded to the folder /Users/bartlein/projects/geog490/data/csv_files/. (Note that this folder is not necessarly the default working directory of R, which can found using the function getwd())

Read the radiative forcing .csv file:

# read a .csv file
csv_path <- "/Users/bartlein/projects/geog490/data/csv_files/"
csv_name <- "IPCC-RF.csv"
csv_file <- paste(csv_path, csv_name, sep="")
IPCC_RF <- read.csv(csv_file)

Load a saved .RData file (from the web)

An alternative way of reading data is to read in an entire workspace (*.RData) at once. The following code reads those data from the web, by opening a “connection” to a URL, and then using the load() function (which also can be used to load a workspace file from a local folder).

# load data from a saved .RData file
con <- url("https://pjbartlein.github.io/REarthSysSci/data/RData/geog490.RData")
load(file=con) 
close(con) 

Note how the “Environment” tab in RStudio has become populated with individual objects.

3. Data set summaries

A quick look at the data can be gotten using the str() (“structure”) function, and a five number" (plus the mean) summary of the data can be gotten using the summary() function. Here the data are the Oregon cirques data set, which contains the location of cirque basins around the state, and an indicator variable that notes whether they currently contain glaciers or not.

summary(cirques)

Other summary functions provide other basic information on a data set:

# other summaries
names(cirques)
head(cirques)
tail(cirques)

4. Simple visualization

Univariate enumerative plots – Index plots

Here is the code for generating the index plot of oxygen-isotope values from the specmap data set.

# attach SPECMAP data, index plot
attach(specmap)
plot(O18)
# time-series plot
plot(Age, O18, ylim=c(2, -2), pch=16)

And here’s the first task/question:

Q1: Modify the above code to plot the variable specmap$Insol instead of specmap$O18, and describe the pattern of variability in that plot relative to the plot of O18: Is it similar? More regularly periodic? Completely random?

 

The answers don’t have to be elaborate, but should be full sentences, with capitalization, punctuation, etc.

Univariate summary plots – stripcharts, histograms and density plots

Here is the code for producing stripcharts, histograms and density plots of the O18 data.

# stacked stripchart
stripchart(O18, method="stack", pch=16, cex=0.5)   # stack points to reduce overlap
# histogram, more bars
hist(O18, breaks = 20)

Q2: Modify the code above to product a stripchart or histogram for Insol, being sure to experiment with the number of breaks. What is the distribution of Insol like relative to that of O18?

 

Here’s the code for a density plot:

# density plot, different bandwidth
O18.density <- density(O18, bw = 0.10)
plot(O18.density)

Q3: Once again, modify the code to get a density plot for Insol. Is the density plot consitent with the information provided by the histogram?

 

After finishing with an attached data set, it’s good practice to detach it using detach():

# detach specmap
detach(specmap)

Univariate summary plots – boxplots

Before working on the cirques dataframe, attach it, and get the names of the variables:

# attach cirques dataframe and getnames
attach(cirques)
names(cirques)

This will produce a quick-and-dirty map of the data:

# simple map
plot(Lon, Lat, type="n")
points(Lon, Lat, col=as.factor(Glacier))

Here is the code for producing boxplots of elevation, as a function of glaciation at present:

# boxplot of elevations of glaciated and not-glaciated cirques
boxplot(Elev ~ Glacier)

Q4: Modify the code above to produce paired (not-glaciated/glaciated) boxplots for longitude and latitude. Keeping in mind that longitude increases (i.e. becomes less negative) to the east, describe the spatial distribution of glaciated cirques relative to unglaciated ones.

 

5. Bivariate plots

Scatter diagrams

Here is the code for plotting cirque elevation as a function of longitude (i.e. longitude on the x-axis, elevation on the y-axis).

# cirques: Elev vs. Lon
attach(cirques)
plot(Elev ~ Lon, type="n")
points(Elev ~ Lon, pch=16, col=as.factor(Glacier))

Q5: Examine the relationship between cirque elevation and latitude. Is there any kind of systematic relationship, or is the relationship completely random?

 

6. Enhanced (ggplot2) graphics

Before continuing, it would be efficient to calculate a new variable, the July:January precipitation ratio for the Oregon climate-station data set. This can be conveniently done as follows, where the names() function used to verify that the variable has been added. (Note that if at the end of the current R session the workspace is not saved, this new variable will go away.)

# new variable
orstationc$pjulpjan <- orstationc$pjul / orstationc$pjan
names(orstationc)

Here is some code for plotting the new variable:

# load the `ggplot2` package
library(ggplot2)
# ggplot2 histogram
ggplot(orstationc, aes(x=pjulpjan)) + geom_histogram(binwidth = 0.05)
# ggplot2 boxplots
ggplot(orstationc, aes(y = pjulpjan)) + geom_boxplot() 

Q6: Describe the distribution of orstationc$pjulpjan. Is the distribution symmetric or not? Does the distribution of the precipitation ratio explain the particular scale cutpoints that have been used to map the precipitationratio variables?

 

Plot the precipitation-ratio data again, this time transformed by taking the logarithms (base 10) of the data:

# ggplot2 boxplots
ggplot(orstationc, aes(y = log10(pjulpjan))) + geom_boxplot() 

Q7: What does the log tranformation do to the distribution of the precipitation-ratio data?

 

Next, take a look at the relationship between the precipitation-ratio variable and elevation.

## ggplot2 scatter diagram of Oregon climate-station data
ggplot(orstationc, aes(elev, pjulpjan)) + geom_point() + geom_smooth(color = "blue") +
  xlab("Elevation (m)") + ylab("July:January Precipitation Ratio") +
          ggtitle("Oregon Climate Station Data")

Q8: Describe the relationship. One issue that might be influencing this relationship is that there is a strong relationship between longitude and elevation across the state. How might that be addressed by simple plots?

 

7. Simple maps

The following code produces a simple map of one of the Oregon climate-station data.

# Code chunk 1:  set up
library(rgeos)
library(maptools)
library(sp)
library(RColorBrewer)
library(classInt)

# Code chunk 2: set variable and get colors
plotvar <- orstationc$pjan # pick a variable to plot
plottitle <- "January Precipitation"
nclr <- 8 # number of colors
plotclr <- brewer.pal(nclr,"PuBu") # get nclr colors

# find equal-frequency intervals
class <- classIntervals(plotvar, nclr, style="quantile")
colcode <- findColours(class, plotclr)
cutpts <- round(class$brks, digits=1)
cutpts

# Code chunk 3: plot the shape file and the selected variable
plot(orotl_shp)

# add points
points(orstationc$lon, orstationc$lat, pch=16, col=colcode, cex=2)
points(orstationc$lon, orstationc$lat, cex=2) # draws black line around each point

# add legend
legend(x=-118, y=43.5, legend=names(attr(colcode, "table")),
    fill=attr(colcode, "palette"), cex=0.6, bty="n")

# add the title
title(plottitle)

8. Maps with ggplot2

The next code chunk translates the Oregon county shapefile into a “tidyverse” format suitable for plotting with ggplot2:

# fortify shapefile
orcounty_gg <- fortify(orcounty_shp)
head(orcounty_gg)

Here’s what the translated shapfile looks like, plotted using ggplot2:

# ggplot map of orcounty_gg
ggplot(orcounty_gg, aes(long,lat)) + geom_polygon(aes(group = group), color = "gray50", fill = NA) +
  coord_quickmap() + theme_bw()

The following code creates a “bubble plot” of the elevations of the climate station.

# bubble plot
ggplot(orstationc, aes(lon, lat))  +
  geom_polygon(aes(long, lat, group = group), orcounty_gg, color = "gray50", fill = NA) +
  geom_point(aes(size = elev)) +
  coord_quickmap() + theme_bw()

The following code will plot the July:January precipitation data using ggplot2. First, recode the data to a factor variable indicating the particular class interval each data value falls in. (And note the (quasi) logarithmic scale used).

# recode pjulpjan to a factor
cutpts <- c(0.0, .100, .200, .500, .800, 1.0, 1.25, 2.0, 5.0, 10.0)
pjulpjan_factor <- factor(findInterval(orstationc$pjulpjan, cutpts))
head(cbind(orstationc$pjulpjan, pjulpjan_factor, cutpts[pjulpjan_factor]))

Here is code for the map:

# ggplot2 map of pjulpjan
ggplot(orstationc, aes(lon, lat))  +
  geom_polygon(aes(long, lat, group = group), orcounty_gg, color = "gray50", fill = NA) +
  scale_color_brewer(type = "div", palette = "PRGn", aesthetics = "color", direction = 1,
    labels = c("0.0 - 0.1", "0.1 - 0.2", "0.2 - 0.5", "0.5 - 0.8", "0.8 - 1.0",
               "1.0 - 1.25", "1.25 - 2.0", "2.0 - 5.0", "5.0 - 10.0", "> 10.0"),
    name = "Jul:Jan Ppt. Ratio") +
  geom_point(aes(lon, lat, color = pjulpjan_factor), size = 3.0 ) +
  coord_quickmap() + theme_bw()

Note that the Purple to Green color scale spans the range of precipitation-ratio data for the state, as opposed to the whole of the western U.S.

Q9: Describe the pattern of July:January precipitation ratios across the state.

 

9. Univariate statistics

Here is code for summarizing the Oregon climate-station data:

# univariate descriptive statistics
summary(orstationc)

10. Bivariate descriptive statistics (correlations)

The following code produces a correlation matrix and corrplot() graphical depiction of the matrix. Note the use of the round() function to reduce the number of digits in the correlation. “Unwrap” the cor() function to see what the correlation matrix looks like unrounded.

# correlations among Oregon climate-station variables
round(cor(orstationc[2:10]), 3)
# corrplot of correlation matrix
library(corrplot)
corrplot(cor(orstationc[2:10]), method = "color")

11. Parallel-coordinate plots

Next, take a look at some parallel-coordinate plots for the Oregon climate-station data. First, create a new datafram, eliminating the character and x- and y-coordinate variables from the data frame. The the use of the “square-bracket” selection method to grab specific columns from the orstationc dataframe.

# create new data frame without text
orstationc2 <- data.frame(orstationc[2:10],orstationc[15])
names(orstationc2)

Here is the code for a parallel-coordinate plot:

# load library
library(GGally)
library(gridExtra)

# parallel coordinate plot
ggparcoord(data = orstationc2,
  scale = "uniminmax", alphaLines = 0.1) + ylab("") +
  theme(axis.text.x  = element_text(angle=315, vjust=1.0, hjust=0.0, size=10),
        axis.title.x = element_blank(), 
        axis.text.y = element_blank(), axis.ticks.y = element_blank() )

Next, explore the spatial patterns of the Oregon climate-station data (including the July:January precipitation-ratio variable). Here’s the code for a selection of points by latitude/longitude ranges:

# lon/lat window
lonmin <- -125.0; lonmax <- -120.0; latmin <- 42.0; latmax <- 49.0 
lon <- orstationc2$lon; lat <- orstationc2$lat
orstationc2$select_pts <- factor(ifelse(lat >= latmin & lat <= latmax & lon >= lonmin & lon <= lonmax, 1, 0))

# pcp
a <- ggparcoord(data = orstationc2[order(orstationc2$select_pts),],
  columns = c(1:10), groupColumn = "select_pts",
  scale = "uniminmax", alphaLines=0.3) + ylab("") +
  theme(axis.text.x  = element_text(angle=315, vjust=1.0, hjust=0.0, size=8),
        axis.title.x = element_blank(), 
        axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        legend.position = "none") +
  scale_color_manual(values = c(rgb(0, 0, 0, 0.3), "red"))
  
# map
b <- ggplot(orstationc2, aes(lon, lat))  +
  geom_polygon(aes(long, lat, group = group), orcounty_gg, color = "gray50", fill = NA) +
  geom_point(aes(lon, lat, color = select_pts), size = 1.0 ) +
  theme_bw() + 
  theme(legend.position = "none") +
  coord_quickmap() + scale_color_manual(values = c("gray", "red"))

grid.arrange(a, b, nrow = 1)

And here is the code for a variable-value selection:

# variable-value selection
cutpoint <- 0.2
v <- orstationc2$pjulpjan
v <- (v-min(v))/(max(v)-min(v))
orstationc2$select_pts <- factor(ifelse(v >= cutpoint, 1, 0))

# pcp
a <- ggparcoord(data = orstationc2[order(orstationc2$select_pts),],
                columns = c(1:10), groupColumn = "select_pts",
                scale = "uniminmax", alphaLines=0.3) + ylab("") +
  theme(axis.text.x  = element_text(angle=315, vjust=1.0, hjust=0.0, size=8),
        axis.title.x = element_blank(), 
        axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        legend.position = "none") +
  scale_color_manual(values = c(rgb(0, 0, 0, 0.3), "red"))

# map
b <- ggplot(orstationc2, aes(lon, lat))  +
  geom_polygon(aes(long, lat, group = group), orcounty_gg, color = "gray50", fill = NA) +
  geom_point(aes(lon, lat, color = select_pts), size = 1.0 ) +
  theme_bw() + 
  theme(legend.position = "none") +
  coord_quickmap() + scale_color_manual(values = c("gray", "red"))

grid.arrange(a, b, nrow = 1, ncol = 2)

Experiment with the longitudinal cutoff, as well as different variables for selecting points.

Q10: Describe the seasonality of precipitation and temperature across the state. Don’t worry if that’s hard to put into words–data-analytical visualizations are possibly the best instance of the “One picture is worth a thousand words.” maxim.