NOTE: This page has been revised
for the 2024 version of the course, but there may be some additional
edits.
Geography 4/590: R for Earth-System Science
Winter 2024
Exercise Using R
Finish by Monday, Jan 29
1. Introduction
The aim of this exercise is to use R for doing some simple analyses of “typical” Earth-system science data sets (with “typical” in quotes because the data sets aren’t as big as many). The exercise parallels the lecture presentation, and generally involves the simple modification of existing code (to, say, plot a different variable) rather than the creation of new code. (But modification of existing code is generally what one does.)
Note that the code below can be copied using the little clipboard links, and pasted into RStudio. For this first exercise, opening a new R script file (*.R) will probably be sufficient, but you may wish to also experiment with an RMarkdown “Notebook” file, or a full RMarkdown file. In RStudio, a new file can be created with the File > New File menu, or from the “new” icon on the toolbar. The first thing to do is to save the file with a sensible name, and in a sensible place (where you can find it again).
2. Read data sets
The first task is usually to read a data set, and this first part of the exercise simply repeats the lecture material. There are two basic sources for data to be read into R: 1) data (e.g. a .csv file or netCDF) on disk (or from the web), and 2) an existing R workspace (i.e. a *.RData file) that might contain, in addition to the input data, functions or intermediate results, for example.
Read a .csv file
Although it is possible to browse to a particular file in order to open it (i.e. using the file.choose() function), for reproducibility, it’s better to explicitly specify the source (i.e. path) of the data. The following code reads a .csv file (IPCC-RF.csv) that contains radiative forcing values for various controls of global climate since 1750 CE, expressed in terms of Wm-2 contributions to the Earth’s preindustrial energy balance. The data can be downloaded from https://pages.uoregon.edu/bartlein//RESS/csv_files/IPCC_RF.csv by right-clicking on the link, and saving to a suitable directory.
Here, the file is assumed to have been downloaded to the folder /Users/bartlein/projects/geog490/data/csv_files/. (Note that this folder is not necessarily the default working directory of R, which can be found using the function getwd(), or set using setwd("directory_name"), where "directory_name" is the name of the directory you wish to set the working directory to.)
This is an example of reading the radiative forcing .csv file. Don’t actually run this; the data will be read in along with a lot of other files in the next step.
# read a .csv file
csv_path <- "/Users/bartlein/projects/geog490/data/csv_files/"
csv_name <- "IPCC-RF.csv"
csv_file <- paste(csv_path, csv_name, sep="")
IPCC_RF <- read.csv(csv_file)
Load a saved .RData file (from the web)
An alternative way of reading data is to read in an entire workspace (*.RData file) at once. The following code reads those data from the web, by opening a “connection” to a URL, and then using the load() function (which also can be used to load a workspace file from a local folder).
# load data from a saved .RData file
con <- url("https://pjbartlein.github.io/REarthSysSci/data/RData/geog490.RData")
load(file=con)
Note how the “Environment” tab in RStudio has become populated with individual objects.
3. Data set summaries
A quick look at the data can be gotten using the str() (“structure”) function, and a “five-number” (plus the mean) summary of the data can be gotten using the summary() function. Here the data are the “SPECMAP” oxygen-isotope data that (more-or-less) represent the size of the ice sheets over the past 800,000 yrs, along with the values of insolation (incoming solar radiation) at 65N.
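For example, a minimal sketch (assuming the data frame is named specmap, as in the plotting code below):
# quick looks at the SPECMAP data
str(specmap)
summary(specmap)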
Other summary functions provide other basic information on a data set:
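For example (a sketch using standard base-R functions):
names(specmap)   # variable names
dim(specmap)     # number of rows and columns
head(specmap)    # first few rows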
4. Simple visualization
Univariate enumerative plots – Index plots
Here is the code for generating the index plot of oxygen-isotope values from the specmap data set.
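A minimal sketch (assuming the oxygen-isotope variable is named O18, as in the questions and code below; attaching the data frame makes the bare variable names used in later chunks available):
# attach the specmap data frame, then make an index plot of O18
attach(specmap)
plot(O18)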
And here’s the first task/question:
Q1: Modify the above code to plot the variable specmap$Insol instead of specmap$O18, and describe the pattern of variability in that plot relative to the plot of O18: Is it similar? More regularly periodic? Completely random?
The answers don’t have to be elaborate, but should be full sentences, with capitalization, punctuation, etc.
Univariate summary plots – stripcharts, histograms and density plots
Here is the code for producing stripcharts, histograms and density
plots of the O18
data.
# stacked stripchart
stripchart(O18, method="stack", pch=16, cex=0.5) # stack points to reduce overlap
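And here is a sketch of a histogram (the breaks argument controls the number of bins, which Q2 below asks you to experiment with):
# histogram of O18
hist(O18, breaks = 20)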
Q2: Modify the code above to produce a stripchart or histogram for Insol, being sure to experiment with the number of breaks. What is the distribution of Insol like relative to that of O18?
Here’s the code for a density plot:
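A minimal sketch:
# density plot of O18
O18_density <- density(O18)
plot(O18_density)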
Q3: Once again, modify the code to get a density plot for Insol. Is the density plot consistent with the information provided by the histogram?
After finishing with an attached data set, it’s good practice to detach it using detach():
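# detach the specmap data frame
detach(specmap)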
Univariate summary plots – boxplots
Before working on the cirques dataframe, attach it, and get the names of the variables:
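A sketch of those two steps:
# attach the cirques data frame and list its variable names
attach(cirques)
names(cirques)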
This will produce a quick-and-dirty map of the data:
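One simple way to do that (a sketch, assuming the longitude and latitude variables are named Lon and Lat, as in the scatter-plot code in the next section):
# quick-and-dirty map: plot the cirque locations
plot(Lon, Lat, pch=16, cex=0.8)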
Here is the code for producing boxplots of elevation, as a function of glaciation at present:
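A minimal sketch (assuming the glaciated/not-glaciated indicator is the Glacier variable used in the scatter-plot code below):
# boxplots of elevation by glaciation
boxplot(Elev ~ Glacier)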
Q4: Modify the code above to produce paired (not-glaciated/glaciated) boxplots for longitude and latitude. Keeping in mind that longitude increases (i.e. becomes less negative) to the east, describe the spatial distribution of glaciated cirques relative to unglaciated ones.
5. Bivariate plots
Scatter diagrams
Here is the code for plotting cirque elevation as a function of longitude (i.e. longitude on the x-axis, elevation on the y-axis).
# cirques: Elev vs. Lon
plot(Elev ~ Lon, type="n")
points(Elev ~ Lon, pch=16, col=as.factor(Glacier))
Q5: Examine the relationship between cirque elevation and latitude. Is there any kind of systematic relationship, or is the relationship completely random?
6. Enhanced (ggplot2) graphics
First, load some packages:
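For example (the three packages described below are sf, RColorBrewer, and classInt; ggplot2 is included here on the assumption that it has not already been loaded, since it is used for the plots that follow):
# load packages
library(sf)
library(RColorBrewer)
library(classInt)
library(ggplot2)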
The first package is the sf package, which handles spatial data, RColorBrewer implements Cindy Brewer’s color scales, and the third, classInt, is a utility package that facilitates recoding continuous numerical data into factors (or class intervals) for mapping.
Before continuing, it would be efficient to calculate a new variable, the July:January precipitation ratio for the Oregon climate-station data set. This can be conveniently done as follows, where the names() function is used to verify that the variable has been added. (Note that if the workspace is not saved at the end of the current R session, this new variable will go away.)
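A sketch of that calculation, assuming the July and January precipitation variables in orstationc are named pjul and pjan:
# July:January precipitation ratio
orstationc$pjulpjan <- orstationc$pjul / orstationc$pjan
names(orstationc)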
Here is some code for plotting the new variable:
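One possibility is a histogram (a sketch; other univariate plots would also work):
# histogram of the July:January precipitation ratio
hist(orstationc$pjulpjan, breaks = 20)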
Q6: Describe the distribution of orstationc$pjulpjan. Is the distribution symmetric or not? Does the distribution of the precipitation ratio explain the particular scale cutpoints that have been used to map the precipitation-ratio variable?
Plot the precipitation-ratio data again, this time transformed by taking the logarithms (base 10) of the data:
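For example:
# histogram of the log10-transformed precipitation ratio
hist(log10(orstationc$pjulpjan), breaks = 20)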
Q7: What does the log transformation do to the distribution of the precipitation-ratio data?
Next, take a look at the relationship between the precipitation-ratio variable and elevation.
## ggplot2 scatter diagram of Oregon climate-station data
ggplot(orstationc, aes(elev, pjulpjan)) + geom_point() + geom_smooth(color = "blue") +
  xlab("Elevation (m)") + ylab("July:January Precipitation Ratio") +
  ggtitle("Oregon Climate Station Data")
Q8: Describe the relationship. One issue that might be influencing this relationship is that there is a strong relationship between longitude and elevation across the state. How might that be addressed by simple plots?
7. Maps with ggplot2
The next code chunk plots the locations of the Oregon climate stations:
# ggplot2 map of Oregon climate-station locations
ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat), size = 5.0, color = "gray50", pch=16) +
  labs(x = "Longitude", y = "Latitude") +
  theme_bw()
There are two components to this map. The first is the basemap outlines of Oregon counties, which are contained in the orotl_sf “simple features” object. The contents of orotl_sf can be viewed by simply typing orotl_sf at the command line in the Console window, or mapped using the following:
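A minimal sketch:
# map the county outlines
ggplot() + geom_sf(data = orotl_sf, fill = NA) + theme_bw()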
The second component is the orstations_sf object, which contains the climate data itself:
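For example, list the first few rows (a sketch):
# look at the climate-station data
head(orstations_sf)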
Here’s a map of annual precipitation (orstations_sf$pann):
# recode the annual precipitation data to a factor
cutpts <- c(0,200,500,1000,2000,9999)
plot_factor <- factor(findInterval(orstations_sf$pann, cutpts))
nclr <- 5
plotclr <- brewer.pal(nclr+1,"BuPu")[2:(nclr+1)]
# get the map
ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat, color = plot_factor), size = 5.0, pch=16) +
  scale_colour_manual(values=plotclr, aesthetics = "colour",
    labels = c("0 - 200", "200 - 500", "500 - 1000", "1000 - 2000", "> 2000"),
    name = "Ann. Precip.", drop=TRUE) +
  labs(x = "Longitude", y = "Latitude") +
  theme_bw()
The following code will plot the July:January precipitation-ratio data using ggplot2. First, recode the data to a factor variable indicating the particular class interval each data value falls in. (And note the (quasi-)logarithmic scale used.)
# recode pjulpjan to a factor (note: cutpoints are general for Western N.A.)
cutpts <- c(0.0, .100, .200, .500, .800, 1.0, 1.25, 2.0, 5.0, 10.0)
pjulpjan_factor <- factor(findInterval(orstationc$pjulpjan, cutpts))
head(cbind(orstationc$pjulpjan, pjulpjan_factor, cutpts[pjulpjan_factor]))
Here is code for the map:
# ggplot2 map of pjulpjan
ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat, color = pjulpjan_factor), size = 5.0, pch=16) +
  scale_color_brewer(type = "div", palette = "PRGn", aesthetics = "color", direction = 1,
    labels = c("0.0 - 0.1", "0.1 - 0.2", "0.2 - 0.5", "0.5 - 0.8", "0.8 - 1.0",
               "1.0 - 1.25", "1.25 - 2.0", "2.0 - 5.0", "5.0 - 10.0", "> 10.0"),
    name = "Jul:Jan Ppt. Ratio") +
  labs(x = "Longitude", y = "Latitude") +
  theme_bw()
Note that the Purple to Green color scale spans the range of precipitation-ratio data for the state, as opposed to the whole of the western U.S.
Q9: Describe the pattern of July:January precipitation ratios across the state.
8. Univariate statistics
There are actually two versions of the Oregon climate-station data in the geog490.RData database: orstations_sf is a “simple features” spatial object (and data.frame), while orstationc is a simple data frame, read in from a .csv file. The difference can be noted using the class() and head() functions:
# classes of orstations_sf and orstationc
class(orstations_sf)
head(orstations_sf)
class(orstationc)
head(orstationc)
Generally, if there is any mapping to be done, the sf (simple features) spatial data object is the preferred way to store the data, but if the analysis focuses only on simple analyses, the data.frame/spreadsheet format is fine.
Here is code for summarizing the data.frame version of the Oregon climate-station data, which has spatial information in it, but is not a spatial object like orstations_sf.
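A minimal sketch:
# summary of the Oregon climate-station data frame
summary(orstationc)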
9. Bivariate descriptive statistics (correlations)
The following code produces a correlation matrix and a corrplot() graphical depiction of the matrix. Note the use of the round() function to reduce the number of digits in the correlations. “Unwrap” the cor() function to see what the correlation matrix looks like unrounded.
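For example, the rounded correlation matrix itself might be produced like this (a sketch, assuming columns 2 through 10 hold the numeric climate variables, as in the corrplot() call below):
# correlation matrix, rounded
round(cor(orstationc[2:10]), 3)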
# corrplot of correlation matrix
library(corrplot)
corrplot(cor(orstationc[2:10]), method = "color")
10. Parallel-coordinate plots
Next, take a look at some parallel-coordinate plots for the Oregon
climate-station data. First, create a new datafram, eliminating the
character and x- and y-coordinate variables from the data frame. The the
use of the “square-bracket” selection method to grab specific columns
from the orstationc
dataframe.
# create new data frame without text
orstationc2 <- data.frame(orstationc[2:10],orstationc[15])
names(orstationc2)
Parallel-coordinate plots provide an elegant way to look for structure in multiple-variable data sets, especially spatial data. The plots are constructed by rescaling (0 to 1) the values of the individual variables, and then connecting the values for individual cases or observations with a line, with each variable represented by a “column” in the plot. If there is any pattern (spatial or otherwise) in the data, groups of observations (individual lines) will show up, but if the data are simply random, the parallel-coordinate plot will look like a structureless jumble of lines.
Here is the code for a parallel-coordinate plot of the Oregon climate-station data:
# load library
library(GGally)
library(gridExtra)
# parallel coordinate plot
ggparcoord(data = orstationc2,
  scale = "uniminmax", alphaLines = 0.1) + ylab("") +
  theme(axis.text.x = element_text(angle=315, vjust=1.0, hjust=0.0, size=10),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(), axis.ticks.y = element_blank() )
The plot is a little messy, but a prominent “M” pattern emerges, shown by high January and annual temperatures and low July temperatures.
Next, explore the spatial patterns of the Oregon climate-station data (including the July:January precipitation-ratio variable). Here’s the code for a selection of points by latitude/longitude ranges (i.e. identifying stations in western Oregon):
# lon/lat window
lonmin <- -125.0; lonmax <- -122.0; latmin <- 42.0; latmax <- 49.0
lon <- orstationc2$lon; lat <- orstationc2$lat
orstationc2$select_pts <- factor(ifelse(lat >= latmin & lat <= latmax & lon >= lonmin & lon <= lonmax, 1, 0))
# pcp
a <- ggparcoord(data = orstationc2[order(orstationc2$select_pts),],
  columns = c(1:10), groupColumn = "select_pts",
  scale = "uniminmax", alphaLines=0.3) + ylab("") +
  theme(axis.text.x = element_text(angle=315, vjust=1.0, hjust=0.0, size=8),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(), axis.ticks.y = element_blank(),
    legend.position = "none") +
  scale_color_manual(values = c(rgb(0, 0, 0, 0.3), "red"))
# map
b <- ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat, color = orstationc2$select_pts), size = 4.0, pch=16) +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("gray", "red"))
grid.arrange(a, b, nrow = 1)
And here is the code for a variable-value selection, where the station (point) selection identifies stations with high July to January precipitation ratios:
# variable-value selection
cutpoint <- 0.2
v <- orstationc2$pjulpjan
v <- (v-min(v))/(max(v)-min(v))  # rescale to 0-1, as in the parallel-coordinate plot
orstationc2$select_pts <- factor(ifelse(v >= cutpoint, 1, 0))  # replace the previous selection
# pcp
a <- ggparcoord(data = orstationc2[order(orstationc2$select_pts),],
  columns = c(1:10), groupColumn = "select_pts",
  scale = "uniminmax", alphaLines=0.3) + ylab("") +
  theme(axis.text.x = element_text(angle=315, vjust=1.0, hjust=0.0, size=8),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(), axis.ticks.y = element_blank(),
    legend.position = "none") +
  scale_color_manual(values = c(rgb(0, 0, 0, 0.3), "red"))
# map
b <- ggplot() +
  geom_sf(data = orotl_sf, fill=NA) +
  geom_point(aes(orstations_sf$lon, orstations_sf$lat, color = orstationc2$select_pts), size = 4.0, pch=16) +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("gray", "red"))
grid.arrange(a, b, nrow = 1, ncol = 2)
Experiment with the longitudinal cutoff, as well as different variables for selecting points.
Q10: Describe the seasonality of precipitation and temperature across the state. Don’t worry if that’s hard to put into words; data-analytical visualizations are possibly the best instance of the “one picture is worth a thousand words” maxim.