NOTE: This page has been revised for Winter 2021, but may undergo further edits.
Geography 4/595: Geographic Data Analysis
Exercise 5: Some data wrangling and matrix algebra
Finish by Friday, February 12
The first aim of this exercise is to illustrate the idea of “data wrangling” or the reshaping or restructuring of input data into the “tidy” form (of a rectangular data set) with variables in columns and observations or cases in rows. The second part of the exercise consists of a few examples that illustrate the features and application of matrix algebra.
2. Data and packages
The concept of data wrangling, or the reshaping of a non-rectangular data set into a rectangular one, can be illustrated using a small sample of monthly climate data for Eugene. These data are not currently part of the geog495.RData workspace file (because they may be read in different ways, as data frames or “tibbles”), but they can be downloaded here:
The full data set can be downloaded from here: EugeneClim.csv
(Download these to your current working directory, which can be found using getwd().)
Also install the tidyverse package, which in turn installs a number of individual packages that are used in reshaping data.
# install the "tidyverse" suite of packages
install.packages("tidyverse")
# load the library
library(tidyverse)
Read in a typical “tidy” .csv file that has variables in columns and observations in rows. This can be done in the usual way using the read.csv() function, which creates a standard data frame, or, if the readr package has been loaded by library(tidyverse), using the read_csv() function, which creates a “tibble”. The data here are a “short” three-year-long subset of Eugene monthly climate data.
# read a .csv file using the `readr` package
csv_file <- "/Users/bartlein/Documents/geog495/data/csv/EugeneClim-short.csv"
eugclim <- read_csv(csv_file)
eugclim
(In the above, you would substitute the path to your working directory.)
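The difference between the two readers can be seen on a tiny file. The following is a minimal sketch, not part of the exercise data: the file is temporary, its values are made up, and the names tmp_csv, df_base and df_tbl are hypothetical. The readr step runs only if that package is installed.

```r
# write a tiny, made-up CSV to a temporary file (hypothetical values)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("year,mon,tavg", "2013,1,3.1", "2013,2,5.4"), tmp_csv)

# base R: read.csv() returns a standard data frame
df_base <- read.csv(tmp_csv)
print(class(df_base))

# readr (only if installed): read_csv() returns a tibble,
# which is a data frame with some extra classes attached
if (requireNamespace("readr", quietly = TRUE)) {
  df_tbl <- suppressMessages(readr::read_csv(tmp_csv))
  print(class(df_tbl))
}
```

Either object can be used with plot() and the `$` operator in the same way.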
Produce a few plots to look at the time series of individual variables, and to look at the annual cycle of each. The first use of plot() below plots the time series of monthly average temperature (tavg), while the second illustrates what the annual cycle looks like.
# time-series plot
plot(eugclim$tavg ~ eugclim$yrmn, type="o", pch=16, xaxp=c(2013, 2016, 3))
# by month
plot(eugclim$mon, eugclim$tavg, pch=16, xaxp=c(1, 12, 1))
Repeat the plots for some other variables, in particular prcp (monthly total precipitation). (See https://pjbartlein.github.io/GeogDataAnalysis/lec08.html#variables for a listing of variables.)
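As an optional complement to the by-month plots, the annual cycle can also be summarized numerically by averaging over years within each month. This is a minimal sketch using tapply() on made-up data: the data frame toy and its values are hypothetical and stand in for columns like mon and tavg, so the numbers say nothing about Eugene.

```r
# hypothetical three years of monthly values with a smooth annual cycle
toy <- data.frame(
  year = rep(2013:2015, each = 12),
  mon  = rep(1:12, times = 3)
)
toy$tavg <- 11 + 8 * cos(2 * pi * (toy$mon - 7) / 12) +
  rep(c(-0.5, 0, 0.5), each = 12)

# mean of tavg for each calendar month, averaged across the years
monthly_mean <- tapply(toy$tavg, toy$mon, mean)
round(monthly_mean, 2)
```

The same call could be applied to any variable in a tidy table that has a month column.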
Q1. Describe the annual cycles of the temperature and moisture-related variables. When during the year is it colder and when is it warmer, and when is it wetter and when is it drier?
3. Transforming (reshaping) an alternatively shaped table of data
An alternative layout for the data table (of just the precipitation-related variables) has the data arranged with variables in rows and months in columns. Read those data in:
# alternative layout of precipitation data
csv_file <- "/Users/bartlein/Documents/geog495/data/csv/EugeneClim-short-alt-pvars.csv"
eugclim_alt <- read_csv(csv_file)
eugclim_alt
Q2: Describe the different forms of the two tables (eugclim and eugclim_alt). Can you think of a way to produce a time-series plot of precipitation using the data in eugclim_alt? (If so, show the code for doing that, and if not, why not?)
Now use the gather() and spread() functions from the tidyr package to reshape the data. This is done here in two steps:
# reshape by gathering and spreading
# 1) gather
eugclim_alt2 <- gather(eugclim_alt, `1`:`12`, key="month", value="cases")
eugclim_alt2$month <- as.integer(eugclim_alt2$month)
eugclim_alt2
# 2) spread
eugclim_alt3 <- spread(eugclim_alt2, key="param", value=cases)
eugclim_alt3
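In current versions of tidyr, gather() and spread() are marked as superseded by pivot_longer() and pivot_wider(), although the older functions still work. The sketch below repeats the same two-step reshaping with the newer functions on a small made-up table. The data frame wide and its values are hypothetical, and only three month columns are used to keep the output short.

```r
library(tidyr)

# hypothetical table with variables in rows and months in columns
wide <- data.frame(
  year  = c(2013, 2013),
  param = c("prcp", "snow"),
  `1` = c(10, 1), `2` = c(20, 2), `3` = c(30, 3),
  check.names = FALSE
)

# 1) the month columns are stacked into a "month"/"cases" pair
long <- pivot_longer(wide, cols = `1`:`3`,
                     names_to = "month", values_to = "cases")
long$month <- as.integer(long$month)

# 2) each value of param becomes its own column
tidy <- pivot_wider(long, names_from = "param", values_from = "cases")
tidy
```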
Plot the reshaped data (eugclim_alt3) to verify that they indeed have been reshaped correctly.
# plot the reshaped data
eugclim_alt3$yrmn <- eugclim_alt3$year + (as.integer(eugclim_alt3$month)-1)/12
plot(eugclim_alt3$prcp ~ eugclim_alt3$yrmn, type="o", pch=16, col="blue", xaxp=c(2013, 2016, 3))
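The yrmn variable used for the time axis is a decimal year built from the year and the month. A minimal sketch of that arithmetic, using hypothetical vectors year and month rather than the exercise data:

```r
# hypothetical year and month values for one year of monthly data
year  <- rep(2013, 12)
month <- 1:12

# decimal year: January maps to 2013.0 and July to 2013.5
yrmn <- year + (month - 1) / 12
round(yrmn, 3)
```

Subtracting 1 from the month places each value at the start of its month, so the twelve points of a year stay within that year on the axis.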
Q3: What did the application of gather() do in creating eugclim_alt2, and what did the application of spread() do in creating eugclim_alt3?
Q4: What is the benefit of reshaping the data in R as opposed to simply doing that in Excel or a text editor?
4. A little matrix algebra
Create three matrices, A, B, and C:
# create three matrices
# default fill method: byrow = FALSE
A <- matrix(c(6, 9, 12, 13, 21, 5), nrow=3, ncol=2)
A
# same elements, but byrow = TRUE
B <- matrix(c(6, 9, 12, 13, 21, 5), nrow=3, ncol=2, byrow=TRUE)
B
# a third matrix
C <- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, ncol=3)
C
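The effect of the byrow argument is easiest to see when the elements are consecutive integers. This is a small sketch with hypothetical matrices M_col and M_row that are not used elsewhere in the exercise:

```r
v <- 1:6

# default (byrow = FALSE): elements fill down each column in turn
M_col <- matrix(v, nrow = 2, ncol = 3)
M_col

# byrow = TRUE: elements fill across each row in turn
M_row <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)
M_row

# both have the same shape; only the arrangement of elements differs
dim(M_col)
dim(M_row)
```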
Q5: Describe the shapes of the three matrices. (Note that the dim() function applied to a matrix (e.g. dim(A)) displays the number of rows and the number of columns in the matrix.)
Add A and B:
# matrix addition
F <- A + B
F
Now try adding A and C:
G <- A + C
Q6: What happened? Can A and C be added? Why not? (Again, the dim() function might be useful.)
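Matrix multiplication (as distinct from element-by-element multiplication) produces a new matrix in which each element is the sum of the products of the elements in a row of the first matrix and the corresponding elements in a column of the second, i.e. sums of squares and cross products of the elements of the matrices being multiplied (see matrix.pdf). Matrix multiplication uses %*% as the operator.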
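To make the row-by-column rule concrete, the sketch below multiplies two small made-up matrices and reproduces one element of the product by hand. The names X, Y, P and p11 are hypothetical and chosen to avoid clashing with the matrices used in the exercise.

```r
# hypothetical matrices: X is 2 x 3 and Y is 3 x 2
X <- matrix(1:6, nrow = 2, ncol = 3)
Y <- matrix(1:6, nrow = 3, ncol = 2)

# matrix product: the result has nrow(X) rows and ncol(Y) columns
P <- X %*% Y
P

# element [1, 1] is row 1 of X times column 1 of Y, summed
p11 <- sum(X[1, ] * Y[, 1])
p11

# element-by-element multiplication, by contrast, uses * and
# requires two matrices of the same shape
X * X
```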
“Postmultiply” the matrix C by A:
# matrix multiplication
Q <- C %*% A
Q
… and try to postmultiply A by B (e.g. T <- A %*% B)
Q7: What happens here? What are the dimensions of C? What does the message non-conformable arguments imply about the shapes of A and B?
To illustrate matrix inversion (i.e. the matrix algebra version of scalar division), a realistic matrix can be used, in this case the correlation matrix of the temperature variables in the orstationc data set:
# a realistic matrix, orstationc temperature-variable correlation matrix
R <- cor(cbind(orstationc$tjan, orstationc$tjul, orstationc$tann))
R
Get the inverse of R:
# matrix inversion
Rinv <- solve(R)
Rinv
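The sense in which inversion plays the role of division can be sketched with a small system of linear equations. The matrix M and vector b below are made up for illustration and are unrelated to orstationc.

```r
# hypothetical 2 x 2 coefficient matrix and right-hand side
M <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)
b <- c(5, 10)

# scalar analogy: a * x = b  gives  x = b / a
# matrix version: M %*% x = b  gives  x = solve(M) %*% b
x_inv <- solve(M) %*% b
x_inv

# solve(M, b) finds the same x without forming the inverse explicitly
x_direct <- solve(M, b)
x_direct
```

With one argument, solve() returns the inverse, and with two it solves the system directly.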
One property of the inverse matrix is that when it is pre- or postmultiplied by the original matrix, the identity matrix, I, should be produced.
Q8: Check to see if Rinv is indeed the inverse of R. (Show the results of the check.)