NOTE: This page has been revised
for the 2024 version of the course, but there may be some additional
edits.
Geography 4/590: R for Earth System Science
Winter 2024
Task 1: Getting and using R and RStudio
Finish by Tuesday, Jan 16 (or soon thereafter, at least before
Week 3)
1. Introduction
The object of this exercise is to install and set up R and RStudio, and to experiment with some basic procedures. R is actually a computer language (that is quite similar to the S language for data analysis and visualization developed at AT&T’s Bell Labs), but is best thought of as an “environment” for producing both numerical and graphical analyses of data. R has several advantages for us here, because
The home page for the “R project” is at http://www.r-project.org
Both the Mac and Windows versions of R have their own built-in GUIs (Graphical User Interfaces), but they are a little idiosyncratic. RStudio (https://www.rstudio.com) is a free and open-source environment for running R, and it looks and behaves virtually the same in both Windows and OS X or MacOS, and so it will be used throughout the course.
2. Getting R
There are several options for running R: 1) downloading R to an accessible “personal” machine, 2) running it in a lab, like SSIL, and 3) running R and RStudio in a Windows 10 virtual machine, which can be accessed from a browser. The latter works fine, but has several disadvantages that can be worked around (the main one being that data has to be copied somewhere at the end of a session, or it will be lost). Directions for using the virtual machine can be found here: Virtual Machine Instructions. If you will be using the virtual machine, it would be useful to still read the following, but you can skip sections 2 and 4 below.
R can be downloaded from one of the “CRAN” (Comprehensive R Archive Network) sites. In the US, the main site is at https://cran.r-project.org. To download R, go to a CRAN website, and look in the “Download and Install R” area. Click on the appropriate link.
Windows 10 and 11
Note: Depending on the age of your computer and version of Windows, you may be running either a “32-bit” or “64-bit” version of the Windows operating system. Since R 4.2.0, R for Windows has been provided only as a 64-bit build, so the installer will set up only the 64-bit version of R 4.3.2. The 32-bit version (R i386) that older installers added for backwards compatibility is no longer included.
MacOS
On the “R for Mac OS X” page (https://cran.r-project.org/bin/macosx/), there are multiple packages that could be downloaded, but there are two choices for “newer” versions of MacOS (there are also packages for older versions):
R-4.3.2-arm64.pkg if you’re running a Mac with an Apple Silicon chip,
R-4.3.2-x86_64.pkg if you’re running a Mac with an Intel chip.
Click on the appropriate link (e.g. the “R-4.3.2-arm64.pkg” link).
There are three somewhat technical “FAQ” pages that contain additional information that may be useful for working out the kinks.
3. Set Up R
Both the Windows and MacOS versions of R come with built-in GUIs (graphical user interfaces) that are broadly similar, but there are slight differences in how each works, and in what a “best practices” workflow and set of working folders looks like. These differences are obviated by using RStudio (see below). R uses a “working folder” to store its workspace (an .RData file, invisible on a Mac), script files (*.R), saved plots (as .pdf or .png files), etc.
Windows
To create a working folder, create a new folder (called “geog495” or something). Pick a sensible location for this folder; on Windows 10, probably in the c:/Users/xxxx/Documents/ folder (e.g. c:/Users/bartlein/Documents/). (Note that in “modern” versions of Windows, file paths can be constructed with forward slashes.)
Then create a subfolder in the geog495 folder you just created called data (File > New > Folder etc.). That folder will be empty at first.
On the Windows 10 virtual machine, the working folder is located in the Student_Data folder (see the summary below), and within it create a subfolder called data (File > New > Folder etc.). The folder will be empty at first.
MacOS
To create a working folder, the procedure is similar to that on Windows. A sensible location is the User/Documents folder (where User is your user name), or alternatively, User/Projects, because the /Documents folder can get pretty complicated over time.
Use File > New Folder to create a new folder in your User/Documents folder, and name this geog495. Within that folder, create a second new folder named data.
So to summarize, the working folders/directories might be:
C:/Users/userid/Documents/geog495/ (Windows)
R:/geog495_1/Student_Data/userid/ (virtual machine)
/Users/userid/Documents/geog495/ (MacOS)
And the /data folders should be
C:/Users/userid/Documents/geog495/data/ (Windows)
R:/geog495_1/Student_Data/userid/data/ (virtual machine)
/Users/userid/Documents/geog495/data/ (MacOS)
where userid is your specific userid.
In practice, it may be useful to create multiple working folders to keep different projects separate, and you’re free to name the folders anything you want. The rest of this exercise assumes that they were named as above.
4. Installing and Using RStudio
RStudio (http://www.rstudio.com) is an IDE (integrated development environment) that provides a consistent environment for running R across different platforms (i.e. Windows, OS X or MacOS, Linux). The “environment” contains four “panes”, two of which include the standard command-line “console” interface of R and a code or script editor that is generally more useful than those built into the standard R applications for Windows or the Mac, plus two other panes that provide a graphics window, help window, workspace summary and so on. The panes are tiled, and remain in the foreground, making it a little easier to navigate around the different windows that appear in the Windows and Mac applications. The IDE also provides other nice features that assist coding in general (like autocompletion), the report writing and documentation required to do “reproducible research”, and the development of R packages. RStudio is under continuous development (the current version is Version 1.3.1093), and so there are occasional problems that arise, but most are minor.
Installing RStudio is not too complicated. The RStudio page is at https://www.rstudio.com, and after a few clicks you can choose the version for the particular operating system (Windows, OS X or MacOS, several flavors of Linux) that you’re using.
Note that the specific version numbers below may change as RStudio is updated. Note also that RStudio.com is now posit.co.
Windows
Mac
Download the disk image file (e.g. RStudio-1.3.1093.dmg) by clicking on it. This will open a dialog box that asks if you want to open the file with the default DiskImageMounter application.
RStudio is flexible enough in its layout (Tools > Options > Pane Layout) that individual work habits can be accommodated. A typical layout might be:
Useful menus in RStudio include:
In practice, R scripts (*.R) can be opened or created in the script editing pane. Individual lines of code (or the whole script) can be “run” or sent to the console by selecting them, and clicking on the “Run” icon at the top of the script pane, or by selecting and then pressing Ctrl-Enter (Windows) or Command-Enter (Mac). The standard R command line can also be used in the R Console pane.
Graphical output can be viewed in a larger format by using the Zoom tool on the Plots pane.
Another feature of RStudio is its ability to create “R Notebook” and “R Markdown” documents that combine text, code and the results of executing that code, an element of what is known as “Reproducible Research”. This feature will be discussed further as the course goes on.
5. Starting RStudio
To start the RStudio “gui” (graphical user interface), click on the Start menu in Windows and type RStudio, or click on RStudio.app (Mac) in the Applications folder (which you can copy to the Dock).
After a brief pause, RStudio will open, and you should see a message like this in the Console pane:
R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin20 (64-bit)
…
6. Quitting R
There are several ways to quit R: clicking on the “close window” button, or choosing File > Quit Session from the RStudio menu. RStudio will ask if you want to save the current workspace image and any other files you’ve created or edited. In general, you’ll want to do that, but there are cases when you might not want to (e.g. you’ve accidentally deleted some intermediate results).
7. Getting Help
The first thing to do in learning new software is figure out how to get help. R has several approaches:
To get help on a particular function, e.g. quit(), type ?quit; you can also type help(quit). (Note that typing ?quit will be one of the few times in which a function (quit()) is typed without the parentheses.)
The main help pages can be opened by typing help.start() at the command line or using Help > R Help on the RStudio menu.
The key links on the help page are:
One of the issues with R is that error messages can be rather obscure. The most frequent sources of errors are simple typos, followed by those generated by copying and editing code. With time, you’ll develop a feel for what the error messages mean.
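As a small illustration of the help functions mentioned in this section, here is a minimal sketch. The calls that open help pages are wrapped in an interactive() check so that the lines can also be run from a script; apropos() is a base R function (not mentioned above) that is handy when only part of a function name is known, and csv_functions is just an illustrative variable name.

```r
# Open help pages (these display documentation when run interactively).
if (interactive()) {
  ?quit          # shorthand form
  help(quit)     # equivalent long form
  help.start()   # opens the main HTML help page
}

# When only part of a name is known, apropos() lists matching object names.
csv_functions <- apropos("^read\\.csv")
print(csv_functions)
```

Typing ?read.csv afterwards would then open the documentation for one of the functions found.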
8. Projects in RStudio
One very nice feature of RStudio is its ability to create Projects, which help a lot in keeping data (e.g. .csv-type text files, or R’s internal .RData format) and scripts (e.g. files that end in .R, or .Rmd) organized. Also, multiple Projects can reside on the same machine (or user account on the machine), which keeps separate pieces of work apart. Project folders can be created internally in RStudio, but it may be easier to create the folders outside of RStudio, and then use the File > New Project > Existing Directory dialog to browse to that folder or directory. (Note (don’t do this now): A useful folder or directory hierarchy would have two subfolders or directories in the working directory described above: one called R for code (the .RData workspace file, and *.R and *.Rmd source files), and the other called data to download data into. Then in the New Project dialog, one would browse to, say, c:/Users/bartlein/Documents/geog495/R/ (Windows), or /Users/userid/Documents/geog495/R/ (MacOS) to create the Project file (R.Rproj), and download data to c:/Users/bartlein/Documents/geog495/data/ (Windows), or /Users/userid/Documents/geog495/data/ (MacOS).) Projects can also be created on the virtual machine.
It’s also possible to easily begin where you left off, by browsing to a Project file, and simply clicking on it to start RStudio.
9. A Data Set
The Summit Cr. geomorphic data consists of 88 observations of 11 variables along a 0.8-km stretch of Summit Cr. in eastern Oregon. This data set was collected by Pat McDowell, Frank Magilligan and their students as part of their study of the effects of cattle “exclosures” on the morphology of stream channels. They divided this stretch of Summit Cr. into individual “hydrologic units” (HUs) that were either pools, shallow “riffles,” or straight “glides.” The overall study area is divided into three sections: an upstream reach (reach A) in which cattle are permitted to graze, a middle reach (reach B) from which cattle have been excluded, and a downstream reach (reach C), in which cattle were again permitted to graze.
The dataset contains the following information:
Col. | name | scale | R class | Definition |
---|---|---|---|---|
1 | Location | alphanumeric | character | ID for a particular cross section |
2 | Reach | nominal | factor | Reach (A=upstream reach(grazed); B=exclosure (no cattle); C=downstream (grazed)) |
3 | HU | nominal | factor | Hydrologic unit type (P=pool; R=riffle; G=“glide” or straightwater stretch) |
4 | CumLen | ratio | numeric | cumulative distance downstream from the upstream end of the study area (m) |
5 | Length | ratio | numeric | length of a hydrologic unit (m) |
6 | DepthWS | ratio | numeric | depth of the channel from the water surface to the bottom (m) |
7 | WidthWS | ratio | numeric | width of the channel at the water surface (m) |
8 | WidthBF | ratio | numeric | width of the channel at the bankfull stage (m) |
9 | HUAreaWS | ratio | numeric | area covered by the hydrologic unit at the water surface (sq m) |
10 | HUAreaBF | ratio | numeric | area covered by the hydrologic unit at the bankfull stage (sq m) |
11 | wsgrad | ratio | numeric | water-surface gradient (m/m, i.e. dimensionless) |
The above table is sometimes referred to as a “codebook” that provides an expanded definition for each variable. (There is a tradeoff between shortish variable names, which are efficient to type, and longish variable names that are more self-explanatory.)
10. Importing the Data Set
The “working directory” issue.
NOTE: The directions described below will only work if the file being downloaded is indeed downloaded to the /data folder in the working folder created earlier, and the folder is indeed the current working folder in R. If you wind up downloading the file somewhere else, like your /Downloads folder, you should move it into your working folder.
The current working folder can be discovered by typing the following in the Console window:
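The command that belongs here is most likely getwd(), the base R function that reports the current working directory. A minimal sketch (current_folder is just an illustrative variable name):

```r
# Report the folder R is currently working in.
current_folder <- getwd()
print(current_folder)
```

The printed path should match one of the working folders listed in Section 3; if it does not, change it as described next.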
If you’re not in the working directory, you can use the RStudio Session > Set Working Directory > Choose Working Directory... dialog. You can also “strong-arm” the change of the working directory using the setwd() function:
setwd("C:/Users/bartlein/Documents/geog495/data/") # Windows
setwd("C:\\Users\\bartlein\\Documents\\geog495\\data\\") # classical Windows
setwd("R:/geog495_1/Student_Data/userid/") # virtual machine
setwd("R:\\geog495_1\\Student_Data\\userid\\") # virtual machine -- classical Windows format
setwd("/Users/userid/Documents/geog495/") # macOS
Note the use of either the forward slash or double backslash in specifying the folder paths in Windows. (R uses a single backslash “\” as an escape character inside quoted text, and so the first backslash “escapes” the second, telling R to treat the combination like a single backslash.) It’s easier to use the forward-slash format.
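To see the escaping rule in action, here is a small sketch that writes the same Windows folder both ways and checks that they describe the same location. The path is the example folder used above with the placeholder userid; the variable names are illustrative only, and nothing here changes the working directory.

```r
# Two ways of writing the same Windows folder ("userid" is a placeholder).
forward <- "C:/Users/userid/Documents/geog495/data"
escaped <- "C:\\Users\\userid\\Documents\\geog495\\data"

# cat() shows the text as R stores it: each "\\" is a single backslash.
cat(escaped, "\n")

# A doubled backslash typed in code is one character in the stored text.
n_chars <- nchar("\\")
print(n_chars)

# Converting backslashes to forward slashes gives the first version back.
same_folder <- identical(gsub("\\", "/", escaped, fixed = TRUE), forward)
print(same_folder)
```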
Reading data
NOTE: Punctuation, spelling and case are important. R is case sensitive; in other words, Sumcr is not the same thing as sumcr, and Read.csv is not the same as read.csv.
R can read data from a number of different sources, including text (ASCII) data and the .csv (comma separated values) format of Excel spreadsheets, as well as from an internal format, which is text-based, but not easily readable by humans. R stores the data, names of variables, etc. in an efficient form in its workspace (.RData) that can be saved and reloaded.
At the time of this writing, the most efficient way to open and import a new data set is in .csv format, which can be downloaded from a web page, either the “data sets” page on the course web page, or from a link on one of the exercise pages like this one.
Importing a data set or shape file into R is a two-step procedure: 1) getting or downloading the data set from a server onto the computer you’re using, and 2) reading it into R.
To download the Summit Cr. data set (Step 1), save the file to the data subfolder in the working folder created above.
Recall that the data folders are:
C:/Users/userid/Documents/geog495/data/
R:/geog495_1/Student_Data/userid/data/
/Users/userid/Documents/geog495/data/
To read the Summit Cr. data set into R (Step 2), pass the full path of the downloaded file to the read.csv() function and assign the result to sumcr. The path differs among Windows, the virtual machine, and the Mac (see the data folders listed above), where, as usual, "userid" is your userid. Make sure that the file paths are bracketed by quotation marks.
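A sketch of what the read commands might look like follows, assuming the downloaded file is named sumcr.csv and sits in the data folder listed above. The file.exists() check is an addition for illustration, so that a mistyped path produces a readable message; csv_path is an illustrative variable name.

```r
# Full path to the downloaded file; edit "userid" and keep the line for your system.
csv_path <- "C:/Users/userid/Documents/geog495/data/sumcr.csv"    # Windows
# csv_path <- "R:/geog495_1/Student_Data/userid/data/sumcr.csv"   # virtual machine
# csv_path <- "/Users/userid/Documents/geog495/data/sumcr.csv"    # MacOS

# Read the file only if it is really there, so a wrong path gives a clear message.
if (file.exists(csv_path)) {
  sumcr <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path)
}
```

Once the path is known to be right, the single line sumcr <- read.csv(csv_path) is all that is needed.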
The read.csv() function creates a data frame “object” called “sumcr” that contains the data from the .csv file. Note that the data frame object doesn’t need to have the same name as the file, especially if the filename is complicated. The “<-” arrow is called the “assignment operator”, which, as it sounds, assigns whatever object is to its right to whatever object is to its left, sometimes creating a new object in the process. In reading a line of text, the operator is usually spoken as “gets” as in “the dataframe sumcr gets the contents of the sumcr.csv file.” In newer versions of R, the equals (=) sign can be used, but in most existing texts and .pdf files, the <- version is used.
The advantage of the download-first-then-read-in approach is that you have an Excel-editable copy of the data set in your working folder.
An alternative approach for reading data is to use the file.choose() function to browse to a particular file:
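The command was presumably of the following form, with file.choose() supplying the path that read.csv() then reads. The interactive() check is an addition so that the line is skipped when the code is run from a non-interactive script.

```r
# file.choose() opens a file-selection dialog, so run this only interactively.
if (interactive()) {
  sumcr <- read.csv(file.choose())
}
```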
This will open a “Select file…” dialog box. There’s a disadvantage to this approach in that it is not “reproducible”–at some later time, you may not be able to recall what file was read in to produce a particular result.
Looking at the data
The first thing to do is to check to see that R indeed has the Summit Cr. data frame in its workspace. This can be done by typing ls() (the list function) at the command line, or (Windows) clicking on Misc > List objects on the RGui menu.
The data frame can be examined by simply typing the name of the data frame at the command line (e.g. sumcr), which will create a lot of output, or by typing head(sumcr), which lists the first six lines (and guess what tail(sumcr) does).
The names() function can be used to get a list of the variables in a data frame, e.g.:
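The inspection functions just described can be tried on a tiny stand-in data frame. In the sketch below, demo and its values are invented placeholders (not the real Summit Cr. measurements); only the column names are borrowed from the codebook. In your own session, substitute sumcr for demo.

```r
# Tiny stand-in data frame; the values are invented placeholders, NOT the
# real Summit Cr. measurements. Use sumcr instead of demo in your session.
demo <- data.frame(
  Reach   = c("A", "A", "B", "B", "C", "C", "C", "C"),
  HU      = c("P", "R", "G", "P", "R", "G", "P", "R"),
  WidthWS = c(1, 2, 3, 4, 5, 6, 7, 8)
)

ls()                        # names of the objects in the workspace

first_rows <- head(demo)    # first six rows by default
last_rows  <- tail(demo)    # last six rows by default
var_names  <- names(demo)   # variable (column) names

print(first_rows)
print(var_names)
```

Both head() and tail() accept a second argument giving the number of rows, e.g. head(demo, 3).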
The individual variables are referred to by a “compound” name consisting of the data frame name and the variable name, joined by a dollar sign ($), e.g. sumcr$WidthWS. Note that variable names are case-sensitive too (e.g. the name sumcr$WidthWS is not the same as sumcr$widthws). This manner of referring to variables can be made less cumbersome by using the attach() function. For example, try typing WidthWS by itself (which should produce an error, because R cannot find the variable), and then sumcr$WidthWS (which should work). (Don’t type the material in parentheses, or the comments within a line, just the text in the Courier type face.) Then try typing attach(sumcr), press Enter, and now type WidthWS on the next line (should work ok now).
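The sequence above can be sketched with a small stand-in data frame. Here demo and its values are invented placeholders rather than the real data, and the detach() call at the end is an addition that undoes attach() once the short names are no longer needed.

```r
# Stand-in data frame with invented placeholder values (not the real data);
# in your own session use sumcr instead of demo.
demo <- data.frame(Reach = c("A", "B", "C"), WidthWS = c(1, 2, 3))

exists("WidthWS")             # FALSE in a fresh session: not visible on its own
demo$WidthWS                  # compound name: data frame, dollar sign, variable

attach(demo)                  # put the variables of demo on R's search path
mean_width <- mean(WidthWS)   # the short name now works
detach(demo)                  # undo attach() when finished

print(mean_width)
```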
11. Installing packages
R uses “packages” to add functionality to “base R”. Packages (or “libraries”, the S-language term) may include combinations of R code, Fortran and C. They must be installed, and then are loaded using the library() function. The lecture and task pages will demonstrate how this works as we go on. In the meantime, a minimal set of packages can be installed that will allow the initial lectures and demonstrations of R to be reproduced. To install a number of packages at once, copy and paste the following into the Console window, or copy into an R script and run the code.
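The course's own package list is not reproduced here, so the sketch below uses two placeholder package names purely to show the pattern: list the packages, find which are not yet installed, and install only those. Substitute the list given in the course materials; pkgs and missing_pkgs are illustrative variable names.

```r
# Placeholder package names for illustration only; substitute the list
# given in the course materials.
pkgs <- c("ggplot2", "sf")

# Find which of them are not yet installed.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install only what is missing (needs an internet connection).
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# After installation, load a package in each session with library(), e.g.
# library(ggplot2)
```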
A lot of output will be produced as the packages are downloaded, unpacked, etc. If you get a message about using a private or local library, reply “yes”.
12. What to hand in.
Use the summary() function to produce a quick summarization of the data set:
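For the hand-in, the command is presumably summary(sumcr). The sketch below shows what summary() does, using a stand-in data frame whose values are invented placeholders (not the real data).

```r
# Stand-in data frame with invented placeholder values (not the real data);
# in your own session run summary(sumcr) instead.
demo <- data.frame(Reach = c("A", "B", "C"), WidthWS = c(1, 2, 3))

width_summary <- summary(demo$WidthWS)   # min, quartiles, median, mean, max
print(width_summary)

print(summary(demo))                     # one summary per variable
```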
To hand it in, simply copy-and-paste it into a Canvas email.
To print the summary out, select the text, and click on the “print” icon, or use File > Print.
13. Quitting RStudio
R does not automatically save any script files you may have created or any updates that may have been made to .RData, but there are dialogs that should pop up when quitting RStudio. Quit RStudio using the File > Quit Session… menu. A dialog box will pop up saying “Quit R Session, Save workspace image to …” Click on “Save”, and likewise for any .R or .Rmd scripts you may have created.