NOTE: This page has been revised for Winter 2021, but may undergo further edits.
Note: Ordinarily, learning how to download and “import” files into R/RStudio is an important part of climbing R’s steepish learning curve. To make it easier to replicate the lectures and to play with the code, here is a workaround that will load all of the individual data sets that are used in the lectures.
First, make sure that after starting RStudio, that the working directory is indeed the one created while doing Exercise 1. To reiterate, the directories are as follows:
First, install the
raster packages (which are required by some of the objects in the workspace):
# install the sp and raster packages install.packages("sp", repos = "http://cran.r-project.org") install.packages("raster", repos = "http://cran.r-project.org")
Then, clear the existing workspace by typing or copying the following in the
Console pane of RStudio:
# clear the current workspace rm(list=ls(all=TRUE))
WARNING: This will indeed remove everything in the current workspace. That will be ok, unless you’re in the middle of an exercise. Then, enter the following in the
Console pane. The code uses a “connection” to download data from a URL:
# connect to a saved workspace name geog495.RData and load it <- url("https://pjbartlein.github.io/GeogDataAnalysis/data/Rdata/geog495.RData") con load(file=con) close(con)
Note that this workspace will overwrite the existing one. You can check its contents using
ls(), or by typing
summary(sumcr). This workspace contains the shape files that are used in the exercises, so if you load it, you won’t have to download and read in the individual shape file components. See Section 6 of
Packages and data on the course web page
If the workspace has downloaded correctly, then you can skip the code chunks that read
.csv files (e.g.
In describing or characterizing the observations of an individual variable, there are three basic properties that are of interest:
Univariate plots provide one way to find out about those properties (and univariate descriptive statistics provide another).
There are two basic kinds of univariate, or one-variable-at-a-time plots,
Enumerative plots, in which all observations are shown, have the advantage of not losing any specific information–the values of the individual observations can be retrieved from the plot. The disadvantage of such plots arises when there are a large number of observations–it may be difficult to get an overall view of the properties of a variable. Enumerative plots do a fairly good job of displaying the location, dispersion and distribution of a variable, but may not allow a clear comparison of variables, one to another.
Recall that your data folders were:
(Note that in these lecture pages, the paths may appear a little differently.)
cities.csv file, using an explict path:
# read a .csv file <- "/Users/bartlein/Documents/geog495/data/csv/cities.csv" csvfile <- read.csv(csvfile)cities
(What’s happening above is that first the string object
csvfile is created by assigning (i.e. using the assignment operator
<-, which is pronounced “gets”) the string
/Users/bartlein/geog495/data/csv/cities.csv (the path to the file) and then this object is operated on the the
read.csv() function, to create the object
cities.) Adjust the path to reflect your own setup, on a Mac, Windows, or VM.
Get some summary information about the
cities “data frame”. The
names() function lists the variables or attributes:
# get the names of the variables names(cities)
##  "City" "Area" "Pop.1980" "Pop.1990" "Pop.2000" "Growth" "Food" "PersRoom" "Water" ##  "Elec" "Phones" "Vehicles"
The “structure” function (
str()) provides a short listing of variables and few values:
## 'data.frame': 21 obs. of 12 variables: ## $ City : chr "Buenos Aires" "Dhaka" "Rio de Janeiro" "Sao Paolo" ... ## $ Area : num 700 100 650 800 1680 ... ## $ Pop.1980: int 9918 3290 8789 12101 9029 11739 7268 6852 8067 9030 ... ## $ Pop.1990: int 11448 6578 10948 18119 10867 13447 9249 8633 12223 10741 ... ## $ Pop.2000: int 12822 11511 12162 22552 14366 17407 12508 10761 18142 12675 ... ## $ Growth : num 1.4 7.2 2.2 4.1 1.9 ... ## $ Food : int 40 63 26 50 52 55 52 47 57 60 ... ## $ PersRoom: num 1.3 2.4 0.8 0.8 1.2 ... ## $ Water : int 80 60 86 100 88 95 80 91 92 51 ... ## $ Elec : int 91 85 98 100 90 95 84 98 78 63 ... ## $ Phones : int 14 2 8 16 2 4 4 4 5 2 ... ## $ Vehicles: int 1000 600 4000 4000 308 148 200 939 588 500 ...
tail() functions list the first and last lines:
## City Area Pop.1980 Pop.1990 Pop.2000 Growth Food PersRoom Water Elec Phones Vehicles ## 1 Buenos Aires 700 9918 11448 12822 1.4 40 1.3 80 91 14 1000 ## 2 Dhaka 100 3290 6578 11511 7.2 63 2.4 60 85 2 600 ## 3 Rio de Janeiro 650 8789 10948 12162 2.2 26 0.8 86 98 8 4000 ## 4 Sao Paolo 800 12101 18119 22552 4.1 50 0.8 100 100 16 4000 ## 5 Beijing 1680 9029 10867 14366 1.9 52 1.2 88 90 2 308 ## 6 Shanghai 630 11739 13447 17407 1.4 55 2.0 95 95 4 148
## City Area Pop.1980 Pop.1990 Pop.2000 Growth Food PersRoom Water Elec Phones Vehicles ## 16 Mexico City 250 13888 15085 16190 0.8 41 1.9 92 97 6 2500 ## 17 Lagos 40 4385 7742 13480 5.8 58 5.8 47 53 1 500 ## 18 Karachi 353 5023 7943 11895 4.7 43 3.3 66 84 2 650 ## 19 Manila 64 5966 8882 12582 4.1 38 3.0 89 93 9 510 ## 20 Los Angeles 1660 9523 11456 13151 1.9 12 0.5 91 98 35 8000 ## 21 New York 359 15601 16056 16645 0.3 16 0.5 99 100 56 1780
“Enumerative plots” are called such because they enumerate or show every individual data point (in contrast to “summary plots”.)
Displays the values of a single variable for each observation using symbols, with values of the variable for each observation plotted relative to the observation number
# use large cities data [cities.csv], get an index plot attach(cities) plot(Pop.2000)
(Note the use of the
attach() function. An individual variable’s “full” name is the name of the dataframe concatentated with its “short” name, with a dollar sign in between, e.g.
attach() function puts the data frame in the first search position and allows one to refer to a variable just by its short name (e.g.
Displays the values of a single variable plotted as thin vertical lines
# another univariate plot plot(Pop.2000, type="h")
A variety of different versions of the standard univariate plot generated with the
plot() function can be generated using the
type = "l", "b", "o", "s", or "S"
It’s good practice when done with a data set to detach it:
# detach the cities dataframe detach(cities)
When data are in some kind of order (as in time), index values contain some useful information. Read and attach the
specmap.csv file, and then plot the delta-O18 (oxygen isotope) values.
# use Specmap delta-O18 data <- "/Users/bartlein/Documents/geog495/data/csv/specmap.csv" csvfile <- read.csv(csvfile)specmap
# attach specmap and plot attach(specmap) plot(O18)
In this data set, the large negative values indicate warm/less-ice conditions, and so it would be more appropriate to plot the values on an inverted y-axis, using the
# inverted y-axis plot(O18, ylim=c(2.5,-2.5)) # invert y-axis
Displays the values of a single variable as symbols plotted along a line
# stripchart stripchart(O18)
stripchart(O18, method="stack") # stack points to reduce overlap
# detach the specmap dataframe detach(specmap)
The Cleveland dot plot displays the values of a single variable as symbols plotted along a line, with a separate line for each observation. (Note that we reattach the data set first.)
# dotchart attach(cities) dotchart(Pop.2000, labels=City)
An alternative version of this plot, and the one most frequently used, can be constructed by sorting the rows of the data table. Sorting can be tricky–it is easy to completely rearrange a data set by sorting one variable and not the others. It is often better to leave the data unsorted, and to use an auxiliary variable (in this case index) to record the rank-order of the variable being plotted (in this case Pop.2000), and the explicit vector-element indexing of R to arrange the data in the right order:
# indexed (sorted) dotchart <- order(Pop.2000) index dotchart(Pop.2000[index], labels=City[index])
This example shows how to index or refer to specific values of a variable by specifying the subscripts of the observations involved (in square brackets
Once you’re done with a data set, it’s good to “detach” it to avoid conflict among variables from different data sets that might have the same name.
# detach the cites dataframe detach(cities)
Summary plots display an object or a graph that gives a more concise expression of the location, dispersion, and distribution of a variable than an enumerative plot, but this comes at the expense of some loss of information: In a summary plot, it is no longer possible to retrieve the individual data value, but this loss is usually matched by the gain in understanding that results from the efficient representation of the data. Summary plots generally prove to be much better than the enumerative plots in revealing the distribution of the data.
Read the two data sets if they are not already in the workspace or environment. The
scanvote data set will be explained below.
# adjust the paths to reflect the local environment <- "/Users/bartlein/Documents/geog495/data/csv/specmap.csv" csvfile <- read.csv(csvfile) specmap <- "/Users/bartlein/Documents/geog495/data/csv/scanvote.csv" csvfile <- read.csv(csvfile)scanvote
Histograms are a type of bar chart that displays the counts or relative frequencies of values falling in different class intervals or ranges.
# use Specmap O-18 data [specmap.csv] attach(specmap) # histogram hist(O18)
The overall impression one gets about the distribution of a variable depends somewhat on the way the histogram is constructed: fewer bars give a more generalized view, but may obscure details of the distribution (the existence of a bimodal distribution, for example), while more may not generalize enough. Plot a second histogram with 20 bins using the
# second histogram hist(O18, breaks=20)
A density plot is a plot of the local relative frequency or density of points along the number line or x-axis of a plot. The local density is determined by summing the individual “kernel” densities for each point. Where points occur more frequently, this sum, and consequently the local density, will be greater. Density plots get around some of the problems that histograms have, but still require some choices to be made.
Density plot of the O18 data. Note that in this example, an object
O18.density is created by the
density() function, and then plotted using the
# density plot <- density(O18) O18_density plot(O18_density)
Plots with both a histogram and density line can be created:
# histogram plus density plot <- density(O18) O18_density hist(O18, breaks=40, probability=TRUE) lines(O18_density) rug(O18)
Note the addtion of the “rug” plot at the bottom (which is an enumerative plot).
# detach the specmap dataframe detach(specmap)
A boxplot characterizes the location, dispersion and distribution of a variable by construction a box-like figure with a set of lines (whiskers) extending from the ends of the box. The edges of the box are drawn at the 25th and 75th percentiles of the data, and a line in the middle of the box marks the 50th percentile. The whiskers and other aspects of the boxplot are drawn in various ways.
# use Scandinavian EU-preference vote data attach(scanvote) # get a boxplot boxplot(Pop)
Note that the plot looks pretty odd, a function of the distribution of the population data. Typically, the log (base 10) of population data is more informative:
# second boxplot boxplot(log10(Pop))
There are a number of “theoretical” reference distributions that arise in data analysis that can be compared with observed or empirical distributions (i.e. of a set of observations of a particular variable) and used in other ways. One of the more frequently used reference distributions is the normal distribution (which arises frequently in practice owing to the Central Limit Theorem).
Theoretical distributions are represented by their:
For the standard normal distribution (with mean of 0 and a standard deviation of 1), the PDF and CDF can be displayed as follows:
# display the "normal" theoretical reference distribution <- seq(-3.0,3.0,.05) z <- dnorm(z) # get probability density function pdf_z plot(z, pdf_z)
<- pnorm(z) # get cumulative distribution function cdf_z plot(z, cdf_z)
and the inverse cumulative distribution function as follows:
# inverse cdf <- seq(0,1,.01) p <- qnorm(p) invcdf_z plot(p,invcdf_z)
A quantile plot is a two-dimensional graph where each observation is shown by a point, so strictly speaking, a QQ plot is an enumerative plot. The data value for each point is plotted along the vertical or y-axis, while the equivalent quantile (e.g. a percentile) value is plotted along the horizontal or x-axis. The quantiles plotted along the x-axis could be empirical ones, like the percentile equivalents or rank for each value, or they could be theoretical ones corresponding to the “p-values” of a reference distribution (e.g. a normal distribution) with the same parameters as the variable being examined. In practice, the shape of the QQ plot is the issue:
The qqnorm plot plots the data values along the y-axis, and p-values of the normal distribution along the x-axis. qqline adds a straight line that passes through the first and third quartiles (25th and 75th percentiles) and can be used to assess (a) the overall departure of the observed distribution from a normal distribution with the same parameters (mean and standard deviation) as the observations, and (b) outliers or unsual points.
# QQ plots qqnorm(Pop) qqline(Pop)