Exercise 8

NOTE: This page has been revised for Winter 2021, but may undergo further edits.

Geography 4/595: Geographic Data Analysis
Winter 2021

Exercise 8: Multivariate Analysis
Finish by Friday, March 12

1. Introduction

The data set used for most of this exercise consists of a subset of the variables and observations in a data set from Drew Lamb’s thesis, observed for 120 streams in northeastern Oregon. The variables are defined as follows:

1) Stream – stream name                            8) SmLWD – no. small large woody debris ( LWD) per mile
2) DA –  drainage area                             9) LrgLWD – no. large woody debris (LWD) per mile
3) Health – stream “health” (factor)              10) PoolDepth – depth of pools                
4) ReachLen – length of the reach                 11) PoolArea – area of habitat units that are pools
5) Pools – number of pools per mile of stream     12) WDriff – width-to-depth ratio in pools  
6) DeepPools – number of deep pools per mile      13) PRratio – pool-to-riffle ratio  
7) LargePools – number of large pools per mile    14) BFWDratio -- bankfull width-to-depth ratio

There are two version of the data set: 1) the original values [streams4.csv] and 2) transformed values [tstreams4.csv]. The exercise will focus on the latter.

Read the .csv file the usual way, and name the resulting object tstreams. If you’ve loaded the geog495.RData workspace, tstreams should already be there.

NOTE: The variable names are the same in the two data sets, and so they both cannot be attached at the same time.

2. Multivariate plots

To start, download the .csv files to your working directory and read them into your workspace, if they’re not already there

# read the streams file (if not already in the workspace)
csv_path <- "/Users/bartlein/Documents/geog495/data/csv/streams"
streams <- read.csv(csv_path)

# read the transformed streams file (if not already in the workspace)
csv_path <- "/Users/bartlein/Documents/geog495/data/csv/tstreams"
tstreams <- read.csv(csv_path)

Get matrix scatter diagrams of the two data sets:

plot(streams[3:13], pch=16, cex=.5)
plot(tstreams[3:13], pch=16, cex=.5)

A few features should emerge, including the fact that for some variables, many zeros are observed (i.e. the count or frequency variables), and that some values for BFWDratio have evidently been estimated using drainage area. (How can you tell?) Notice that for some relationships, like that between PoolDepth and PoolArea the relationship is clearer in the transformed data set. Overall the transformed data set exhibits stronger linear relationships among variables, and so this one will be used below.

3. Principal Components Analysis (PCA)

Begin by doing a simple principal components analysis on variables 3 through 13 (i.e. just the continuous variables (columns 3 through 13), and not the factor Health). Select the variables to be analyzed, and add rownames to the resulting matrix.

# principal components analysis of (transformed) NE Oregon streams data
tstreams_matrix <- data.matrix(tstreams[,3:13])
# rownames combining stream name and U/H
rownames(tstreams_matrix) <- paste(as.character(tstreams[,2]), as.character(tstreams[,1]))

Use the psych package PCA function (principal()) to get the principal components.

library(psych)
tstreams_pca <- principal(tstreams_matrix, nfactors=4, rotate="none")

Now take a look at the results:

# display the results
tstreams_pca
summary(tstreams_pca)
VSS.scree(tstreams_matrix)
print(loadings(tstreams_pca),cutoff=0.0)
biplot(tstreams_pca$scores[,1:2], loadings(tstreams_pca)[,1:2],
  main="PCA Biplot -- 2 Components", cex=0.5)

Typing the name of the resulting PCA object (tstreams_pca) along with the application of summary() function produces information on the relative importance of the principal components. Important components are generally regarded as those with variances (eigenvalues) greater than 1.0. Printing the results of the application of the loadings() function produces a table that contains the correlations between the original variables and the new components, while the screeplot() function gives a graphic picture of the importance of the first few components.

The biplot() function shows both the observations (labeled by row number here) and the variables (represented by vectors) on the same display (i.e., a biplot).

The components of the biplot are generated this way: The “component scores” or the values of the first two new components are first plotted as a labeled scatter diagram, where the labels above are the rownames. (Other things could be plotted, like the observation number.) Here is how the biplot could be generated manually:

# biplot -- observations
plot(tstreams_pca$scores[,1:2], type="n")
text(tstreams_pca$scores[,1:2], labels=seq(1:length(tstreams[,1])))

Or another way:

# biplot -- observations
plot(tstreams_pca$scores[,1:2], cex = 0.2)
text(tstreams_pca$scores[,1:2], labels = rownames(tstreams_matrix), cex = 0.5)

Then the vectors are drawn by overlaying a scatter plot of the loadings of the first two components:

# biplot -- variables
plot(loadings(tstreams_pca)[,1:2])
text(loadings(tstreams_pca)[,1:2], labels=colnames(tstreams_matrix))

The network plot provided by the qgraph package can also be used to visualize the relationship between the components and the original variables. First, we’ll need a function to implement the two-stage construction of such a plot.

Load the qgraph library:

library(qgraph)

Here is the function:

qgraph_loadings_plot <- function(loadings_in, title) {
  ld <- loadings(loadings_in)
  qg_pca <- qgraph(ld, title=title, 
                   posCol = "darkgreen", negCol = "darkmagenta", arrows = FALSE, vTrans=256)
  qgraph(qg_pca, title=title,
         layout = "spring", 
         posCol = "darkgreen", negCol = "darkmagenta", arrows = FALSE,
         node.height=0.5, node.width=0.5, vTrans=256, edge.width=0.75, label.cex=1.0,
         width=7, height=5, normalize=TRUE, edge.width=0.75)
}

Copy the code, and either paste it into a script file and the “source” (i.e. run) it as usual, by selecting it and clicking on Run, or simply paste into the Console window at the command line.

The function has two arguments: 1) the object created by principal() (e.g. tstreams_pca), and 2) a title. Get the plots:

qgraph_loadings_plot(tstreams_pca, "PCA tstreams, unrotated components")

The qgraph() function generates a plain plot of the loadings, where the component (loadings) are represented by the numbered circles, the variables by squares labeled by abbreviations of the variable names, and the strength and sign of the loadings by colored links (magenta = negative; green = positive; and with the width of the arrow scaled to represent the magnitude of the loadings). The first plot simply arranges the components and variables around an ellipse. A more useful version of the plot can be generated using the layout="spring" argument, which forces the line segments representing the loadings to be scaled inversely proportionally to the magnitude of he loadings (shorter segments represent higher loadings).

Q1: How many important components are there? (See the screeplot.) Is there a natural breakpoint between the important and less-important components? Are there some variables on the qgraph() plot that are related only to a single component.

Q2: Does the biplot clearly reveal the “structure” in the data (i.e. groups of vectors at right-angles to one another), or is it ambiguous (vectors aimed all over the place)? Are the locations of the individual (numbered) observations on the biplot consistent with your interpretation of the components? (You’ll have to look at the data in some way to answer the last part.)

4. Simple Factor Analysis

Next, do a simple (unrotated) factor analysis of the same variables:

# factor analysis:  extract 4 factors, no rotation
tstreams_fa1 <- factanal(tstreams_matrix, factors=4, rotation="none",
  scores="regression")
tstreams_fa1 # list the results
print(loadings(tstreams_fa1),cutoff=0.7)
biplot(tstreams_fa1$scores[,1:2], loadings(tstreams_fa1)[,1:2],
  main="PCA/Factor Analysis Biplot -- 4 Components", cex=0.5)

And here’s the qgraph plot:

qgraph_loadings_plot(tstreams_fa1, "Factor Analysis tstreams, unrotated components")

Q3: Compare the results of the factor analysis to the principal components analysis. What looks similar, what looks different? Hint: Focus on the the two biplots, and the relative position of the vectors and points as opposed to their absolute positions on the diagrams.

5. Rotated Factors

Next, we’ll rotate the factors to try to make them more interpretable. It looks like the fourth component is not really important, so drop it.

# factor analysis with rotation
tstreams_fa2 <- factanal(tstreams_matrix, factors=3, rotation="varimax",
  scores="regression")
tstreams_fa2
print(loadings(tstreams_fa2),cutoff=0.7)
biplot(tstreams_fa2$scores[,1:2], loadings(tstreams_fa2)[,1:2],
  main="PCA/Factor Analysis Biplot -- 3 Rotated Components", cex=0.5)

qgraph_loadings_plot(tstreams_fa2, "Factor Analysis tstreams, 3 rotated components")

Q4: How do the loadings, biplot and qgraph change? Which result, the unrotated or the rotated factors seems more interpretable? Hint: a more complicated analysis might not be better, but then again, it might be.

Q5: What are the basic “dimensions” of the data? From the perspective of somebody administering a field measurement and monitoring program, are there any redundancies among the variables that might be eliminated?

6. Multivariate Analysis of Variance (MANOVA)

The status of the streams were described by the land managers in terms of their perceived “ecological health” this is coded here by the variable Health. A multivariate analysis of variance is aimed at answering the question “are the healthy (Health=H) and”less-healthy" (Health=U) groups of reaches similar in their fluvial geomorphic characteristics?

Take a look at some plots first, to get a subjective idea of how the groups of observations differ (histogram() is a lattice function, so load the lattice library first). Also attach streams (up to this point variables have been implicitly referred to as columns in data matrices).

library(lattice)
# attach streams
attach(streams)

Here’s set of histograms for PoolDepth, conditioned on Health.

# plot distributions by groups
histogram(~ PoolDepth | Health, nint=20, layout=c(1,2), aspect=1/2)

The nint, layout and aspect function arguments arrange a couple of nice histograms, plotted one above the other in order to make comparison of the groups easier than if the histograms were plotted side-by-side. Look at a number of variables this way to get an idea of what is going on in the data.

Next, take a look at a univariate analyses of variance (ANOVA), to look at the significance of the differences in means between groups. For one of the variables (e.g. PoolDepth):

# one-way analysis of variance

# test for homogeneity of group variances
tapply(PoolDepth, Health, var)
bartlett.test(PoolDepth ~ Health)

# analysis of variance
streams_aov1 <- aov(PoolDepth ~ Health)
summary(streams_aov1)

(Here’s a link to the guide for interpreting p-values. [interpstats.pdf])

The test for homogeneity of group variances is done first, because if the variances are significantly different (i.e. the p-value for the Bartlett test is close to zero), then it really doesn’t make sense to ask whether the means are different. In the analysis of variance, large F statistics (and consequently small p-values) signal significant differences between (or among) groups.

You could examine each variable in the data set this way, and it’s likely that some variables will appear to have means that are significantly different between groups, others that aren’t, and still others that may be borderline, and so it may be difficult to get and overall sense of whether the two groups of streams are different. Also, it’s likely that if enough pairwise comparisons are done, some “significant” results (1 in 20) will turn up even if the groups of observations have identical means.

Multivariate analysis of variance (MANOVA) provides a single-number test statistic that can be used to answer the question “overall, are the group means significantly different?”

# MANOVA
Y <- cbind(DA, ReachLen, Pools, DeepPools, LargePools, SmLWD, LrgLWD,
  PoolDepth, PoolArea, PRratio, BFWDratio)
streams_mva1 <- manova(Y ~ Health)
summary.aov(streams_mva1)
summary(streams_mva1, test="Wilks")

Note the creation of a new temporary variable (Y) that makes it efficient to apply the manova()function. Wilks lambda statistic can be interpreted as a multivariate generalization of the univariate F statistic. The summary.aov() function creates most of the output – individual ANOVAs of for the variables in the data set. The summary() function applied to a manova() object produces the Wilk’s lambda statistic.

Q6: Do the groups of observations differ significantly?

7. Discriminant Analysis

A discriminant analysis focuses on the issue of how two (or more) groups of observations differ (and also includes the sort of information provided by an analysis of variance on whether the groups are different). Whereas MANOVA answers the question “are the groups different?” discriminant analysis answers the question "how are they different.)

Do a simple discriminant analysis on these data, using the lda() function from the Venebles and Ripley MASS library:

# linear discriminant analysis
library(MASS)

health_lda1 <- lda(Health ~ DA + ReachLen + Pools + DeepPools + LargePools
    + SmLWD + LrgLWD + PoolDepth + PoolArea + PRratio + BFWDratio)

Note the formula, Health depends on the other variables. List and plot the results.

health_lda1
plot(health_lda1)

Because there are only two groups of observations here, there will be only one discriminant function.

health_dscore <- predict(health_lda1, dimen=1)$x
boxplot(health_dscore ~ Health)
cor(tstreams[3:13],health_dscore)

The predict(...)$x function gets the “discriminant scores” (the x's) which are assigned to the variable health_dscore, the boxplot() function plots them by group (and should show clear variations of the discriminant scores among groups if the groups are distinctly different), while the cor() function gets the correlations between the new “discrimimant function” and the original variables (called “canonical coefficients”), which may be interpreted like PCA loadings. These values show which variables are most closely correlated with the discriminant scores, and consequently which variables best illustrate the differences between or among the groups.

Examine the ability of the new discriminant function to correctly classify the healthiness of the stream reaches.

health_class <- predict(health_lda1, dimen=1)$class
class_table <- table(Health, health_class)
mosaicplot(class_table, color=T)
detach(streams)

The predict(...)$class function gets the group that the new discriminant function would assign each original observation to based on the values of the individual variables, the table() function provides a cross tabulation of the observed healthiness of the streams (Health), and that predicted by the discriminant function (health_class). The mosaic() function plots the table.

Q7: How do the two groups of observations differ? Would knowing the values of a few variables allow you to correctly classify the healthiness of streams for which stream health has yet to be measured in the field?

You can detach the streams data frame at this point.

8. Cluster analysis

Attempt to define the climate regions of Oregon (as expressed in the climate-station data in [orstationc.csv] via an “agglomerative-hierarchical” cluster analysis. In such an analysis, individual objects (observations) are combined according to their (dis)similarity, with the two most similar objects being combined first (to create a new composite object), then the next most similar objects, and so on until a single object is defined. The sequence of combinations is illustrated using a “dendrogram” (cluster tree), and this can be examined to determine the number of clusters. The function cutree determines the cluster membership of each observation, which can be listed or plotted.

Begin with a map, labeled by station name (see Exercise 7 for the shapefile components, or get them from the [data sets and shapefiles] web page.) The shapefile may also already be in your workspace.

# hierarchical clustering of Oregon climate station data
library(sf)
library(lattice)
attach(orstationc)

Plot the station abbreviations.

# plot station abbreviations
plot(st_geometry(orotl_sf), axes=TRUE)
text(lon, lat, labels=station, cex=0.7, col="red")
title("Oregon Climate Stations")

Calculate a dissimilarity (distance) matrix:

# get dissimilarities (distances) and cluster
X <- as.matrix(orstationc[,5:10])
X_dist <- dist(scale(X))
grid <- expand.grid(obs1 = seq(1:92), obs2 = seq(1:92))
levelplot(as.matrix(X_dist)/max(X_dist) ~ obs1*obs2, data=grid, 
    col.regions = gray(seq(0,1,by=0.0625)), xlab="observation", ylab="observation")

For convenience, a matrix, X, is created to omit the variables we don’t want to include in the clustering (i.e. the non-climatic information). The levelplot() function plots the dissimilarity matrix that the cluster analysis works with. Now do the cluster analysis:

# cluster analysis
X_clusters <- hclust(X_dist, method = "ward.D2")
plot(X_clusters, labels=station, cex=0.5)

The plot() function (or pltclust() function) plots the dendrogram.

The dendrogram can be inspected to get an idea of how many distinct clusters there are (and it can also identify unusual points–look for Crater Lake CLK). As a first guess, let’s say there were three distinct clusters evident in the dendrogram.

# cut dendrogram and plot points in each cluster (chunk 1)
nclust <- 3
clusternum <- cutree(X_clusters, k=nclust)
head(cbind(station, clusternum, as.character(Name)))
tail(cbind(station, clusternum, as.character(Name)))

Plot the cluster membership.

# plot cluster membership of each station
plot(st_geometry(orotl_sf), axes=TRUE)
text(lon, lat, labels=clusternum, col="red")
title("Oregon Climate Stations -- 3 clusters")

The cutree() function determines the cluster membership of each observation, and a map of the cluster membership of each station is generated. A list of the station id’s, the cluster number and name of the stations is provided by the print() function

Just looking at the map suggests that three clusters might do an ok job of assigning the stations to groups that have similar climates, but this should be checked.

# examine clusters -- simple plots (chunk 2)
tapply(tjan, clusternum, mean)
histogram(~ tjan | clusternum, nint=20, layout=c(1,3))

Note how many of the histograms seem multimodal–this is a clear sign of inhomogeneity of the clusters. Let’s see how the clusters differ.

# discriminant analysis (chunk 3)
library(MASS)
cluster_lda1 <- lda(clusternum ~ tjan + tjul + tann + pjan + pjul + pann)
cluster_lda1
plot(cluster_lda1)

cluster_dscore <- predict(cluster_lda1, dimen=nclust)$x
for (j in 1:(nclust-1)) {
  boxplot(cluster_dscore[,j] ~ clusternum, xlab=paste("LD", as.character(j), sep=""))
  }
cor(orstationc[5:10],cluster_dscore)

The discriminant analysis here is used to answer the question “how do the clusters of stations differ climatically?” In this case, it looks like pann and tann are the variables most closely correlated with each discriminant function, and because each of these variables are more-or-less averages of the seasonal extreme variables, that might explain why the clusters seem inhomogeneous.

The analysis can be repeated, extracting different numbers of clusters.

Q8: In this first iteration, 3 clusters were identified. Try varying this number between, say, 2 and 8 clusters. Describe the trade-off between the number of clusters and the homogeneity of the clusters. (This is the classical “regionalization” problem in geography.)

That’s it.