There is a broad group of multivariate analyses that have as their objective the organization of individual observations (objects, sites, grid points, individuals), and these analyses are built upon the concept of multivariate distances (expressed either as similarities or dissimilarities) among the objects.
The organization generally takes two forms:
These analyses share many concepts and techniques (both numerical and practical) with other procedures such as principal components analysis, numerical taxonomy, discriminant analysis and so on.
The analyses generally begin with the construction of an n x n matrix D of the distances between objects. For example, in a two-dimensional space, the elements dij of D could be the Euclidean distances between points,
dij = [(xi1- xj1)2 + (xi1- xj1)2]½
The Euclidean distance, and related measures are easily generalized to more than two dimensions.
The basic idea of cluster analysis can be illustrated using the Oregon climate-station dataset, with the goal being to identify groups of stations with similar climates, which also regionalizes the climate of the state. Begin by reading a shape file for mapping the data, and the data itself.
# read county outlines
library(sp)
library(rgdal)
otl_path <- "/Users/bartlein/Projects/ESSD/data/shp_files/OR/"
otl_name <- "orotl.shp"
otl_file <- paste(otl_path, otl_name, sep="")
orotl_shp <- readOGR(otl_file)
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/bartlein/Projects/ESSD/data/shp_files/OR/orotl.shp", layer: "orotl"
## with 36 features
## It has 1 fields
# read stations
orstationc_path <- "/Users/bartlein/Projects/ESSD/data/csv_files/"
orstationc_name <- "orstationc.csv"
orstationc_file <- paste(orstationc_path, orstationc_name, sep="")
orstationc <- read.csv(orstationc_file)
summary(orstationc)
## station lat lon elev tjan tjul
## MLB : 2 Min. :42.05 Min. :-124.6 Min. : 2.00 Min. :-6.000 Min. :12.20
## ANT : 1 1st Qu.:43.60 1st Qu.:-123.2 1st Qu.: 94.25 1st Qu.:-1.225 1st Qu.:17.30
## ARL : 1 Median :44.46 Median :-121.7 Median : 462.00 Median : 0.750 Median :19.20
## ASH : 1 Mean :44.32 Mean :-121.3 Mean : 556.99 Mean : 1.417 Mean :19.01
## AST : 1 3rd Qu.:45.28 3rd Qu.:-119.6 3rd Qu.: 863.75 3rd Qu.: 4.150 3rd Qu.:20.20
## BAN : 1 Max. :46.15 Max. :-117.0 Max. :1974.00 Max. : 8.400 Max. :26.40
## (Other):85
## tann pjan pjul pann idnum
## Min. : 3.200 Min. : 14.00 Min. : 4.00 Min. : 198.0 Min. :350197
## 1st Qu.: 8.700 1st Qu.: 39.75 1st Qu.: 7.00 1st Qu.: 299.0 1st Qu.:352308
## Median :10.400 Median : 87.50 Median : 9.00 Median : 531.0 Median :354564
## Mean : 9.885 Mean :141.67 Mean :10.84 Mean : 868.6 Mean :354455
## 3rd Qu.:11.200 3rd Qu.:239.25 3rd Qu.:12.25 3rd Qu.:1355.8 3rd Qu.:356663
## Max. :12.600 Max. :414.00 Max. :34.00 Max. :2517.0 Max. :359316
##
## Name
## ANTELOPE 1 N USA-OR : 1
## ARLINGTON USA-OR : 1
## ASHLAND 1 N USA-OR : 1
## ASTORIA WSO //R USA-O: 1
## BAKER FAA AIRPORT USA-OR : 1
## BAKER KBKR USA-OR : 1
## (Other) :86
Plot the station locations.
# plot stations
plot(orotl_shp)
points(lon, lat, xlim=c(-125, -116), ylim=c(41,47),
xlab="Latitude", ylab="Longitude", type="n")
text(lon, lat, labels=station, cex=0.6)
In this example, the groups of stations (or the climate regions of Oregon, as expressed in the climate-station data) will be defined via an “agglomerative-hierarchical” cluster analysis. In such an analysis, individual objects (observations) are combined according to their (dis)similarity, with the two most- similar objects being combined first (to create a new composite object), then the next two most similar objects (including the composites), and so on until a single object is defined. The sequence of combinations is illustrated using a “dendrogram” (cluster tree), and this can be examined to determine the number of clusters. The function cutree determines the cluster membership of each observation, which can be listed or plotted.
The first step in an agglomerative cluster analysis (although not always expicitly displayed), is the calculation of a dissimilarity matrix, which is a square symmetrical matrix that illustrates the dissimilarities (typically Euclidean distances) between individual cases or observations, using, in this case the six climate variables to define the space that distance is calculated in.
# get dissimilarities (distances) and cluster
library(lattice)
X <- as.matrix(orstationc[,5:10])
X_dist <- dist(scale(X))
grid <- expand.grid(obs1 = seq(1:92), obs2 = seq(1:92))
levelplot(as.matrix(X_dist)/max(X_dist) ~ obs1*obs2, data=grid,
col.regions = gray(seq(0,1,by=0.0625)), xlab="observation", ylab="observation")
In the code above, for convience, a matrix X was created to omit the variables we don’t want to include in the clustering (i.e. the non-climatic information). The levelplot() function plots the dissimilarity matrix that the cluster analysis works with. The “plaid” as opposed to completely random appearance of the plot indicates that it is likely that natural groups will exist in the data. The dark diagonal line is formed from the Eucldean distance values of 0.0 from the comparison of each observation with itself.
Now do the cluster analysis:
# cluster analysis
X_clusters <- hclust(X_dist, method = "ward.D2")
plot(X_clusters, labels=station, cex=0.5)
The function hclust()
performs a “Wards’s single-linkage” analysis, and the plot()
function plots the dendrogram. Notice that the individual “leaves” of the plot are labeled by the climate-station names.
The dendrogram can be inspected to get an idea of how many distinct clusters there are (and it can also identify unusual points–look for Crater Lake CLK
). As a first guess, let’s say there were three distinct clusters evident in the dendrogram. The cuttree()
function can be used to “cut” the dendrogram at a particular level:
# cut dendrogram and plot points in each cluster
nclust <- 3
clusternum <- cutree(X_clusters, k=nclust)
And then the cluster membership (i.e. which cluster?) of each station can be mapped.
# plot cluster membership of each station
plot(lon, lat, xlim=c(-125, -116), ylim=c(41,47), xlab="Latitude", ylab="Longitude",
main=paste(as.character(nclust), "Clusters"), type="n")
plot(orotl_shp, add=T)
text(lon, lat, labels=clusternum, col="red")
(A listing of the cluster number that each station belongs to can be gotten by the following code.)
The efficacy of the analyis in regionalizing or grouping the stations can be gauged usigns plots and a analysis that looks at the “separation” of the groups of stations.
## 1 2 3
## -1.466667 4.000000 5.305263
Note how many of the histograms seem multimodal–this is a clear sign of inhomogeneity of the clusters. Let’s see how the clusters differ. This can be done via discriminant analysis, which (like PCA) defines some new variables with desireable properties, in this case, new variables that maximally “spread out” the groups along the new variable axes:
# discriminant analysis (chunk 3)
library(MASS)
cluster_lda1 <- lda(clusternum ~ tjan + tjul + tann + pjan + pjul + pann)
cluster_lda1
## Call:
## lda(clusternum ~ tjan + tjul + tann + pjan + pjul + pann)
##
## Prior probabilities of groups:
## 1 2 3
## 0.5217391 0.2717391 0.2065217
##
## Group means:
## tjan tjul tann pjan pjul pann
## 1 -1.466667 20.03958 8.835417 49.04167 9.00000 353.2708
## 2 4.000000 19.36000 11.312000 186.64000 7.88000 1080.1600
## 3 5.305263 15.95263 10.657895 316.52632 19.36842 1891.8947
##
## Coefficients of linear discriminants:
## LD1 LD2
## tjan 0.106062251 -1.357953404
## tjul -0.213848899 -1.229994410
## tann 0.521178594 2.210153884
## pjan 0.010726336 -0.009918392
## pjul 0.023851475 0.112201911
## pann 0.001274417 0.001494494
##
## Proportion of trace:
## LD1 LD2
## 0.937 0.063
cluster_dscore <- predict(cluster_lda1, dimen=nclust)$x
for (j in 1:(nclust-1)) {
boxplot(cluster_dscore[,j] ~ clusternum, xlab=paste("LD", as.character(j), sep=""))
}
## LD1 LD2
## tjan 0.9112303 -0.30450608
## tjul -0.5352598 -0.40124030
## tann 0.5313206 -0.51027660
## pjan 0.9720352 0.09629818
## pjul 0.5199545 0.73301330
## pann 0.9649685 0.16285890
The discriminant analysis here is used to answer the question “how do the clusters of stations differ climatically?” In this case, it looks like pann
and tann
are the variables most closely correlated with each discriminant function, and because each of these variables are more-or-less averages of the seasonal extreme variables, that might explain why the clusters seem inhomogeneous.
The analysis can be repeated, extracting different numbers of clusters. The easiest way to do this is to copy the three chunks of code above into a text editor, changing the assignment to nclust and including or excluding before executing each chunk, one at a time.
As noted above, distance matrices are interesting in their own right, as well as being input to a cluster analysis. In a typical rectangular data set, there are two distance matrices that could be calculated, one comparing variables (i.e. the columns in a data frame), and the other comparing observations (or rows). In the examples below the "TraCE 21ka decadal data are used. There are (96 x 48) 4608 grid points and 2204 observations. Read the raster brick in the netCDF file (Trace21_TREFHT_anm2.nc
), and reshape it the usual way:
# reshape the 3-d array
var_vec_long <- as.vector(var_array)
length(var_vec_long)
X <- t(matrix(var_vec_long, nrow = nlon * nlat, ncol = nt))
dim(X)
Notice the use of the transposition function (t()
) to get the data in the proper “tidy” arrangement for a data frame (variables in columns, observations in rows). Distance matrix calcualation can take a while, but the parallelDist
package can be used to greatly speed things up, by handing off the creation of different blocks of the matrix to differnt processors in multi-processor machines.
The parDist()
function gets the dissimilarity martrix:
The dissimilarity martrix can be conveniently visualized by turning it into a raster, and then using the raster plot()
function to visualize it.
raster_obs <- flip(raster(dist_obs, xmn=0, ymn=0, xmx=dim(dist_obs)[1], ymx=dim(dist_obs)[2]), direction=1)
raster_obs
Note the use of the maxpixels
argument to avoid smoothing in the image. The rendering of the image is faster if the image is written to a file:
png(dist_obs_pngfile, width=720, height=720) # open the file
plot(raster_obs, maxpixels=dim(dist_obs)[1]*dim(dist_obs)[2], col=rev(brewer.pal(9,"RdPu")))
dev.off()
And here’s the image:
The image is sort of a map: The grid points are arranged in the data frame with longitudes varying rapidly and the sqrare dark area (low dissimilarities) in the middle is the tropics, where the spatial variability of climate is low, while the more spatially variable higher-latitude regions (at all longitudes) frame the low-variability region.
The dissimilarity matrix for variables is gotten in a similar fashion, but not the use of the transpose function t()
to get the variables in rows, and observations in columns:
# variables (grid points)
dist_var <- as.matrix(parDist(t(X)))
dim(dist_var)
raster_var <- flip(raster(dist_var, xmn=0, ymn=0, xmx=dim(dist_var)[1], ymx=dim(dist_var)[2]), direction=1)
raster_var
png(dist_var_pngfile, width=720, height=720) # open the file
plot(raster_var, maxpixels=dim(dist_var)[1]*dim(dist_var)[2], col=rev(brewer.pal(9,"RdPu")))
dev.off()
Here the temporal variability of is emphasized.
“Heatmaps” are a visualization approach that is implemented in the base
package of R. A heatmap basically consists of the raster-image type plot of the data frame, with the rows and columns of the matrix reaggranged using cluster analyses of the varables and observations. (Note that the cluster analyses are carried out individually, as oppsosed to simultaneously, an approach known as “biclustering”.) Typically, for small data sets, the matrix plot is augmented by dendrograms, and by “side bars” which plot some additional attribute of the data (location or time, for example) using a scatter plot, gradient plot, etc.
A simple heatmap can be generated for a “Hovmöller matrix” calculated using the gridded data. The Hovmöller matrix contains longitudinal averages and is laid out with one row for each latitude and one column for each time.
Create the Hovmöller matrix:
# Hovmöller matrix
X <- matrix(0, nrow=nlat, ncol=nt)
for (n in (1:nt)) {
for (k in (1:nlat)) {
for (j in (1:nlon)) {
X[k, n] <- X[k, n] + var_array[j, k, n]
}
X[k, n] <- X[k,n ]/nlon
}
}
dim(X)
ncols <- dim(X)[2]
Set row and column names for the matrix.
# set row and column names
rownames(X) <- str_pad(sprintf("%.3f", lat), 5, pad="0")
head(rownames(X));tail(rownames(X)); length(rownames(X))
colnames(X) <- str_pad(sprintf("%.3f", plt_xvals), 5, pad="0")
head(colnames(X));tail(colnames(X)); length(colnames(X))
A matrix plot of the Hovmöller matrix will illustrate the data (and is an actual Hovmöller plot). To get high (northern) latitudes at the top of the plot, and high (southern) latitudes at the bottom, flip the data:
For this example, the data are decimated, by taking every 20th observation (to allow the dendrograms to be visualized):
More set up, Generate colors for the matrix pot and the side bars, which here will illustrate the latitude and time of each data point.
Produce the heatmap for the Hovmöller matrix:
png_file <- paste(file_label, "_heatmap_hov_01", ".png", sep="")
png(png_file, width=960, height=960)
system.time( X_heatmap <- heatmap.2(X, cexRow = 0.8, cexCol = 0.8, scale="none",
Rowv=NA, Colv=NA, dendrogram = "none", # just plot data, don't do clustering
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Latitude", key = TRUE, key.title = NA)
)
dev.off()
Notice the way the side bars work in illustrating the location and time of each data value.
Now here’s the heatmap:
png_file <- paste(file_label, "_heatmap_01", ".png", sep="")
png(png_file, width=960, height=960)
system.time( X_heatmap <- heatmap.2(X, cexRow = 0.8, cexCol = 0.8, scale="none",
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Latitude", key = TRUE, key.title = NA)
)
dev.off()
The heatmap has rearranged the data according to the cluster analyses of the observations and variables, placing more similar data points next to one another. Notice how the side bars allow one to figure out the location and time of individual points. Here, and not surprisingly, there appear to be four distinct time-space clusters (or “regimes”) in the data): glacial vs. interglacial (Holocene) and high latitudes (in both hemispheres) vs. low latitudes.
A second example is given using the decadal data directly. Here, there are too many points to make the dendrograms interpretable in any way, and so they are suppressed.
png_file <- paste(file_label, "_heatmap_hov_01", ".png", sep="")
png(png_file, width=960, height=960)
system.time( X_heatmap <- heatmap.2(X, cexRow = 0.8, cexCol = 0.8, scale="none", Rowv=NA, Colv=NA, dendrogram = "none",
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Latitude", key = TRUE, key.title = NA)
)
dev.off()
png_file <- paste(file_label, "_heatmap_01", ".png", sep="")
png(png_file, width=960, height=960)
system.time( X_heatmap <- heatmap.2(X, cexRow = 0.8, cexCol = 0.8, scale="none", # Rowv=NA, Colv=NA, dendrogram = "none",
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Latitude", key = TRUE, key.title = NA)
)
dev.off()
Finally, a heatmap of the gridded data (as opposed to the Hovmöller matrix data) can be obtained
There are several set up steps necessary. Reshape the array, and generate row and column names:
# reshape the 3-d array
var_vec_long <- as.vector(var_array)
length(var_vec_long)
X <- matrix(var_vec_long, nrow = nlon * nlat, ncol = nt) # Don't transform as usual..
dim(X)
ncols <- dim(x)[3]
# generaate row and column names
rownames(X) <- paste("E", str_pad(sprintf("%.2f", round(grid$lon, 2)), 5, pad="0"),
"N", str_pad(sprintf("%.2f", round(grid$lat, 2)), 5, pad="0"), sep="")
head(rownames(X), 20); tail(rownames(X), 20); length(rownames(X))
colnames(X) <- str_pad(sprintf("%.3f", t), 5, pad="0")
head(colnames(X)); tail(colnames(X)); length(colnames(X))
Flip the data…
… and generate colors:
Here’s a matrix plot of the data:
png_file <- paste(file_label, "_heatmap_hov_03", ".png", sep="")
png(png_file, width=960, height=960)
system.time( X_heatmap <- heatmap.2(X, cexRow = 0.8, cexCol = 0.8, scale="none",
Rowv=NA, Colv=NA, dendrogram = "none",
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Locations", key = TRUE, key.title = NA, labRow = FALSE, labCol = FALSE)
)
dev.off()
## user system elapsed
## 1700.08 42.51 1744.53
Heatmap of the grid-point data
png_file <- paste(file_label, "_heatmap_03", ".png", sep="")
png(png_file, width=960, height=960)
system.time( X_heatmap <- heatmap.2(X, cexRow = 0.8, cexCol = 0.8, scale="none", dendrogram = "none",
RowSideColors=rcol, ColSideColors=ccol, col=zcol, breaks=breaks, trace="none", density.info = "none",
lmat=rbind(c(6, 0, 5), c(0, 0, 2), c(4, 1, 3)), lwid=c(1.0, 0.2, 5.0), lhei = c(1.5, 0.2, 4.0),
xlab = "ka", ylab = "Locations", key = TRUE, key.title = NA, labRow = FALSE, labCol = FALSE)
)
## user system elapsed
## 643.89 42.26 688.10