1 Introduction

There is a broad group of multivariate analyses that have as their objective the organization of individual observations (objects, sites, grid points, individuals), and these analyses are built upon the concept of multivariate distances (expressed either as similarities or dissimilarities) among the objects.

The organization generally takes two forms:

These analyses share many concepts and techniques (both numerical and practical) with other procedures such as principal components analysis, numerical taxonomy, discriminant analysis and so on.

The analyses generally begin with the construction of an n x n matrix D of the distances between objects. For example, in a two-dimensional space, the elements dij of D could be the Euclidean distances between points,

dij = [(xi1- xj1)2 + (xi1- xj1)2]½

The Euclidean distance, and related measures are easily generalized to more than two dimensions.

2 A simple cluster analysis

The basic idea of cluster analysis can be illustrated using the Oregon climate-station dataset, with the goal being to identify groups of stations with similar climates, which also regionalizes the climate of the state. Begin by reading a shape file for mapping the data, and the data itself.

## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/bartlein/Projects/ESSD/data/shp_files/OR/orotl.shp", layer: "orotl"
## with 36 features
## It has 1 fields
##     station        lat             lon              elev              tjan             tjul      
##  MLB    : 2   Min.   :42.05   Min.   :-124.6   Min.   :   2.00   Min.   :-6.000   Min.   :12.20  
##  ANT    : 1   1st Qu.:43.60   1st Qu.:-123.2   1st Qu.:  94.25   1st Qu.:-1.225   1st Qu.:17.30  
##  ARL    : 1   Median :44.46   Median :-121.7   Median : 462.00   Median : 0.750   Median :19.20  
##  ASH    : 1   Mean   :44.32   Mean   :-121.3   Mean   : 556.99   Mean   : 1.417   Mean   :19.01  
##  AST    : 1   3rd Qu.:45.28   3rd Qu.:-119.6   3rd Qu.: 863.75   3rd Qu.: 4.150   3rd Qu.:20.20  
##  BAN    : 1   Max.   :46.15   Max.   :-117.0   Max.   :1974.00   Max.   : 8.400   Max.   :26.40  
##  (Other):85                                                                                      
##       tann             pjan             pjul            pann            idnum       
##  Min.   : 3.200   Min.   : 14.00   Min.   : 4.00   Min.   : 198.0   Min.   :350197  
##  1st Qu.: 8.700   1st Qu.: 39.75   1st Qu.: 7.00   1st Qu.: 299.0   1st Qu.:352308  
##  Median :10.400   Median : 87.50   Median : 9.00   Median : 531.0   Median :354564  
##  Mean   : 9.885   Mean   :141.67   Mean   :10.84   Mean   : 868.6   Mean   :354455  
##  3rd Qu.:11.200   3rd Qu.:239.25   3rd Qu.:12.25   3rd Qu.:1355.8   3rd Qu.:356663  
##  Max.   :12.600   Max.   :414.00   Max.   :34.00   Max.   :2517.0   Max.   :359316  
##                               Name   
##  ANTELOPE 1 N USA-OR            : 1  
##  ARLINGTON USA-OR               : 1  
##  ASHLAND 1 N USA-OR             : 1  
##  ASTORIA WSO           //R USA-O: 1  
##  BAKER FAA AIRPORT USA-OR       : 1  
##  BAKER KBKR USA-OR              : 1  
##  (Other)                        :86

Plot the station locations.

2.1 The cluster analysis

In this example, the groups of stations (or the climate regions of Oregon, as expressed in the climate-station data) will be defined via an “agglomerative-hierarchical” cluster analysis. In such an analysis, individual objects (observations) are combined according to their (dis)similarity, with the two most- similar objects being combined first (to create a new composite object), then the next two most similar objects (including the composites), and so on until a single object is defined. The sequence of combinations is illustrated using a “dendrogram” (cluster tree), and this can be examined to determine the number of clusters. The function cutree determines the cluster membership of each observation, which can be listed or plotted.

The first step in an agglomerative cluster analysis (although not always expicitly displayed), is the calculation of a dissimilarity matrix, which is a square symmetrical matrix that illustrates the dissimilarities (typically Euclidean distances) between individual cases or observations, using, in this case the six climate variables to define the space that distance is calculated in.

In the code above, for convience, a matrix X was created to omit the variables we don’t want to include in the clustering (i.e. the non-climatic information). The levelplot() function plots the dissimilarity matrix that the cluster analysis works with. The “plaid” as opposed to completely random appearance of the plot indicates that it is likely that natural groups will exist in the data. The dark diagonal line is formed from the Eucldean distance values of 0.0 from the comparison of each observation with itself.

Now do the cluster analysis:

The function hclust() performs a “Wards’s single-linkage” analysis, and the plot() function plots the dendrogram. Notice that the individual “leaves” of the plot are labeled by the climate-station names.

The dendrogram can be inspected to get an idea of how many distinct clusters there are (and it can also identify unusual points–look for Crater Lake CLK). As a first guess, let’s say there were three distinct clusters evident in the dendrogram. The cuttree() function can be used to “cut” the dendrogram at a particular level:

And then the cluster membership (i.e. which cluster?) of each station can be mapped.

(A listing of the cluster number that each station belongs to can be gotten by the following code.)

2.2 Examining the cluster analysis

The efficacy of the analyis in regionalizing or grouping the stations can be gauged usigns plots and a analysis that looks at the “separation” of the groups of stations.

##         1         2         3 
## -1.466667  4.000000  5.305263

Note how many of the histograms seem multimodal–this is a clear sign of inhomogeneity of the clusters. Let’s see how the clusters differ. This can be done via discriminant analysis, which (like PCA) defines some new variables with desireable properties, in this case, new variables that maximally “spread out” the groups along the new variable axes:

## Call:
## lda(clusternum ~ tjan + tjul + tann + pjan + pjul + pann)
## Prior probabilities of groups:
##         1         2         3 
## 0.5217391 0.2717391 0.2065217 
## Group means:
##        tjan     tjul      tann      pjan     pjul      pann
## 1 -1.466667 20.03958  8.835417  49.04167  9.00000  353.2708
## 2  4.000000 19.36000 11.312000 186.64000  7.88000 1080.1600
## 3  5.305263 15.95263 10.657895 316.52632 19.36842 1891.8947
## Coefficients of linear discriminants:
##               LD1          LD2
## tjan  0.106062251 -1.357953404
## tjul -0.213848899 -1.229994410
## tann  0.521178594  2.210153884
## pjan  0.010726336 -0.009918392
## pjul  0.023851475  0.112201911
## pann  0.001274417  0.001494494
## Proportion of trace:
##   LD1   LD2 
## 0.937 0.063

##             LD1         LD2
## tjan  0.9112303 -0.30450608
## tjul -0.5352598 -0.40124030
## tann  0.5313206 -0.51027660
## pjan  0.9720352  0.09629818
## pjul  0.5199545  0.73301330
## pann  0.9649685  0.16285890

The discriminant analysis here is used to answer the question “how do the clusters of stations differ climatically?” In this case, it looks like pann and tann are the variables most closely correlated with each discriminant function, and because each of these variables are more-or-less averages of the seasonal extreme variables, that might explain why the clusters seem inhomogeneous.

The analysis can be repeated, extracting different numbers of clusters. The easiest way to do this is to copy the three chunks of code above into a text editor, changing the assignment to nclust and including or excluding before executing each chunk, one at a time.

3 Distances: among observations and among variables

As noted above, distance matrices are interesting in their own right, as well as being input to a cluster analysis. In a typical rectangular data set, there are two distance matrices that could be calculated, one comparing variables (i.e. the columns in a data frame), and the other comparing observations (or rows). In the examples below the "TraCE 21ka decadal data are used. There are (96 x 48) 4608 grid points and 2204 observations. Read the raster brick in the netCDF file (Trace21_TREFHT_anm2.nc), and reshape it the usual way:

Notice the use of the transposition function (t()) to get the data in the proper “tidy” arrangement for a data frame (variables in columns, observations in rows). Distance matrix calcualation can take a while, but the parallelDist package can be used to greatly speed things up, by handing off the creation of different blocks of the matrix to differnt processors in multi-processor machines.

The parDist() function gets the dissimilarity martrix:

The dissimilarity martrix can be conveniently visualized by turning it into a raster, and then using the raster plot() function to visualize it.

Note the use of the maxpixels argument to avoid smoothing in the image. The rendering of the image is faster if the image is written to a file:

And here’s the image:

The image is sort of a map: The grid points are arranged in the data frame with longitudes varying rapidly and the sqrare dark area (low dissimilarities) in the middle is the tropics, where the spatial variability of climate is low, while the more spatially variable higher-latitude regions (at all longitudes) frame the low-variability region.

The dissimilarity matrix for variables is gotten in a similar fashion, but not the use of the transpose function t() to get the variables in rows, and observations in columns:

Here the temporal variability of is emphasized.

4 Heatmaps (2-way, time and space clustering)

“Heatmaps” are a visualization approach that is implemented in the base package of R. A heatmap basically consists of the raster-image type plot of the data frame, with the rows and columns of the matrix reaggranged using cluster analyses of the varables and observations. (Note that the cluster analyses are carried out individually, as oppsosed to simultaneously, an approach known as “biclustering”.) Typically, for small data sets, the matrix plot is augmented by dendrograms, and by “side bars” which plot some additional attribute of the data (location or time, for example) using a scatter plot, gradient plot, etc.

4.1 A simple example with Hovmöller-matrix data

A simple heatmap can be generated for a “Hovmöller matrix” calculated using the gridded data. The Hovmöller matrix contains longitudinal averages and is laid out with one row for each latitude and one column for each time.

4.1.3 Hovmöller-matrix heatmap (decimated decadal data)

Now here’s the heatmap:

The heatmap has rearranged the data according to the cluster analyses of the observations and variables, placing more similar data points next to one another. Notice how the side bars allow one to figure out the location and time of individual points. Here, and not surprisingly, there appear to be four distinct time-space clusters (or “regimes”) in the data): glacial vs. interglacial (Holocene) and high latitudes (in both hemispheres) vs. low latitudes.

4.3 Full-grid example

Finally, a heatmap of the gridded data (as opposed to the Hovmöller matrix data) can be obtained