1 Introduction

Although although visualization is often used simply to characterize individual variables, its use to understand relationship among variables is probably of greater value (see, for example, Wickham, H, and G. Grolemund, 2017, R for Data Science, O’Reilly UO library. The examples here reproduce some of the analyses in Daniau, A.-L., P.J. Bartlein, S.P. Harrison and 57 others, 2012, Predictability of biomass burning in response to climate changes, Global Biogeochemical Cycles 26, GB4007, doi:10.1029/2011GB004249, which focued on both paleo and present-day relationship between global biomass burning and climate. The present-day analysis focused on the prediction of satellite remote-sensing patterns of biomass burning from the GFEDv3.1 database (Global Fire Emissions, http://www.globalfiredata.org). The idea was to find a small number of predictors of the broadscale patterns of biomass burning, and it turned out that in both the paleo and present-day data sets, those predictors were mean annual temperature and precipitation minus evaporation (P-E, or pme). The analysis below recapitulates some of the analyses in the Daniau et al. paper.

Here, most of the analysis is done using the ggplot2 package for mapping and visualization.

2 Data

The input data consist of the GFEDv3.1 burned fraction (of half-degree grid cells, gfed), several bioclimatic variables (mean annual temperature (mat), mean temperature of the warmest and coldest months (mtwa and mtco respectively), growing degree days (5-deg C base, gdd5), annual precipitation (map), annual equilibrium evapotranspiration (aet), precipitation minus evapotranspiration (pme), lightning-strike frequency (hmrc), population density (gpw), net primary productivity (npp), treecover (treecov), and two categorical variables describing “megabiomes” and potential natural vegetation types (megabiome and vegtype respectively).

Load the necessary packages:

## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:raster':
## 
##     extract

Read the data from a pre-made .csv file.

## 'data.frame':    54642 obs. of  17 variables:
##  $ lon      : num  -73.8 -73.2 -72.8 -72.2 -71.8 ...
##  $ lat      : num  -52.8 -52.8 -52.8 -52.8 -52.8 ...
##  $ gfed     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mat      : num  6.54 5.72 5.16 5.51 5.53 ...
##  $ mtco     : num  3.1 1.9 1 1 0.8 ...
##  $ mtwa     : num  9.7 9.2 8.9 9.6 9.8 ...
##  $ gdd5     : num  690 543 466 569 598 ...
##  $ gpw      : num  1.78 1.78 1.78 1.78 1.78 ...
##  $ hrmc     : num  0.031 0.031 0.031 0.031 0 ...
##  $ map      : int  3992 3222 2485 1829 1217 687 381 309 297 284 ...
##  $ aet      : num  393 335 366 490 585 ...
##  $ alpha    : num  0.667 0.576 0.635 0.833 0.983 ...
##  $ pme      : num  3599 2887 2119 1339 632 ...
##  $ treecov  : int  0 0 0 0 39 0 0 23 0 0 ...
##  $ npp      : num  331 348 381 377 442 ...
##  $ megabiome: Factor w/ 10 levels "BorF","Dsrt",..: 5 5 5 5 5 5 5 1 1 1 ...
##  $ vegtype  : Factor w/ 15 levels "BrDFW","BrEFW",..: 4 6 4 5 5 5 5 5 5 5 ...
##       lon               lat              gfed                mat               mtco        
##  Min.   :-179.75   Min.   :-52.75   Min.   :0.0000000   Min.   :-22.808   Min.   :-51.600  
##  1st Qu.: -60.25   1st Qu.:  7.25   1st Qu.:0.0000000   1st Qu.: -1.358   1st Qu.:-19.100  
##  Median :  31.75   Median : 36.75   Median :0.0001737   Median : 10.946   Median :  0.100  
##  Mean   :  22.12   Mean   : 29.85   Mean   :0.0229601   Mean   : 10.068   Mean   : -1.871  
##  3rd Qu.:  93.75   3rd Qu.: 55.75   3rd Qu.:0.0051268   3rd Qu.: 23.292   3rd Qu.: 16.700  
##  Max.   : 179.75   Max.   : 80.25   Max.   :1.1285924   Max.   : 30.500   Max.   : 28.000  
##                                                                                            
##       mtwa            gdd5           gpw                hrmc             map              aet        
##  Min.   :-1.50   Min.   :   0   Min.   :   0.000   Min.   : 0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:15.60   1st Qu.: 895   1st Qu.:   0.203   1st Qu.: 1.037   1st Qu.: 273.0   1st Qu.: 229.7  
##  Median :22.20   Median :2737   Median :   2.135   Median : 3.112   Median : 500.0   Median : 385.6  
##  Mean   :21.48   Mean   :3667   Mean   :  38.516   Mean   : 7.221   Mean   : 709.6   Mean   : 474.2  
##  3rd Qu.:27.40   3rd Qu.:6678   3rd Qu.:  17.975   3rd Qu.:10.265   3rd Qu.: 927.0   3rd Qu.: 663.4  
##  Max.   :38.20   Max.   :9317   Max.   :7895.593   Max.   :98.868   Max.   :7531.0   Max.   :1538.3  
##                                                                                                      
##      alpha             pme              treecov          npp           megabiome        vegtype     
##  Min.   :0.0000   Min.   : -13.052   Min.   : 0.0   Min.   :   5.9   BorF   :11356   EDMFW  : 8787  
##  1st Qu.:0.4491   1st Qu.:   0.265   1st Qu.: 0.0   1st Qu.: 176.4   GrsShrb: 9372   Savan  : 6766  
##  Median :0.6678   Median :  66.514   Median : 0.0   Median : 340.6   TempF  : 7874   GrStp  : 5958  
##  Mean   :0.6003   Mean   : 235.368   Mean   :21.9   Mean   : 411.4   Dsrt   : 6728   Dsrt   : 5443  
##  3rd Qu.:0.7997   3rd Qu.: 257.029   3rd Qu.:44.0   3rd Qu.: 581.6   TropF  : 6678   TrEFW  : 5225  
##  Max.   :1.0000   Max.   :6073.094   Max.   :80.0   Max.   :1372.7   Tund   : 4896   ShrbO  : 4494  
##                                                                      (Other): 7738   (Other):17969

The two categorical (i.e. factor) variables have levels arranged in the default alphabetical order. The following code reorders the levels into something more ecologically and climatically sensible:

## 'data.frame':    54642 obs. of  2 variables:
##  $ megabiome: Factor w/ 10 levels "TropF","WarmF",..: 9 9 9 9 9 9 9 7 7 7 ...
##  $ vegtype  : Factor w/ 15 levels "TrEFW","TrDFW",..: 8 15 8 10 10 10 10 10 10 10 ...

3 Maps

It’s geneerally useful to map the data first. The following code implements ggplot2 maps. A shapefile of world outlines is read the usual way with the sf package, then converted to a ggplot2 compatible format using the fortify() function:

The map-pattern of the burned-fraction variable can be produced as follows:

4 Set up

4.1 Transformation

Simple inspection of the data and preliminary examination reveals that many of the variables have long-tailed distributions, and non-linear relationships. This can be verified using a faceted plot of histograms. Note the use of the dplyr gather() function to “stack” the variables, and the free_x argument to the `facet_wrap() function to allow each variable to plotted on individual as opposed to joint scales:

For example, burned fraction has the kind of distribution, that if zeros are removed and the values tranformed by taking logarithms, would be approximately normal.

Find the minimum non-zero value of burned fraction:

## [1] -7.618481

Tranform by taking logs (base 10), adding an increment slightly smaller than that minimum value, and check the distribution again.

A few other variables can be transformed in a similar fashion:

Check the distributions again:

4.2 Censor observations with unusual megabiome values

In the following analyses, the megabiome variable will be used to label points and to construct faceted plots. There are a few observations with megabiome values of “None” or “Ice”:

##  [1] "TropF"   "WarmF"   "SavDWd"  "GrsShrb" "Dsrt"    "TempF"   "BorF"    "Tund"    "None"    "Ice"
## 
##   TropF   WarmF  SavDWd GrsShrb    Dsrt   TempF    BorF    Tund    None     Ice 
##    6678    2331    4719    9372    6728    7874   11356    4896     638      50

These can be removed by creating a list of the megabiomes we want to include, and dropping the “None” and “Ice” observations. (The same could be done for the vegtype variable.)

## 
##   TropF   WarmF  SavDWd GrsShrb    Dsrt   TempF    BorF    Tund    None     Ice 
##    6678    2331    4719    9372    6728    7874   11356    4896       0       0

5 Controls of biomass burning

There is a continuum of ideas on what controls biomass burning, which has at one end climate, and the other ignition (lighting and humans). (It’s envogue to always discuss biomass burning from the perspective of controls as a triangle, no matter what the scale of analysis, and so a triangle here would be climate-humans-lightning, but it turns out that humans and lightning aren’t really significant at a global scale–lightning is well correlated with other climate variables, and humans suppress as many fires as they set.)

The goal here is to look at the relationships between burned fraction and the other available predictors. The particular mechanism is a ggplot2 plot, with added “smooths” (including both linear and non-linear (i.e. non-parametric) curves). Although the addition of smooth curves to what are ordinary scatter plots could be viewed as simple plot-summary mechanisms, the are actually the first step in coming up with a systematic relationshp between one of the potential predictors and the response (burned fraction). The following plots consist of simple scatter plots, supplemented by one of two smoothed curves: 1) a data-driven flexible curve, as implemented by a generalized additive model (GAM), which makes no assumption about the form of the relationship, other than it’s relatively smooth; and 2) a linear (i.e. OLS) regression line.

5.1 Bivariate relationship with burned fraction

The first set of plots consist of one-predictor-variable-at-a-time plots, staring with the basic moisture and temperature variables as preditors. Here, for example, is a plot of burned fraction vs. P-E, with a GAM curve added. That curve is specified by the argument (method = 'gam', formula = y ~ s(x, bs = "cs", k = 4 to the geom_smooth() function. The formula argument specifies a GAM with a cubic spline (with “shrinkage”), and relatively small number of “knots” (k=4), which yields a relative smooth curve, one more appropriate form summarizing as opposed to interpolating).

The relationship is evidently nonlinear; burned fraction is highest at intermediate levels of P-E. Here is a similar plot, with the Priestley-Taylor alpha (the ratio AE/PE) as a predictor:

The relationship is similar to that for P-E: burned fraction is low in the driest and the wettest environments, and highest at intermediate levels of moisure.

The other likely important predictor of biomass burning is mean annual temperature mat, which on a global scale controls biomass productivity, and hence fuel avaiabiliy.

The “curve” fit by the GAM model is not too curvy at all, and so it seems reasonable to summarize the relationship by a straight line model:

Other potential relationship can be explored as follows:

5.2 Influence of vegetation type on biomass burning

Vegetation obviously has an impact on biomass burning, with low values in regions of sparse vegetation, like deserts and tundra, and potentially higher values in regions of dense vegetation. The impact on the relationship between one of the potential predictors and burned fraction of vegetation type can be explored by point-labled plots or multi-panel plots.

(Note that in the above plot alpha is used ambiguously, both as a variable name and as a property of the plotted points (i.e. their tranparency).)

Other plots can be constructed:

5.3 A two-predictor Generalized Additive Model (GAM)

From the above, it seems that mean annual temperature mat and P-E pme (or AE/PE alpha) are the useful predictors of burned fraction (gfed). That raises the possibility of including both in a predictive model, which can be implemented using the mgcv package (which was also implicity used above).

Load the mgcv library, and fit a model with burned fraction as the response, and mat and pme as predictors:

## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:raster':
## 
##     getData
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## gf$gfed ~ te(mat, bs = "cs", k = 4) + te(pme, bs = "cs", k = 3)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.897832   0.007725  -504.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df    F p-value    
## te(mat) 2.981      3 1412  <2e-16 ***
## te(pme) 2.000      2 1059  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0989   Deviance explained =  9.9%
## GCV = 3.2199  Scale est. = 3.2196    n = 53954

The first-order diagnostic plot for examining the fit of a GAM model can be gotten with the gam.check() function (not shown here)

An informative diagnostic is provided by the partial residual plots for each variable, which summarize the bivariate relationship between the response and the predictors.

Finally, the bivariate relationship can be revealed by a 3-D response surface.

## Warning in persp.default(new_mat, new_pme, gam01_sfc, theta = -65, phi = 30, : surface extends beyond the
## box

The take-home message here is that burned fraction responds linearly to mean annual temperature, and cuvilinearly with respect to P-E, with (as we have already seen) highest values at intermediate levels of moisture.