1 Introduction

This is a demonstration of how the locfit package can be used to replicate the local-regression composite-curve fitting procedure that is implemented in the smooth_curve.f90 family of Fortran programs. The current version of such Fortran programs is smooth_curve_ninwin.f90 which uses a hybrid approach for determining the window width for the fitted values that make up a composite curve: When there are more than minpts samples (usually 30) available for determining an individual fitted value, then the window half-width specified by the param value in the info file for smooth_curve_ninwin.f90 is used; when there are fewer than minpts values in the window, then the window is expanded until minpts samples fall within the window. This approach was invented for analyses of long charcoal records (e.g. Daniau et al., 2010; Mooney et al., 2011), and is kind of a blend between a constant bandwidth window and a “span” window. For most applications, like Marlon et al. (2012), Vannière et al. (2011) or Power et al. (2012), sample density is sufficient enough that the “span” approach is never invoked.

The examples here follow the analyses in Marlon et al. (2012) of western U.S. charcoal records. First, composite curves fit with the locfit package are compared with those fit by smooth_curve_ninwin.f90, and then the use of the locfit package to get bootstrap sample-by-site confidence intervals is illustrated.

2 Preliminaries

Begin by setting various paths to the data files (presmoothed or “binned” data in this case), and reading a “regions” file or list of sites to process:

# composite curve via the locfit package
# bootstrap-by-site confidence intervals

library(locfit)

# presampled file
csvpath <- "e:/Projects/GPWG/work/wus/data/influx/csv_files/wus2/wus2_presamp/"
csvname <- "_presamp_influx_zt2k_sm10.csv"

# regionfiles
regionpath <- "e:/Projects/GPWG/work/wus/data/influx/regions/wus2/"
regionname <- "wus2_sites.csv"
regionfile <- paste(regionpath, regionname, sep="")

2.1 Read the list of sites

# read the list of sites
wus_sites <- read.csv(regionfile)
head(wus_sites)

##   ID_SI     Lat       Lon Elev
## 1     1 44.6628 -110.6161 2530
## 2     2 44.9183 -110.3467 1884
## 3     3 45.7044 -114.9867 2250
## 4     4 45.8919 -114.2619 2300
## 5     5 46.3206 -114.6503 1770
## 6     6 45.8406 -113.4403 1921
##                                                                  Sitename
## 1  Cygnet                                                                
## 2  Slough Creek Pond                                                     
## 3  Burnt Knob                                                            
## 4  Baker                                                                 
## 5  Hoodoo                                                                
## 6  Pintlar

ns <- length(wus_sites$ID_SI)
ns

## [1] 69

2.2 Define matrices

Next, various matrices for storing input data and fitted values are defined. The parameter maxrecs is set to the largest number of samples any individual site is likely to have, and the matrices age and influx are defined to store the individual records.

# matrices for data and fitted values
maxrecs <- 1000
age <- matrix(NA, ncol=ns, nrow=maxrecs)
influx <- matrix(NA, ncol=ns, nrow=maxrecs)

2.3 Set target ages

Next, a set of “target” ages are defined, for which composite-curve fitted values will be obtained

targage <- data.frame(x=seq(-50,2000,10))
length(targage$x)

## [1] 206

A matrix yfit is defined that will hold the fitted values for each of the maxreps potential bootstrap replications.

maxreps <- 1000
yfit <- matrix(NA, nrow=length(targage$x), ncol=maxreps)

3 Read the data

3.1 Charcoal data

The ns sites with presample (binned) files are read by looping over the wus_sites list of site determined above.

# read and store the presample (binned) files as matrices of ages and influx values
ii <- 0
for (i in 1:ns) {
  sitenum <- wus_sites$ID_SI[i]
  # print(sitenum)
  siteidchar <- as.character(sitenum)
  if (sitenum >= 1) siteid <- paste("000", siteidchar, sep="")
  if (sitenum >= 10) siteid <- paste("00", siteidchar, sep="")
  if (sitenum >= 100) siteid <- paste("0", siteidchar, sep="")
  if (sitenum >= 1000) siteid <- paste(    siteidchar, sep="")
  
  inputfile <- paste(csvpath, siteid, csvname, sep="")
  # print(inputfile)
  
  indata <- read.csv(inputfile)
  nsamp <- length(indata$xtarg)
  if (nsamp > 0) {
    ii <- ii+1
    age[1:nsamp,ii] <- indata$xtarg
    influx[1:nsamp,ii] <- indata$yfit
  }
}
nsites <- ii
nsites

## [1] 66

Note that while there are 69 sites in the region, there are only 66 sites that have samples within the interval of interest in Marlon et al. (2012) (i.e. -60 to 3500 yr BP). Take a look at the first five sites:

head(age[,1:5]); head(influx[,1:5])

##      [,1] [,2] [,3] [,4] [,5]
## [1,]   40   40   30   20   40
## [2,]   50   50   40   30   50
## [3,]   90   60   60   50   60
## [4,]  100   70   70   60   70
## [5,]  130   80   90   80   80
## [6,]  140   90  100   90   90

##          [,1]     [,2]      [,3]      [,4]      [,5]
## [1,] 1.221428 1.874877 -2.026761 -3.293572 -1.178175
## [2,] 1.221428 2.737898 -2.026761 -3.293572 -1.011799
## [3,] 1.843045 3.745885  0.047784 -1.824767 -0.089942
## [4,] 1.843045 1.454438  0.047784 -1.824767  0.209888
## [5,] 1.751002 1.306028 -1.776078  0.429581  0.947839
## [6,] 1.751002 1.519223 -1.776078  0.429581  1.177048

3.2 Read fitted values from Fortran programs

For comparison with the fitted values calculated using the locfit package, read in the fitted values for various window widths calculated by smooth_curve_ninwin.f90.

# get standard lwr fits for comparison
csvfile <- "wus2_sites_boot_influx_zt2k_sm10_0100.csv"
wus_lwr100 <- read.csv(csvfile)
head(wus_lwr100)

##   targ_age influx_zt2k_sm10_0100    sefit   fitcu95   fitcl95  np influx_zt2k_sm10_0100_boot
## 1      -50             -0.311041 0.138363 -0.034315 -0.587767 365                  -0.294355
## 2      -40             -0.294545 0.112409 -0.069728 -0.519362 407                  -0.280888
## 3      -30             -0.292513 0.094485 -0.103543 -0.481483 449                  -0.281082
## 4      -20             -0.292387 0.082492 -0.127402 -0.457372 487                  -0.283797
## 5      -10             -0.284519 0.074327 -0.135865 -0.433172 529                  -0.279600
## 6        0             -0.266382 0.068264 -0.129854 -0.402910 575                  -0.264092
##   yfit_boot_median yfit_boot_se yfit_boot_cu95 yfit_boot_cl95 nsites window ninwin
## 1        -0.289340     0.010388       0.316723      -0.961526     20    100     56
## 2        -0.286513     0.009083       0.253231      -0.863503     24    110     56
## 3        -0.283450     0.008017       0.183805      -0.806999     25    120     56
## 4        -0.284105     0.007227       0.135947      -0.756348     27    130     56
## 5        -0.275095     0.006765       0.127057      -0.714212     34    140     56
## 6        -0.265919     0.006589       0.146592      -0.682071     37    150     56

csvfile <- "wus2_sites_boot_influx_zt2k_sm10_0050.csv"
wus_lwr50 <- read.csv(csvfile)
head(wus_lwr50)

##   targ_age influx_zt2k_sm10_0050    sefit   fitcu95   fitcl95  np influx_zt2k_sm10_0050_boot
## 1      -60             -0.377838 0.282355  0.186871 -0.942547 130                  -0.347443
## 2      -50             -0.367549 0.170339 -0.026871 -0.708228 167                  -0.346411
## 3      -40             -0.329515 0.119968 -0.089580 -0.569450 206                  -0.312186
## 4      -30             -0.292634 0.099149 -0.094337 -0.490931 243                  -0.279276
## 5      -20             -0.272030 0.089228 -0.093573 -0.450486 282                  -0.261759
## 6      -10             -0.269254 0.083461 -0.102332 -0.436177 325                  -0.260824
##   yfit_boot_median yfit_boot_se yfit_boot_cu95 yfit_boot_cl95 nsites window ninwin
## 1        -0.343890     0.012130       0.364576      -1.137482      0     40     31
## 2        -0.336702     0.010119       0.211341      -1.006843     20     50     39
## 3        -0.296076     0.009021       0.215313      -0.896109     24     60     44
## 4        -0.285605     0.008208       0.198580      -0.806842     25     70     45
## 5        -0.265827     0.007687       0.190423      -0.766989     27     80     46
## 6        -0.260845     0.007466       0.210857      -0.739243     34     90     52

csvfile <- "wus2_sites_boot_influx_zt2k_sm10_0025.csv"
wus_lwr25 <- read.csv(csvfile)
head(wus_lwr25)

##   targ_age influx_zt2k_sm10_0025    sefit   fitcu95   fitcl95  np influx_zt2k_sm10_0025_boot
## 1      -60             -0.415270 0.647755  0.880241 -1.710780  44               -9999.000000
## 2      -50             -0.345052 0.220051  0.095051 -0.785154  69                  -0.325112
## 3      -40             -0.331090 0.137191 -0.056707 -0.605472  96                  -0.317873
## 4      -30             -0.293802 0.124393 -0.045015 -0.542589 130                  -0.277065
## 5      -20             -0.255178 0.124455 -0.006268 -0.504088 147                  -0.233367
## 6      -10             -0.218485 0.124256  0.030027 -0.466996 162                  -0.199182
##   yfit_boot_median yfit_boot_se yfit_boot_cu95 yfit_boot_cl95 nsites window ninwin
## 1        -0.914580  -9.9990e+03       0.550864      -0.699302      0     10     24
## 2        -0.314217   1.0376e-02       0.285053      -1.000992     20     20     28
## 3        -0.308492   9.0650e-03       0.213352      -0.904349     24     30     31
## 4        -0.284772   9.0090e-03       0.274345      -0.853511     25     40     39
## 5        -0.238646   9.3490e-03       0.378778      -0.806725     27     40     44
## 6        -0.213442   9.1840e-03       0.378544      -0.782698     34     40     43

4 Composite curves via locfit

4.1 Reshape input data

The first step in applying the locfit() function is to reshape the ns by maxrecs matrices storing age and influx values into long vectors, while also removing NAs.

# reshape matrices into vectors (and drop NAs)
x <- na.omit(as.vector(age))
y <- na.omit(as.vector(influx))

# take a look at the resulting data
head(cbind(x,y),10)

##         x        y
##  [1,]  40 1.221428
##  [2,]  50 1.221428
##  [3,]  90 1.843045
##  [4,] 100 1.843045
##  [5,] 130 1.751002
##  [6,] 140 1.751002
##  [7,] 180 1.497879
##  [8,] 190 1.497879
##  [9,] 230 1.037331
## [10,] 240 1.037331

tail(cbind(x,y),10)

##             x        y
## [13799,] 3810 2.436500
## [13800,] 3840 1.917886
## [13801,] 3850 1.917886
## [13802,] 3880 1.580854
## [13803,] 3890 1.580854
## [13804,] 3920 2.570799
## [13805,] 3930 2.570799
## [13806,] 3960 2.636206
## [13807,] 3970 2.636206
## [13808,] 4000 5.292959

4.2 Initial locfit curve

Next, set a window half-width, and execute the locfit() function. For this first example, a (half) window width of 25 yrs is used:

# locfit window-width parameter
hw <- 25 # window half width (smoothing parameter)

# initial fit, unresampled (i.e. all) data
loc01 <- locfit(y ~ lp(x, deg=1, h=hw), maxk=200, family="qrgauss")
summary(loc01)

## Estimation type: Local Regression 
## 
## Call:
## locfit(formula = y ~ lp(x, deg = 1, h = hw), maxk = 200, family = "qrgauss")
## 
## Number of data points:  13808 
## Independent variables:  x 
## Evaluation structure: Rectangular Tree 
## Number of evaluation points:  257 
## Degree of fit:  1 
## Fitted Degrees of Freedom:  68.607

The various arguments that produce a fit similar to that provided by smooth_curve_ninwin.f90 are as follows: The argument deg=1 fits a local linear regression, with a (half) window width h of 25 years in this case (hw). The argument maxk=200 provides sufficient internal storage to allow determining fitted values at relatively small increments (10 years in this case). The family="qrgauss" argument invokes a “robustness” step in fitting, where (as in smooth_curve_ninwin.f90) the influence of outliers or unusual points are downweighted.

The fitted values are obtained using predict(), at the target years defined earlier. The fitted values calculated using smooth_curve_ninwin.f90 are added to the plot for comparison.

# get and plot fitted values
pred01 <- predict(loc01, newdata=targage, se.fit=TRUE)
wus_locfit <- data.frame(targage$x, pred01$fit)
fitname <- paste("locfit_",as.character(hw), sep="")
colnames(wus_locfit) <- c("age", fitname)
head(wus_locfit)

##   age  locfit_25
## 1 -50 -0.2864345
## 2 -40 -0.2901339
## 3 -30 -0.2520387
## 4 -20 -0.1936271
## 5 -10 -0.1721479
## 6   0 -0.2141500

plot(wus_locfit[,1], wus_locfit[,2], xlab="age", ylab=fitname, type="l", lwd=2, col="red")
# add fitted values from smooth_curve_ninwin.f90
lwrfitname <- paste("wus_lwr",as.character(hw), sep="")
lines(eval(as.name(lwrfitname))[,1], eval(as.name(lwrfitname))[,2], lwd=2, col="blue")

Figure 1. Comparison of locfit and Fortran curves, hw=25

The agreement is pretty damn good, with only a little difference at the end of the records, probably related to various ad-hoc decisions implemented in the lowess2_module.f90 module that smooth_curve_ninwin.f90 uses. (Could have saved a lot of time…)

Here are similar comparisons for window widths of 50…

hw <- 50 # window half width (smoothing parameter)
loc02 <- locfit(y ~ lp(x, deg=1, h=hw), maxk=200, family="qrgauss")
summary(loc02)

## Estimation type: Local Regression 
## 
## Call:
## locfit(formula = y ~ lp(x, deg = 1, h = hw), maxk = 200, family = "qrgauss")
## 
## Number of data points:  13808 
## Independent variables:  x 
## Evaluation structure: Rectangular Tree 
## Number of evaluation points:  129 
## Degree of fit:  1 
## Fitted Degrees of Freedom:  34.942

# get and plot fitted values
pred02 <- predict(loc02, newdata=targage, se.fit=TRUE)
wus_locfit <- data.frame(targage$x, pred02$fit)
fitname <- paste("locfit_",as.character(hw), sep="")
colnames(wus_locfit) <- c("age", fitname)
head(wus_locfit)

##   age  locfit_50
## 1 -50 -0.3247606
## 2 -40 -0.2868574
## 3 -30 -0.2562198
## 4 -20 -0.2381492
## 5 -10 -0.2342783
## 6   0 -0.2391529

plot(wus_locfit[,1], wus_locfit[,2], xlab="age", ylab=fitname, type="l", lwd=2, col="red")
# add fitted values from smooth_curve_ninwin.f90
lwrfitname <- paste("wus_lwr",as.character(hw), sep="")
lines(eval(as.name(lwrfitname))[,1], eval(as.name(lwrfitname))[,2], lwd=2, col="blue")

Figure 2. Comparison of locfit and Fortran curves, hw=50

and 100…

hw <- 100 # window half width (smoothing parameter)
loc03 <- locfit(y ~ lp(x, deg=1, h=hw), maxk=200, family="qrgauss")
summary(loc03)

## Estimation type: Local Regression 
## 
## Call:
## locfit(formula = y ~ lp(x, deg = 1, h = hw), maxk = 200, family = "qrgauss")
## 
## Number of data points:  13808 
## Independent variables:  x 
## Evaluation structure: Rectangular Tree 
## Number of evaluation points:  65 
## Degree of fit:  1 
## Fitted Degrees of Freedom:  18.156

# get and plot fitted values
pred03 <- predict(loc03, newdata=targage, se.fit=TRUE)
wus_locfit <- data.frame(targage$x, pred03$fit)
fitname <- paste("locfit_",as.character(hw), sep="")
colnames(wus_locfit) <- c("age", fitname)
head(wus_locfit)

##   age locfit_100
## 1 -50 -0.2548945
## 2 -40 -0.2536073
## 3 -30 -0.2522736
## 4 -20 -0.2487952
## 5 -10 -0.2410737
## 6   0 -0.2270108

plot(wus_locfit[,1], wus_locfit[,2], xlab="age", ylab=fitname, type="l", lwd=2, col="red")
# add fitted values from smooth_curve_ninwin.f90
lwrfitname <- paste("wus_lwr",as.character(hw), sep="")
lines(eval(as.name(lwrfitname))[,1], eval(as.name(lwrfitname))[,2], lwd=2, col="blue")

Figure 3. Comparison of locfit and Fortran curves, hw=100

At this point it’s pretty clear that “pure R” can be used to replicate the composite curve fitting done by the Fortran program.

5 Bootstrap-by-site confidence intervals

Bootstrap confidence intervals that reflect the influence of the particular collection of sites on the composite curve can be developed by repeatedly sampling (with replacement) the individual site records (as opposed to the usual bootstrapping approach where individual data values are sampled with replacement). Bootstrapping-by-site is appropriate here, because the issue is how sensitive the composite curve is to the particular collection of sites that contribute to the curve, as opposed to how sensitive the curve is to individual samples. (Because a robustness-iteration approach is used in applying the locfit() function, the influence of individual samples has already been minimized.)

5.1 Fittting and plotting

For the example here, we’ll use a (half) window width of 50. First, set up a plot in which the individual composite curves for each bootstrap replication will be plotted. There are multiple steps in the procedure:

# Step 1 -- Set up to plot individual replications
plot(x, y, xlab="age", ylab="locfit_50", ylim=c(-1,1), xlim=c(0,2000), type="n")

# Step 2 -- Do the bootstrap iterations, and plot each composite curve
hw <- 50
nreps <- 1000
set.seed(10) # do this to get the same sequence of random samples for each run

for (i in 1:nreps) {
  #print(i)
  randsitenum <- sample(seq(1:nsites), nsites, replace=TRUE)
  # print(head(randsitenum))
  x <- na.omit(as.vector(age[,randsitenum]))
  y <- na.omit(as.vector(influx[,randsitenum]))
  locboot <- locfit(y ~ lp(x, deg=1, h=hw), maxk=200, family="qrgauss")
  predboot <- predict(locboot, newdata=targage, se.fit=TRUE)
  yfit[,i] <- predboot$fit
  # note plotting lines is slowww
  lines(targage$x, yfit[,i], lwd=2, col=rgb(0.5,0.5,0.5,0.10))
  }

# Step 3 -- Plot the unresampled (initial) fit (for hw=50)
wus_locfit <- data.frame(targage$x, pred02$fit)
fitname <- paste("locfit_",as.character(hw), sep="")
colnames(wus_locfit) <- c("age", fitname)
lines(wus_locfit[,1], wus_locfit[,2], lwd=2, col="red")

# Step 4 -- Find and add bootstrap CIs
yfit95 <- apply(yfit, 1, function(x) quantile(x,prob=0.975, na.rm=T))
yfit05 <- apply(yfit, 1, function(x) quantile(x,prob=0.025, na.rm=T))
lines(targage$x, yfit95, lwd=1, col="red")
lines(targage$x, yfit05, lwd=1, col="red")

Figure 4. Bootstrap confidence intervals

The first step simply creates a frame for plotting the individual composite curves (in a faint and transparent grey color). Step 2 executes nreps iterations of generating a random list of sites (randsitenum), and then the two lines

  x <- na.omit(as.vector(age[,randsitenum]))
  y <- na.omit(as.vector(influx[,randsitenum]))

generate vectors of age and influx values by stacking the random sample of individual sites. Then the composite curve for that sample is generated and plotted.

Step 3 overplots the “unresampled” composite curve, i.e. that for all of the samples, as they were read in. Step 4 gets the 5th and 95th percentiles of the nreps composite curves at each target point, and then plots those values.

6 Comparison with Fortran-calculated composite curves

Finally, the composite curves and confidence intervals calculated using the locfit() function can be compared with those calculated using smooth_curve_ninwin.f90

# comparison with Fortran calculated values
plot(x, y, xlab="age", ylab="locfit_50", ylim=c(-1,1), xlim=c(0,2000), type="n")
# locfit curves
lines(wus_locfit[,1], wus_locfit[,2], lwd=2, col="red")
lines(targage$x, yfit95, lwd=1, col="red")
lines(targage$x, yfit05, lwd=1, col="red")
# smooth_curve_ninwin.f90 curves
lines(wus_lwr50$targ_age, wus_lwr50$influx_zt2k_sm10_0050, lwd=2, col="blue")
lines(wus_lwr50$targ_age, wus_lwr50$yfit_boot_cu95, lwd=1, col="blue")
lines(wus_lwr50$targ_age, wus_lwr50$yfit_boot_cl95, lwd=1, col="blue")

Figure 5. Comparison of locfit and Fortran fit curves

Pretty good. The variations in the CIs are not significant. If the same list of random samples could be generated in the Fortran and locfit() approaches, I think the curves would be identical.

Using the locfit Package for Composite Curves