This is a demonstration of how the locfit package can be used to replicate the local-regression composite-curve fitting procedure that is implemented in the smooth_curve.f90
family of Fortran programs. The current version of such Fortran programs is smooth_curve_ninwin.f90
which uses a hybrid approach for determining the window width for the fitted values that make up a composite curve: When there are more than minpts
samples (usually 30) available for determining an individual fitted value, then the window half-width specified by the param
value in the info file for smooth_curve_ninwin.f90
is used; when there are fewer than minpts
values in the window, then the window is expanded until minpts
samples fall within the window. This approach was invented for analyses of long charcoal records (e.g. Daniau et al., 2010; Mooney et al., 2011), and is kind of a blend between a constant bandwidth window and a “span” window. For most applications, like Marlon et al. (2012), Vannière et al. (2011) or Power et al. (2012), sample density is sufficient enough that the “span” approach is never invoked.
The examples here follow the analyses in Marlon et al. (2012) of western U.S. charcoal records. First, composite curves fit with the locfit package are compared with those fit by smooth_curve_ninwin.f90
, and then the use of the locfit package to get bootstrap sample-by-site confidence intervals is illustrated.
Begin by setting various paths to the data files (presmoothed or “binned” data in this case), and reading a “regions” file or list of sites to process:
# composite curve via the locfit package
# bootstrap-by-site confidence intervals
library(locfit)
# presampled file
csvpath <- "e:/Projects/GPWG/work/wus/data/influx/csv_files/wus2/wus2_presamp/"
csvname <- "_presamp_influx_zt2k_sm10.csv"
# regionfiles
regionpath <- "e:/Projects/GPWG/work/wus/data/influx/regions/wus2/"
regionname <- "wus2_sites.csv"
regionfile <- paste(regionpath, regionname, sep="")
# read the list of sites
wus_sites <- read.csv(regionfile)
head(wus_sites)
## ID_SI Lat Lon Elev
## 1 1 44.6628 -110.6161 2530
## 2 2 44.9183 -110.3467 1884
## 3 3 45.7044 -114.9867 2250
## 4 4 45.8919 -114.2619 2300
## 5 5 46.3206 -114.6503 1770
## 6 6 45.8406 -113.4403 1921
## Sitename
## 1 Cygnet
## 2 Slough Creek Pond
## 3 Burnt Knob
## 4 Baker
## 5 Hoodoo
## 6 Pintlar
ns <- length(wus_sites$ID_SI)
ns
## [1] 69
Next, various matrices for storing input data and fitted values are defined. The parameter maxrecs
is set to the largest number of samples any individual site is likely to have, and the matrices age
and influx
are defined to store the individual records.
# matrices for data and fitted values
maxrecs <- 1000
age <- matrix(NA, ncol=ns, nrow=maxrecs)
influx <- matrix(NA, ncol=ns, nrow=maxrecs)
Next, a set of “target” ages are defined, for which composite-curve fitted values will be obtained
targage <- data.frame(x=seq(-50,2000,10))
length(targage$x)
## [1] 206
A matrix yfit
is defined that will hold the fitted values for each of the maxreps
potential bootstrap replications.
maxreps <- 1000
yfit <- matrix(NA, nrow=length(targage$x), ncol=maxreps)
The ns
sites with presample (binned) files are read by looping over the wus_sites
list of site determined above.
# read and store the presample (binned) files as matrices of ages and influx values
ii <- 0
for (i in 1:ns) {
sitenum <- wus_sites$ID_SI[i]
# print(sitenum)
siteidchar <- as.character(sitenum)
if (sitenum >= 1) siteid <- paste("000", siteidchar, sep="")
if (sitenum >= 10) siteid <- paste("00", siteidchar, sep="")
if (sitenum >= 100) siteid <- paste("0", siteidchar, sep="")
if (sitenum >= 1000) siteid <- paste( siteidchar, sep="")
inputfile <- paste(csvpath, siteid, csvname, sep="")
# print(inputfile)
indata <- read.csv(inputfile)
nsamp <- length(indata$xtarg)
if (nsamp > 0) {
ii <- ii+1
age[1:nsamp,ii] <- indata$xtarg
influx[1:nsamp,ii] <- indata$yfit
}
}
nsites <- ii
nsites
## [1] 66
Note that while there are 69 sites in the region, there are only 66 sites that have samples within the interval of interest in Marlon et al. (2012) (i.e. -60 to 3500 yr BP). Take a look at the first five sites:
head(age[,1:5]); head(influx[,1:5])
## [,1] [,2] [,3] [,4] [,5]
## [1,] 40 40 30 20 40
## [2,] 50 50 40 30 50
## [3,] 90 60 60 50 60
## [4,] 100 70 70 60 70
## [5,] 130 80 90 80 80
## [6,] 140 90 100 90 90
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.221428 1.874877 -2.026761 -3.293572 -1.178175
## [2,] 1.221428 2.737898 -2.026761 -3.293572 -1.011799
## [3,] 1.843045 3.745885 0.047784 -1.824767 -0.089942
## [4,] 1.843045 1.454438 0.047784 -1.824767 0.209888
## [5,] 1.751002 1.306028 -1.776078 0.429581 0.947839
## [6,] 1.751002 1.519223 -1.776078 0.429581 1.177048
For comparison with the fitted values calculated using the locfit package, read in the fitted values for various window widths calculated by smooth_curve_ninwin.f90
.
# get standard lwr fits for comparison
csvfile <- "wus2_sites_boot_influx_zt2k_sm10_0100.csv"
wus_lwr100 <- read.csv(csvfile)
head(wus_lwr100)
## targ_age influx_zt2k_sm10_0100 sefit fitcu95 fitcl95 np influx_zt2k_sm10_0100_boot
## 1 -50 -0.311041 0.138363 -0.034315 -0.587767 365 -0.294355
## 2 -40 -0.294545 0.112409 -0.069728 -0.519362 407 -0.280888
## 3 -30 -0.292513 0.094485 -0.103543 -0.481483 449 -0.281082
## 4 -20 -0.292387 0.082492 -0.127402 -0.457372 487 -0.283797
## 5 -10 -0.284519 0.074327 -0.135865 -0.433172 529 -0.279600
## 6 0 -0.266382 0.068264 -0.129854 -0.402910 575 -0.264092
## yfit_boot_median yfit_boot_se yfit_boot_cu95 yfit_boot_cl95 nsites window ninwin
## 1 -0.289340 0.010388 0.316723 -0.961526 20 100 56
## 2 -0.286513 0.009083 0.253231 -0.863503 24 110 56
## 3 -0.283450 0.008017 0.183805 -0.806999 25 120 56
## 4 -0.284105 0.007227 0.135947 -0.756348 27 130 56
## 5 -0.275095 0.006765 0.127057 -0.714212 34 140 56
## 6 -0.265919 0.006589 0.146592 -0.682071 37 150 56
csvfile <- "wus2_sites_boot_influx_zt2k_sm10_0050.csv"
wus_lwr50 <- read.csv(csvfile)
head(wus_lwr50)
## targ_age influx_zt2k_sm10_0050 sefit fitcu95 fitcl95 np influx_zt2k_sm10_0050_boot
## 1 -60 -0.377838 0.282355 0.186871 -0.942547 130 -0.347443
## 2 -50 -0.367549 0.170339 -0.026871 -0.708228 167 -0.346411
## 3 -40 -0.329515 0.119968 -0.089580 -0.569450 206 -0.312186
## 4 -30 -0.292634 0.099149 -0.094337 -0.490931 243 -0.279276
## 5 -20 -0.272030 0.089228 -0.093573 -0.450486 282 -0.261759
## 6 -10 -0.269254 0.083461 -0.102332 -0.436177 325 -0.260824
## yfit_boot_median yfit_boot_se yfit_boot_cu95 yfit_boot_cl95 nsites window ninwin
## 1 -0.343890 0.012130 0.364576 -1.137482 0 40 31
## 2 -0.336702 0.010119 0.211341 -1.006843 20 50 39
## 3 -0.296076 0.009021 0.215313 -0.896109 24 60 44
## 4 -0.285605 0.008208 0.198580 -0.806842 25 70 45
## 5 -0.265827 0.007687 0.190423 -0.766989 27 80 46
## 6 -0.260845 0.007466 0.210857 -0.739243 34 90 52
csvfile <- "wus2_sites_boot_influx_zt2k_sm10_0025.csv"
wus_lwr25 <- read.csv(csvfile)
head(wus_lwr25)
## targ_age influx_zt2k_sm10_0025 sefit fitcu95 fitcl95 np influx_zt2k_sm10_0025_boot
## 1 -60 -0.415270 0.647755 0.880241 -1.710780 44 -9999.000000
## 2 -50 -0.345052 0.220051 0.095051 -0.785154 69 -0.325112
## 3 -40 -0.331090 0.137191 -0.056707 -0.605472 96 -0.317873
## 4 -30 -0.293802 0.124393 -0.045015 -0.542589 130 -0.277065
## 5 -20 -0.255178 0.124455 -0.006268 -0.504088 147 -0.233367
## 6 -10 -0.218485 0.124256 0.030027 -0.466996 162 -0.199182
## yfit_boot_median yfit_boot_se yfit_boot_cu95 yfit_boot_cl95 nsites window ninwin
## 1 -0.914580 -9.9990e+03 0.550864 -0.699302 0 10 24
## 2 -0.314217 1.0376e-02 0.285053 -1.000992 20 20 28
## 3 -0.308492 9.0650e-03 0.213352 -0.904349 24 30 31
## 4 -0.284772 9.0090e-03 0.274345 -0.853511 25 40 39
## 5 -0.238646 9.3490e-03 0.378778 -0.806725 27 40 44
## 6 -0.213442 9.1840e-03 0.378544 -0.782698 34 40 43
The first step in applying the locfit()
function is to reshape the ns
by maxrecs
matrices storing age
and influx
values into long vectors, while also removing NAs
.
# reshape matrices into vectors (and drop NAs)
x <- na.omit(as.vector(age))
y <- na.omit(as.vector(influx))
# take a look at the resulting data
head(cbind(x,y),10)
## x y
## [1,] 40 1.221428
## [2,] 50 1.221428
## [3,] 90 1.843045
## [4,] 100 1.843045
## [5,] 130 1.751002
## [6,] 140 1.751002
## [7,] 180 1.497879
## [8,] 190 1.497879
## [9,] 230 1.037331
## [10,] 240 1.037331
tail(cbind(x,y),10)
## x y
## [13799,] 3810 2.436500
## [13800,] 3840 1.917886
## [13801,] 3850 1.917886
## [13802,] 3880 1.580854
## [13803,] 3890 1.580854
## [13804,] 3920 2.570799
## [13805,] 3930 2.570799
## [13806,] 3960 2.636206
## [13807,] 3970 2.636206
## [13808,] 4000 5.292959
Next, set a window half-width, and execute the locfit()
function. For this first example, a (half) window width of 25 yrs is used:
# locfit window-width parameter
hw <- 25 # window half width (smoothing parameter)
# initial fit, unresampled (i.e. all) data
loc01 <- locfit(y ~ lp(x, deg=1, h=hw), maxk=200, family="qrgauss")
summary(loc01)
## Estimation type: Local Regression
##
## Call:
## locfit(formula = y ~ lp(x, deg = 1, h = hw), maxk = 200, family = "qrgauss")
##
## Number of data points: 13808
## Independent variables: x
## Evaluation structure: Rectangular Tree
## Number of evaluation points: 257
## Degree of fit: 1
## Fitted Degrees of Freedom: 68.607
The various arguments that produce a fit similar to that provided by smooth_curve_ninwin.f90
are as follows: The argument deg=1
fits a local linear regression, with a (half) window width h
of 25 years in this case (hw
). The argument maxk=200
provides sufficient internal storage to allow determining fitted values at relatively small increments (10 years in this case). The family="qrgauss"
argument invokes a “robustness” step in fitting, where (as in smooth_curve_ninwin.f90
) the influence of outliers or unusual points are downweighted.
The fitted values are obtained using predict()
, at the target years defined earlier. The fitted values calculated using smooth_curve_ninwin.f90
are added to the plot for comparison.
# get and plot fitted values
pred01 <- predict(loc01, newdata=targage, se.fit=TRUE)
wus_locfit <- data.frame(targage$x, pred01$fit)
fitname <- paste("locfit_",as.character(hw), sep="")
colnames(wus_locfit) <- c("age", fitname)
head(wus_locfit)
## age locfit_25
## 1 -50 -0.2864345
## 2 -40 -0.2901339
## 3 -30 -0.2520387
## 4 -20 -0.1936271
## 5 -10 -0.1721479
## 6 0 -0.2141500
plot(wus_locfit[,1], wus_locfit[,2], xlab="age", ylab=fitname, type="l", lwd=2, col="red")
# add fitted values from smooth_curve_ninwin.f90
lwrfitname <- paste("wus_lwr",as.character(hw), sep="")
lines(eval(as.name(lwrfitname))[,1], eval(as.name(lwrfitname))[,2], lwd=2, col="blue")
The agreement is pretty damn good, with only a little difference at the end of the records, probably related to various ad-hoc decisions implemented in the lowess2_module.f90
module that smooth_curve_ninwin.f90
uses. (Could have saved a lot of time…)
Here are similar comparisons for window widths of 50…
hw <- 50 # window half width (smoothing parameter)
loc02 <- locfit(y ~ lp(x, deg=1, h=hw), maxk=200, family="qrgauss")
summary(loc02)
## Estimation type: Local Regression
##
## Call:
## locfit(formula = y ~ lp(x, deg = 1, h = hw), maxk = 200, family = "qrgauss")
##
## Number of data points: 13808
## Independent variables: x
## Evaluation structure: Rectangular Tree
## Number of evaluation points: 129
## Degree of fit: 1
## Fitted Degrees of Freedom: 34.942
# get and plot fitted values
pred02 <- predict(loc02, newdata=targage, se.fit=TRUE)
wus_locfit <- data.frame(targage$x, pred02$fit)
fitname <- paste("locfit_",as.character(hw), sep="")
colnames(wus_locfit) <- c("age", fitname)
head(wus_locfit)
## age locfit_50
## 1 -50 -0.3247606
## 2 -40 -0.2868574
## 3 -30 -0.2562198
## 4 -20 -0.2381492
## 5 -10 -0.2342783
## 6 0 -0.2391529
plot(wus_locfit[,1], wus_locfit[,2], xlab="age", ylab=fitname, type="l", lwd=2, col="red")
# add fitted values from smooth_curve_ninwin.f90
lwrfitname <- paste("wus_lwr",as.character(hw), sep="")
lines(eval(as.name(lwrfitname))[,1], eval(as.name(lwrfitname))[,2], lwd=2, col="blue")
and 100…
hw <- 100 # window half width (smoothing parameter)
loc03 <- locfit(y ~ lp(x, deg=1, h=hw), maxk=200, family="qrgauss")
summary(loc03)
## Estimation type: Local Regression
##
## Call:
## locfit(formula = y ~ lp(x, deg = 1, h = hw), maxk = 200, family = "qrgauss")
##
## Number of data points: 13808
## Independent variables: x
## Evaluation structure: Rectangular Tree
## Number of evaluation points: 65
## Degree of fit: 1
## Fitted Degrees of Freedom: 18.156
# get and plot fitted values
pred03 <- predict(loc03, newdata=targage, se.fit=TRUE)
wus_locfit <- data.frame(targage$x, pred03$fit)
fitname <- paste("locfit_",as.character(hw), sep="")
colnames(wus_locfit) <- c("age", fitname)
head(wus_locfit)
## age locfit_100
## 1 -50 -0.2548945
## 2 -40 -0.2536073
## 3 -30 -0.2522736
## 4 -20 -0.2487952
## 5 -10 -0.2410737
## 6 0 -0.2270108
plot(wus_locfit[,1], wus_locfit[,2], xlab="age", ylab=fitname, type="l", lwd=2, col="red")
# add fitted values from smooth_curve_ninwin.f90
lwrfitname <- paste("wus_lwr",as.character(hw), sep="")
lines(eval(as.name(lwrfitname))[,1], eval(as.name(lwrfitname))[,2], lwd=2, col="blue")
At this point it’s pretty clear that “pure R” can be used to replicate the composite curve fitting done by the Fortran program.
Bootstrap confidence intervals that reflect the influence of the particular collection of sites on the composite curve can be developed by repeatedly sampling (with replacement) the individual site records (as opposed to the usual bootstrapping approach where individual data values are sampled with replacement). Bootstrapping-by-site is appropriate here, because the issue is how sensitive the composite curve is to the particular collection of sites that contribute to the curve, as opposed to how sensitive the curve is to individual samples. (Because a robustness-iteration approach is used in applying the locfit()
function, the influence of individual samples has already been minimized.)
For the example here, we’ll use a (half) window width of 50. First, set up a plot in which the individual composite curves for each bootstrap replication will be plotted. There are multiple steps in the procedure:
# Step 1 -- Set up to plot individual replications
plot(x, y, xlab="age", ylab="locfit_50", ylim=c(-1,1), xlim=c(0,2000), type="n")
# Step 2 -- Do the bootstrap iterations, and plot each composite curve
hw <- 50
nreps <- 1000
set.seed(10) # do this to get the same sequence of random samples for each run
for (i in 1:nreps) {
#print(i)
randsitenum <- sample(seq(1:nsites), nsites, replace=TRUE)
# print(head(randsitenum))
x <- na.omit(as.vector(age[,randsitenum]))
y <- na.omit(as.vector(influx[,randsitenum]))
locboot <- locfit(y ~ lp(x, deg=1, h=hw), maxk=200, family="qrgauss")
predboot <- predict(locboot, newdata=targage, se.fit=TRUE)
yfit[,i] <- predboot$fit
# note plotting lines is slowww
lines(targage$x, yfit[,i], lwd=2, col=rgb(0.5,0.5,0.5,0.10))
}
# Step 3 -- Plot the unresampled (initial) fit (for hw=50)
wus_locfit <- data.frame(targage$x, pred02$fit)
fitname <- paste("locfit_",as.character(hw), sep="")
colnames(wus_locfit) <- c("age", fitname)
lines(wus_locfit[,1], wus_locfit[,2], lwd=2, col="red")
# Step 4 -- Find and add bootstrap CIs
yfit95 <- apply(yfit, 1, function(x) quantile(x,prob=0.975, na.rm=T))
yfit05 <- apply(yfit, 1, function(x) quantile(x,prob=0.025, na.rm=T))
lines(targage$x, yfit95, lwd=1, col="red")
lines(targage$x, yfit05, lwd=1, col="red")
The first step simply creates a frame for plotting the individual composite curves (in a faint and transparent grey color). Step 2 executes nreps
iterations of generating a random list of sites (randsitenum
), and then the two lines
x <- na.omit(as.vector(age[,randsitenum]))
y <- na.omit(as.vector(influx[,randsitenum]))
generate vectors of age
and influx
values by stacking the random sample of individual sites. Then the composite curve for that sample is generated and plotted.
Step 3 overplots the “unresampled” composite curve, i.e. that for all of the samples, as they were read in. Step 4 gets the 5th and 95th percentiles of the nreps
composite curves at each target point, and then plots those values.
Finally, the composite curves and confidence intervals calculated using the locfit()
function can be compared with those calculated using smooth_curve_ninwin.f90
# comparison with Fortran calculated values
plot(x, y, xlab="age", ylab="locfit_50", ylim=c(-1,1), xlim=c(0,2000), type="n")
# locfit curves
lines(wus_locfit[,1], wus_locfit[,2], lwd=2, col="red")
lines(targage$x, yfit95, lwd=1, col="red")
lines(targage$x, yfit05, lwd=1, col="red")
# smooth_curve_ninwin.f90 curves
lines(wus_lwr50$targ_age, wus_lwr50$influx_zt2k_sm10_0050, lwd=2, col="blue")
lines(wus_lwr50$targ_age, wus_lwr50$yfit_boot_cu95, lwd=1, col="blue")
lines(wus_lwr50$targ_age, wus_lwr50$yfit_boot_cl95, lwd=1, col="blue")
Pretty good. The variations in the CIs are not significant. If the same list of random samples could be generated in the Fortran and locfit()
approaches, I think the curves would be identical.