scorpan: a refresher

In a previous lesson we introduced scorpan as a framework for digital soil mapping activities. We said that mathematically, the framework can be expressed as:

\[S=f(s,c,o,r,p,a,n)+ε\]

where the soil—its classes or properties, \(S\)—is considered a function of seven soil-explaining factors plus an associated error term, \(ε\), which may be spatially autocorrelated.

We’ll expand on the definition of the scorpan covariates by providing some examples of environmental datasets that may be able to represent them. You may be able to think of some other examples:

Factor   Explanation            Example
\(s\)    soil                   soil point observations, choropleth maps
\(c\)    climate                maps of precipitation and temperature
\(o\)    organisms (biota)      remotely-sensed imagery, land use maps, vegetation maps
\(r\)    relief (topography)    digital elevation models, terrain attributes
\(p\)    parent material        geological maps, gamma radiometrics
\(a\)    age (time)             geological maps
\(n\)    spatial position       easting, northing, distance from features, topographic exposure

Hunter Valley scorpan covariates

The hunterCovariates dataset in the ithir package is a RasterStack of scorpan covariates. Let’s load it and see what it contains:

library(ithir)
library(raster)

data(hunterCovariates)
names(hunterCovariates) <- c("aacn", "drainage_index", "insolation",
                             "twi", "total_count")
crs(hunterCovariates) <- "+proj=utm +zone=56 +south +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs "
hunterCovariates
## class       : RasterStack 
## dimensions  : 860, 675, 580500, 5  (nrow, ncol, ncell, nlayers)
## resolution  : 25, 25  (x, y)
## extent      : 334497.3, 351372.3, 6362628, 6384128  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=56 +south +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## names       :       aacn, drainage_index, insolation,        twi, total_count 
## min values  :   0.000000,       1.114361, 988.530762,   7.264403,  102.240997 
## max values  :  211.14955,        5.00000, 1969.79663,   22.91941,   555.87250
plot(hunterCovariates)

So hunterCovariates is a RasterStack, and there are five layers within it:

  • aacn is altitude above channel network
  • drainage_index is a soil drainage index developed by Brendan Malone at the University of Sydney. It’s based on soil colour
  • insolation quantifies incoming solar radiation
  • twi is the topographic wetness index devised by Beven & Kirkby (1979)
  • total_count is the total number of gamma ray counts across the whole spectrum measured by a gamma ray spectrometer

Data preparation

Intersect observations with covariates

The first thing we need to do is intersect the observations of 60–100 cm pH with the scorpan covariates. This is easily done by using the extract function in the raster package:

# Load soil observations
data(HV_subsoilpH)

# Convert to simple features data.frame
library(sf)
subsoil_ph <- st_as_sf(HV_subsoilpH,
                       coords = c('X', 'Y'),
                       crs = "+proj=utm +zone=56 +south +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs ")

# Rename the pH column
colnames(subsoil_ph)[1] <- "ph"

# Extract covariates
obs_covariates <- extract(hunterCovariates, as(subsoil_ph, "Spatial"))

Validation data selection

As in a previous exercise, we’ll split our observations into two subsets: one for calibration (model building) and the other for validation (assessing the model once it’s built). We’ll use the same 100 randomly selected observations for the validation dataset and the remainder for calibration.

# Set the random seed so that you get the same results
set.seed(12345)

# Select 100 rows from subsoil_ph at random
val_rows <- sample(nrow(subsoil_ph), 100)

Fit a Cubist regression tree

Decision trees are becoming increasingly popular tools for digital soil mapping practitioners. They are attractive for several reasons. For example, the output is easily comprehensible and readily interpretable, and unlike linear models, they make no assumptions about the distribution of residuals. On the other hand, they are more prone to overfitting than linear models.

We’ll use a regression tree called Cubist here. It is part of the Cubist package, and is also available through the caret package.

Let’s fit a Cubist model to the subsoil pH data.

library(caret)
library(Cubist)

# What parameters does the Cubist model take?
modelLookup(model = "cubist")
##    model  parameter       label forReg forClass probModel
## 1 cubist committees #Committees   TRUE    FALSE     FALSE
## 2 cubist  neighbors  #Instances   TRUE    FALSE     FALSE
# Grid of tuning parameters
grid <- expand.grid(committees = 1, neighbors = 0)

# Fit a single Cubist model using all calibration data
cubist_fit <- train(x = obs_covariates[-val_rows, ],
                    y = subsoil_ph$ph[-val_rows],
                    method = "cubist",
                    trControl = trainControl(method = "none"),
                    tuneGrid = grid)

It’s pretty simple. We discovered the tuneable parameters we needed to specify using modelLookup, and we turned off resampling by passing trainControl(method = "none") (we’ll return to that later).

Let’s take a look at the tree that we fit:

# Get a summary of the Cubist model
summary(cubist_fit$finalModel)
## 
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sun Oct 15 08:55:49 2017
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 406 cases (6 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [406 cases, mean 6.502826, range 3.437355 to 9.741758, est err 1.028754]
## 
##  outcome = 2.697171 + 0.033 aacn + 0.196 twi
## 
## 
## Evaluation on training data (406 cases):
## 
##     Average  |error|           1.239357
##     Relative |error|               1.12
##     Correlation coefficient        0.20
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    aacn
##           100%    twi
## 
## 
## Time: 0.0 secs

Now let’s predict onto the validation samples:

ph_val <- subsoil_ph[val_rows, ]
ph_val$z_cubist <- predict(cubist_fit,
                           newdata = obs_covariates[val_rows, ])

We’ll return to the validation in a minute. Finally, we can predict onto the raster grid:

# Predict fitted Cubist model onto covariate grid
z_grid <- raster::predict(hunterCovariates, cubist_fit)
z_grid
## class       : RasterLayer 
## dimensions  : 860, 675, 580500  (nrow, ncol, ncell)
## resolution  : 25, 25  (x, y)
## extent      : 334497.3, 351372.3, 6362628, 6384128  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=56 +south +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## data source : in memory
## names       : layer 
## values      : 5.04674, 11.08893  (min, max)
# Notice that it's already a raster
class(z_grid)
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
# Plot the raster
plot(z_grid, main = "Cubist 60-100 cm pH")

Validate the Cubist model

goof(ph_val$ph, ph_val$z_cubist)
##          R2 concordance      MSE     RMSE       bias
## 1 0.0819892   0.1621023 1.627183 1.275611 -0.2365864
library(ggplot2)
ggplot(ph_val, aes(x = ph, y = z_cubist)) +
  geom_abline(colour = "#CCCCCC", linetype = 2) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  xlim(3.0, 9.75) + ylim(3.25, 9.75)

Resampling methods

I mentioned earlier that we’d turned off resampling in train by declaring trainControl(method = "none"). What is resampling all about, and why might we want to use it?

Resampling is a statistical technique that is commonly used to estimate the precision of a sample statistic. It involves resampling a master dataset, \(\mathfrak{L}\), to form \(k\) sub-datasets, \(\mathfrak{L}_i\), \(i = 1, 2, \ldots, k\), each of which is used to compute the sample statistic. In doing so, we obtain an empirical sampling distribution of \(k\) values of the statistic, which can be used to estimate confidence intervals for it.

Bootstrapping pH

One commonly-used resampling method is called bootstrapping. It was devised by Efron (1979) and consists of forming the sub-datasets \(\mathfrak{L}_i\) by random sampling with replacement. If the master dataset has size \(n\), each of the sub-datasets will also have size \(n\) but, because of the sampling with replacement, will contain only about two thirds of the original observations. In other words, some of the original data will be represented more than once in each sub-dataset.
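
If you want to convince yourself of the “about two thirds” figure, a couple of lines of R will do it: the expected proportion of unique observations in a bootstrap resample of size \(n\) is \(1-(1-1/n)^n\), which approaches \(1-e^{-1} \approx 0.632\) as \(n\) grows (the value of n below is arbitrary):

# Average proportion of unique observations across 100 bootstrap resamples
# of a dataset of size n; this should come out close to 0.632
n <- 1000
mean(replicate(100, length(unique(sample(n, n, replace = TRUE))) / n))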

Bootstrapping may also have other advantages. Breiman (1996) demonstrated how bootstrapping could be used in the context of classification using classification trees. We are dealing with continuous variables rather than categorical variables here, but he made the point that bootstrapping may improve the accuracy of estimates of the target variable if the method used to compute the \(k\) realisations of the target variable is unstable. Unstable methods are those where small changes in \(\mathfrak{L}\), such as embodied in the \(\mathfrak{L}_i\), result in large changes in the target variable. Neural networks, classification trees and regression trees are among those methods that are generally considered unstable.

The train function in the caret package implements a range of resampling methods, including bootstrapping, but resampling in train is mainly used to tune model parameters and estimate model performance. Predictions made from the fitted object with predict come from a single final model rather than from the individual bootstrap realisations, whereas we’d like access to the predictions from the realisations themselves.
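
As an aside, here is roughly how train could be asked to tune the Cubist parameters using its built-in bootstrap resampling (the object name and grid values below are only illustrative). Note that this still wouldn’t give us the per-realisation predictions we’re after:

# Tune committees and neighbors over 25 bootstrap resamples of the
# calibration data (illustrative settings only)
cubist_tuned <- train(x = obs_covariates[-val_rows, ],
                      y = subsoil_ph$ph[-val_rows],
                      method = "cubist",
                      trControl = trainControl(method = "boot", number = 25),
                      tuneGrid = expand.grid(committees = c(1, 5, 10),
                                             neighbors = c(0, 5, 9)))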

There are a couple of ways to do this in R. We’ll do it manually so that you can see what’s going on. First, some preliminaries:

# Subset calibration data from subsoil_ph
ph_cal <- subsoil_ph[-val_rows, ]
cal_covariates <- obs_covariates[-val_rows, ]

# Number of bootstrap realisations
k <- 10

# Empty matrix to store validation predictions
val_preds <- matrix(nrow = nrow(ph_val), ncol = k)

# Empty RasterStack to store rasters of bootstrap realisation predictions
z_realisations <- stack()

Here’s the actual bootstrapping engine. This sort of thing is readily parallelisable, but we’ll run it in series here. Just don’t set k to too large a value!

# Do k bootstrap realisations
for(i in 1:k) {
  
  # Generate a bootstrap resample of ph_cal rows
  cal_rows <- sample(nrow(ph_cal), nrow(ph_cal), replace = TRUE)
  
  # Fit a Cubist model
  cubist_fit_boot <- train(x = cal_covariates[cal_rows, ],
                           y = ph_cal$ph[cal_rows],
                           method = "cubist",
                           trControl = trainControl(method = "none"),
                           tuneGrid = grid)
  
  # Predict onto validation samples
  val_preds[, i] <- predict(cubist_fit_boot, 
                            newdata = obs_covariates[val_rows, ])
  
  # Predict onto raster
  z_realisations <- stack(z_realisations,
                          raster::predict(hunterCovariates, cubist_fit_boot))
}

Validating the bootstrap

We’ll look at the raster results in a few minutes, but for now let’s take a look at the validation observations. The val_preds matrix contains the full set of \(k\) bootstrap realisation predictions for each validation observation. For each observation, the final prediction of 60–100 cm pH is the mean of these predictions:

# Aggregate the validation predictions
ph_val$z_boot <- apply(val_preds, 1, mean)

# Validate the final bootstrap prediction
goof(ph_val$ph, ph_val$z_boot)
##           R2 concordance      MSE     RMSE       bias
## 1 0.06197528   0.2043041 1.694927 1.301894 -0.1166587

Uncertainty estimation

A key, but often overlooked, component of digital soil mapping is the quantification of the uncertainty associated with our predictions. Uncertainty arises because soil spatial variation is often complex and we cannot model it perfectly (Heuvelink 2017). In other words, although we may predict the value of a soil property at a given location, the prediction is only an estimate of the true value at that location; we remain uncertain about what the true value really is.

Sources of uncertainty

Heuvelink (2017) distinguishes four sources of uncertainty that influence digital soil mapping predictions:

  • Attribute uncertainty of soil measurements due to improper sampling, improper labelling, improper measurement etc.
  • Positional uncertainty of soil measurements due to global positioning system accuracy, georeferencing errors, vague coordinates, etc.
  • Uncertainty in covariates due to method of generation, method of collection, georeferencing errors, etc.
  • Uncertainty in predictive models due to improper model specification, covariate selection, model choice, etc.

Representation of uncertainty

If we are uncertain about the true value of a soil property at a given location, we may still have enough knowledge to list several likely values and their probabilities. In other words, we may represent the true value as a probability distribution. Consider the following example from Heuvelink (2017): suppose we go out into the field and estimate the sand content of a sample of soil as 35%. Obviously a field measurement of soil texture is relatively imprecise, but we might be confident that the estimation error is not greater than 8%. Since roughly 95% of a normal distribution lies within two standard deviations of its mean, it would therefore be reasonable to represent the sand content by a normal distribution with a mean of 35% and a standard deviation of 4%.
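
To make this concrete, a one-line calculation gives the range of sand contents we would consider plausible with 90% probability under that distribution (this is just an illustration of the idea, not part of our mapping workflow):

# 90% probability interval for sand content under a normal distribution
# with mean 35% and standard deviation 4%
qnorm(c(0.05, 0.95), mean = 35, sd = 4)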

The width of the probability distribution reflects the degree of uncertainty we have: when the distribution is narrow, we are more certain that its mean reflects the true value well; when it is wide, we are less certain.
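
We can visualise this by comparing two normal distributions with the same mean but different spreads (the values are arbitrary):

# A narrow and a wide normal distribution centred on the same mean; the wider
# curve expresses greater uncertainty about the true value
curve(dnorm(x, mean = 35, sd = 2), from = 15, to = 55,
      xlab = "Sand content (%)", ylab = "Density")
curve(dnorm(x, mean = 35, sd = 6), add = TRUE, lty = 2)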

Let’s consider how we may use this notion of the probability distribution to represent uncertainty in digital soil property maps in a practical way. A soil property map gives us the best estimate of the soil property at a particular site, yet we may want some idea of the range of values that the true property value could take if we were to visit the site in person. The prediction interval can help us here. For a given level of confidence (say 90%), the prediction interval is an interval estimate of the true value at an unvisited site: we could say that there is a 90% chance that the true value at the site falls between the lower and upper bounds of the interval. The narrower the prediction interval at a given level of confidence, the less uncertain we are about the true value (although, as we shall see, the true value does not have to fall inside the interval).

The prediction interval is not the same as the confidence interval, which is an interval estimate of the mean of the distribution, and hence will be narrower than the prediction interval at the same level of confidence.
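
To see the difference between the two, here’s a small illustration using a simple linear model of pH against a single covariate (aacn); the model is purely for demonstration and isn’t part of our mapping workflow:

# Fit a simple linear model of pH on altitude above channel network
cal_df <- data.frame(ph = subsoil_ph$ph[-val_rows],
                     aacn = obs_covariates[-val_rows, "aacn"])
lm_fit <- lm(ph ~ aacn, data = cal_df)

# 90% prediction and confidence intervals for a hypothetical new site;
# the prediction interval is the wider of the two
new_site <- data.frame(aacn = 50)
predict(lm_fit, new_site, interval = "prediction", level = 0.9)
predict(lm_fit, new_site, interval = "confidence", level = 0.9)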

Quantification of uncertainty

The way that we quantify the uncertainty often depends on the method we used to make the predictions in the first place. For example, in a previous lesson we mentioned that the kriging variance is a by-product of kriging interpolation. If we assume a normal distribution, we can combine the prediction with the kriging variance to estimate a prediction interval: for a 90% interval, the limits are the prediction plus or minus 1.645 times the kriging standard deviation (the square root of the kriging variance).
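
As a minimal sketch with made-up numbers, suppose the kriging prediction at a site were 6.2 pH units and the kriging variance 0.4; the implied 90% prediction interval would be:

# 90% prediction interval from a (hypothetical) kriging prediction and variance
z_pred <- 6.2
z_kvar <- 0.4
z_pred + qnorm(c(0.05, 0.95)) * sqrt(z_kvar)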

Things can be a little trickier with some scorpan-based approaches. For example, machine learning tools like Cubist generally do not provide a prediction variance in addition to the prediction. That is one reason why resampling techniques can be so useful: for any prediction location, the set of \(k\) bootstrap predictions forms a distribution from which we can derive a prediction interval. For each site, we compute the prediction interval limits as the 5th and 95th percentiles of the distribution of its bootstrap predictions. Let’s illustrate this with reference to one of our validation sites, then we’ll go ahead and compute the prediction intervals across the whole study area.

# Plot the bootstrap predictions for one validation site
hist(val_preds[10, ], xlab = "60-100 cm pH", main = NULL)

# Mark the mean of the bootstrap distribution for this validation
# site
abline(v = mean(val_preds[10, ]), lty = 3)

# Mark the 90% prediction interval limits, i.e. the 5th and 95th
# percentiles of the bootstrap distribution for this validation site
abline(v = quantile(val_preds[10, ], probs = c(0.05, 0.95)),
       lty = 3,
       col = "red")

Now let’s compute the prediction intervals across the whole study area.

# Compute mean of bootstrap realisations
z_mean <- raster::calc(z_realisations, mean)
plot(z_mean, main = "Bootstrapped Cubist 60-100 cm pH")

# Compute prediction interval limits
z_interval <- raster::calc(z_realisations, 
                           function(x) {
                             quantile(x,
                                      probs = c(0.05, 0.95),
                                      na.rm = TRUE)
                             })
names(z_interval) <- c("lower", "upper")
plot(z_interval)

# Compute prediction interval width
z_pi_width <- z_interval[[2]] - z_interval[[1]]
plot(z_pi_width, main = "90-percent prediction interval width")

Validate the prediction intervals

We can also validate the prediction intervals, since we have a validation sample of pH values across the area of interest. We do this by computing the proportion of validation pH values that lie within their prediction intervals. If the uncertainties are unbiased, we should expect this proportion to approximate the prediction interval confidence level, which in our case is 90%. Two other situations may also occur:

  • The uncertainty may be overestimated (prediction intervals too wide) if the proportion of validation values inside their prediction intervals is much greater than the confidence level
  • The uncertainty may be underestimated (prediction intervals too narrow) if the proportion of validation values inside their prediction intervals is much smaller than the confidence level

# Compute prediction interval limits for validation samples
val_pi <- t(apply(val_preds, 1, function(x) quantile(x, probs = c(0.05, 0.95))))

# Compute proportion of validation samples inside their prediction
# intervals (should be close to 0.9)
sum((ph_val$ph >= val_pi[, 1]) & (ph_val$ph <= val_pi[, 2])) / nrow(ph_val)
## [1] 0.55

Finally, let’s try to visualise the validation observations in relation to their prediction intervals:

# Create a data.frame that contains all the variables we need
val_pi <- cbind(ph_val$ph, val_pi)
val_pi <- as.data.frame(val_pi)
val_pi <- val_pi[order(val_pi[, 1]), ]
val_pi <- data.frame(1:nrow(val_pi), val_pi)
names(val_pi) <- c("id", "ph", "lower", "upper")

# Create the plot
ggplot(val_pi, aes(id, ph)) +
  geom_linerange(aes(ymin = lower, ymax = upper), colour = "#00bfc4") +
  geom_point(data = val_pi, aes(id, ph), colour = "red", size = 1.3)

References

Beven K, Kirkby MJ 1979. A physically based, variable contributing area model of basin hydrology. Hydrological Sciences Bulletin 24:43–69. doi:10.1080/02626667909491834

Breiman L 1996. Bagging predictors. Machine Learning 24(2):123–140. doi:10.1007/BF00058655

Efron B 1979. Bootstrap Methods: Another look at the jackknife. The Annals of Statistics 7(1):1–26. doi:10.1214/aos/1176344552