We’ve learned a range of things so far, including basic programming in R and how to use R to manipulate spatial data. Now it’s time to do something a bit more soil-oriented: we want to make use of what we’ve learned so far to create some maps of soil.

Some definitions

A digital soil map is a map of some soil property or set of soil classes that has been produced by statistical inference. Digital soil maps are also sometimes called predictive soil maps or quantitative soil maps, but their essence is the same. They are fundamentally different in their method of production to digitised soil maps, which are legacy soil maps that have been scanned from paper to a raster format, orthorectified and georeferenced, then digitised to a vector format in a geographic information system. Legacy soil maps are typically those soil maps that were produced by hand by tracing lines on aerial photographs. The lines are not random but are based on a soil-landscape model that relates the spatial variation in soil to landscape and environmental features.

A generic model for the prediction of soil

The premise behind contemporary digital soil mapping is that if soil can be related to other components of the physical environment through a quantitative model, we can use the model to make predictions about soil classes and properties at locations in the landscape where information about those environmental components is available. Ordinarily, we make use of the fact that environmental information can be recorded and represented as a raster grid across the area of interest, which means that we can make predictions of the soil at regular intervals across the same area. When the grid is of a moderate to high resolution—that is to say, when the distance between the centres of adjacent grid cells is small, perhaps 100 m or less—digital soil mapping techniques can provide a detailed spatial coverage of the soil variable of interest.

McBratney et al. (2003) provided a formal framework around which digital soil mapping activities can be constructed. Mathematically, it is denoted:

\[S=f(s,c,o,r,p,a,n)+ε\] In words, it states that soil—its classes or properties, \(S\)—can be considered a function of seven soil-explaining factors plus an associated error term, \(ε\), which may be spatially autocorrelated. The error term is included because statistical models are never perfect representations of reality. Models make predictions about reality but there is always a difference (hopefully small) between the predictions and what we experience in reality.

The scorpan factors represent the different dimensions of the environmental space that affect the soil’s formation and development, and are described briefly in the following table:

Factor Explanation
\(s\) soil
\(c\) climate
\(o\) organisms (generically, the biotic or living component of the environment)
\(r\) relief (topography)
\(p\) parent material
\(a\) age (time)
\(n\) spatial position

When we are presented with an area of interest and asked to produce a digital soil map for it, we may find ourselves in one of three general situations:

  1. We have soil information, \(S\), but no scorpan covariates
  2. We have soil information, \(S\), and a full set of scorpan covariates
  3. We do not have soil information but we do have a full set of scorpan covariates

Nowadays, the first situation is fairly rare in practice, but the methods used to deal with it are foundational and are still especially useful at the farm scale. We’ll deal with the second situation later on. We won’t deal with the third situation in this workshop.

Soil data

The soil data we’ll use in the following exercises were collected in the Hunter Wine Country Private Irrigation District (HWCPID) at Pokolbin in New South Wales, Australia. The dataset consists of 506 observations of pH in the 60–100 cm depth interval.

Generating a prediction grid

It is common practice to make predictions of the target soil property at the nodes of a regular grid. This allows us to model the spatial variation in the target property in more detail than is typically possible in legacy soil mapping, and the data that we produce are well-suited to being stored in a raster format.

Let’s make a prediction grid. First, we’ll load the soil data from the ithir package:

library(ithir)
data(HV_subsoilpH)

The data are stored in a regular data.frame, so we’ll convert the data frame to a simple feature collection:

library(sf)

# Coordinates are in UTM Zone 56 of the Geocentric Datum of Australia 1994 
# (GDA94) coordinate reference system
subsoil_ph <- st_as_sf(HV_subsoilpH,
                       coords = c('X', 'Y'),
                       crs = 28356,
                       remove = F)
head(subsoil_ph)
## Simple feature collection with 6 features and 14 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 340344.9 ymin: 6368491 xmax: 340779.7 ymax: 6369168
## epsg (SRID):    28356
## proj4string:    +proj=utm +zone=56 +south +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##          X       Y pH60_100cm Terrain_Ruggedness_Index     AACN
## 1 340386.4 6368690   4.468124                 1.344260 1.619186
## 2 340344.9 6368491   5.424138                 1.421216 0.281174
## 3 340559.1 6369168   6.255134                 1.636886 2.300552
## 4 340482.6 6368740   8.028147                 1.035425 1.739677
## 5 340734.2 6368964   8.862939                 1.269808 3.113953
## 6 340779.7 6369166   7.284292                 0.734160 2.505943
##   Landsat_Band1 Elevation Hillshading Light_insolation Mid_Slope_Positon
## 1            57 103.13180    1.849483         1689.106          0.876207
## 2            47 103.74100    1.427987         1701.015          0.913833
## 3            59  99.91800    0.934104         1721.861          0.843865
## 4            52 101.90250    1.517358         1688.453          0.848188
## 5            62  99.78325    1.651887         1734.986          0.832627
## 6            53  97.39875    0.759718         1705.626          0.881490
##      MRVBF      NDVI      TWI    Slope                      geometry
## 1 3.851238 -0.142857 17.53831 1.790580 POINT (340386.442 6368689.83)
## 2 3.312562 -0.386139 18.16064 1.417250 POINT (340344.936 6368491.05)
## 3 3.664095 -0.196850 18.79050 1.005097 POINT (340559.118 6369167.98)
## 4 3.918084 -0.140187 18.00043 1.486318 POINT (340482.606 6368740.17)
## 5 3.888233 -0.150000 17.75517 1.829086 POINT (340734.239 6368963.76)
## 6 3.913123 -0.154639 18.72270 0.776578 POINT (340779.725 6369165.58)

In order to save some time, we won’t predict the pH across the whole of the HWCPID; we’ll just predict it across an area slightly larger than the bounding box of the observations. It’s easy to do this using the st_buffer and st_make_grid functions in the sf package:

# Make a rectangular grid of points with 25 m centre spacing and that is 200 m
# larger on all sides than the extent of the observations
nodes <- st_make_grid(st_buffer(subsoil_ph, 200),
                      cellsize = 25,
                      what = "centers")

plot(nodes, axes = TRUE)
plot(subsoil_ph, add = TRUE, col = "red")

You’ll notice that nodes is a simple feature collection, not a simple feature data.frame. To make processing it somewhat easier, we’ll cast it to a SpatialPoints object:

class(nodes)
## [1] "sfc_POINT" "sfc"
nodes <- as(nodes, "Spatial")
class(nodes)
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"

Interpolation

We can think of interpolation as spatial prediction in the absence of any scorpan covariates (more generically, predictor variables). In such cases the predicted value of the target property at any location is solely dependent on the spatial configuration of the set of locations where the target property is actually known. There are several kinds of interpolation, and we’ll take a look at three of them here. In order of increasing complexity, they are nearest-neighbour interpolation, inverse-distance-weighted interpolation, and kriging.

Nearest-neighbour interpolation

Nearest-neighbour interpolation is one of the simplest methods of interpolation in one or more dimensions. In essence, the value of the nearest-neighbour prediction at any location is the value of observation to which it is closest. Nearest-neighbour interpolation is therefore equivalent to Voronoi tesselation.

In practice, we compute the distance between a prediction node and locations of the soil observations as a simple Euclidean distance, \(D_\mathrm{E}\):

\[D_\mathrm{E} = \sqrt{{(x_1 - x_2)}^2 + {(y_1 - y_2)}^2}\] where \((x_1, y_1)\) and \((x_2, y_2)\) are the coordinates of two locations.

We can compute the pairwise distances from one set of points to another using the pointDistance function in the raster package:

library(raster)
## Loading required package: sp
d <- pointDistance(nodes,
                   as(subsoil_ph, "Spatial"),
                   lonlat = F)

The result is a matrix that contains the distances from all the points in nodes to all the points in subsoil_ph. We can confirm this by checking the size of its dimensions against the two inputs:

# Dimensions of distance matrix
dim(d)
## [1] 53760   506
# Number of nodes in prediction grid
nrow(coordinates(nodes))
## [1] 53760
# Number of soil observations
nrow(subsoil_ph)
## [1] 506

Next, we identify and extract the pH value of the observation closest to every prediction node:

# Loop over the rows of d and find the index of the closest observation
closest <- apply(d, 1, function(x) which(x == min(x)))

# Rename the pH column to avoid trouble
colnames(subsoil_ph)[3] <- "ph"

# Extract the pH value of every closest observation
z_nn <- subsoil_ph$ph[closest]

Now we can attach the nearest-neighbour predictions to the prediction nodes.

# Make a copy of the prediction nodes
preds <- nodes

# Create a new column on preds that contains the nearest-neighbour predictions
preds$ph <- z_nn

Finally, we may find it more useful for later analysis to have the results as a raster. You’ll notice that preds is a SpatialPointsDataFrame:

class(preds)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

Before we can convert preds to a raster, we need to let R know that the points are in fact aligned on a regular grid.

gridded(preds) <- TRUE
class(preds)
## [1] "SpatialPixelsDataFrame"
## attr(,"package")
## [1] "sp"

Then we can simply submit preds to the raster function:

preds <- raster(preds)
plot(preds, main = "Nearest-neighbour 60-100 cm pH")

Inverse-distance-weighted interpolation

Let’s get a little more complex. Inverse-distance-weighted (IDW) interpolation makes use of Tobler’s First Law of Geography:

Everything is related to everything else, but near things are more related than distant things. (Tobler 1970)

In IDW interpolation, values at the prediction nodes are calculated as weighted averages of the values at the observation points. The values at the observation points are weighted so that points that are closer to the prediction node carry a higher weight (more importance) in the calculation than observations that are further away. The simplest form of IDW interpolation was devised by Shepard (1968):

\[\hat{z}(\mathbf{x}_0)=\displaystyle\sum_{i=1}^{n}w_i\cdot{}z(\mathbf{x}_i)\]

\[w_i=\frac{D_\mathrm{E}(\mathbf{x}_0, \mathbf{x}_i)^{-k}}{\sum_{j=1}^{n}D_\mathrm{E}(\mathbf{x}_0, \mathbf{x}_j)^{-k}}\]

where:

  • \(\hat{z}(\mathbf{x}_0)\) is the estimate of the target soil property at the location of unobserved point \(\mathbf{x}_0=(x_0, y_0)\)
  • \(z(\mathbf{x}_i)\) are the values of the target soil property at the \(n\) observation points, \(i = 1, 2, ..., n\)
  • \(w_i\) is the weight of location \(i\)
  • \(D_\mathrm{E}(\mathbf{x}_0, \mathbf{x}_i)\) is the Euclidean distance between \(\mathbf{x}_0\) and \(\mathbf{x}_i\)
  • \(k\) is a power parameter, usually set to a value of 2. Larger \(k\) reduces the weight of distant \(\mathbf{x}_i\) and vice-versa.

Validation sample selection

In R, we do interpolation with the gstat package. But first, let’s split our observations into two subsets: one for calibration (model-building) and the other for validation (assessing the model once it’s built). We’ll take 100 randomly-selected observations for the validation dataset and use the remainder for calibration.

# Set the random seed so that you get the same results
set.seed(12345)

# Select 100 rows from subsoil_ph at random
val_rows <- sample(nrow(subsoil_ph), 100)

# Split dataset into two
ph_cal <- subsoil_ph[-val_rows, ]    # calibration
ph_val <- subsoil_ph[val_rows, ]     # validation

# Confirm that the new datasets are the right size
nrow(subsoil_ph)
## [1] 506
nrow(ph_cal)
## [1] 406
nrow(ph_val)
## [1] 100

Interpolation

library(gstat)

# Build a gstat object
# idp is the IDW power parameter
# Need to convert subsoil_ph to a Spatial object on the fly, as sf objects don't
# work with gstat
gs <- gstat(id = "ph", formula = ph ~ 1, data = as(ph_cal, "Spatial"),
            set = list(idp = 2))

# Interpolate onto the prediction grid
z_idw <- predict(gs, nodes)
## [inverse distance weighted interpolation]
# Interpolate onto the validation samples (we'll deal with them later)
ph_val$z_idw <- predict(gs, as(ph_val, "Spatial"))$ph.pred
## [inverse distance weighted interpolation]
# Plot the result
spplot(z_idw["ph.pred"])

Play around with modifying idp. What happens to the predictions when idp is greater than 2? Or less than 2?

Finally, we can convert z to a raster:

gridded(z_idw) <- TRUE
z_idw <- raster(z_idw)
plot(z_idw, main = "Inverse-distance-weighted 60-100 cm pH")

Kriging

You should be able to see that methods like IDW interpolation are improvements over more naiive methods like nearest-neighbour interpolation. Yet they are still theoretically unsatisfactory to many researchers, because they may give biased interpolation, they provide no estimate of the uncertainty in the interpolation, and they do not attempt to minimise the uncertainty in the interpolation (Burgess & Webster 1980).

Kriging is an advanced interpolation technique that mitigates these deficiencies. It was introduced in the 1960s and applied in the mining industry in order to estimate ore reserves, but has since found application in other fields too. Burgess & Webster (1980) introduced it to soil science.

As with IDW interpolation, the kriging estimate of the target soil property at an unobserved location is a weighted average of the values of the property at surrounding, observed, locations. The difference is in how the weights are determined. In IDW interpolation, the weights applied to the computation at any unobserved site are based solely on the distance of the unobserved site to the observed sites. In kriging, the weights are based on not only the distance to the observed sites but also the (modelled) variance in the target soil property between them.

There are many different types of kriging including ordinary kriging, universal kriging, co-kriging, indicator kriging, area-to-point kriging, and so-on. The standard method is ordinary kriging, which we’ll demonstrate here. In addition, kriging may yield punctual or block predictions, based on whether the predictions refer to a single point or to an area. We’ll demonstrate punctual kriging here.

Variograms

A variogram characterises the degree of variance in the soil property value at two locations as a function of the distance between them. This separation distance is commonly called the lag, \(h\).

The semivariance at lag \(h\) is computed as:

\[\hat{\gamma}(h)=\frac{1}{2p(h)}\sum_{i=1}^{p(h)}\{z(\mathbf{x}_i)-z(\mathbf{x}_i+h)\}^2\] where:

  • \(\hat{\gamma}(h)\) is the average per-observation variance (i.e. half the average variance), or semivariance, between pairs of observations separated by lag \(h\)
  • \(p(h)\) is the number of pairs of observations separated by lag \(h\)

We typically compute the semivariance at a sequence of lags. However, unless observations are collected on a regular grid, pairs of observations are not usually spaced at regular intervals. In these situations, the semivariance is computed on pairs of observations that are separated by specific ranges of lag, called the lag tolerance. So for example, we might compute the semivariance for all observations separated by 0–5 m, then 5–10 m, then 10–15m, and so-on.

A plot of \(\hat{\gamma}(h)\) versus \(h\) is called the experimental variogram. The fitted variogram is a mathematical curve fitted to the experimental variogram, and it has several components as identified in the figure below. The nugget semivariance, \(C_0\), describes the proportion of semivariance in the target soil property that occurs over short spatial distances. The sill semivariance, \(C_0 + C_1\), is the maximum semivariance between pairs of observations and the range is the separation distance at which the sill occurs.

If the target soil property is autocorrelated at the scale of observation, we would expect semivariance to increase with lag up to a point. If it is not autocorrelated at the scale of observation, the variogram may be pure nugget, which suggests that nearly all of the spatial variation in the soil property occurs at spatial scales below the minimum separation distance of our observations.

Let’s compute a variogram for the Hunter Valley pH data. As with IDW interpolation, we start by creating a gstat object:

# The same as for IDW interpolation except we don't need the idp parameter
gs <- gstat(id = "ph", formula = ph ~ 1, data = as(ph_cal, "Spatial"))

# Compute the experimental variogram
v <- variogram(gs, width = 50)
head(v)
##    np      dist     gamma dir.hor dir.ver id
## 1  74  25.26596 0.9032000       0       0 ph
## 2 149  78.26633 0.5570700       0       0 ph
## 3 230 126.50961 0.8662826       0       0 ph
## 4 281 175.87957 1.0422123       0       0 ph
## 5 374 224.62046 1.1165812       0       0 ph
## 6 452 275.68160 1.3113348       0       0 ph
plot(v)

Now we can fit a model to the experimental variogram, which is as much an art as it is a science. There are a range of so-called standard models, including exponential, spherical and Gaussian models. Here we’ll fit an exponential model by passing "Exp" as the sole parameter to vgm. If you take a look at the help documentation for vgm (i.e. ?vgm), you’ll see that we could pass plenty more parameters if we wanted to. We could think of several of these parameters as seeds, or initial values, of the variogram nugget, sill and range, which would be refined during the fitting process. In any case:

# Fit an exponential model to the experimental variogram
v_fit <- fit.variogram(v, vgm("Exp"))
v_fit
##   model     psill    range
## 1   Nug 0.7845399   0.0000
## 2   Exp 1.3028544 719.4828
# Plot experimental and fitted variogram together
plot(v, v_fit)

Interpolation

We mentioned previously that as with IDW interpolation, the kriging estimate of the target soil property at an unobserved site is a weighted average of the values at the observed sites:

\[\hat{z}(\mathbf{x}_0)=\displaystyle\sum_{i=1}^{n}\lambda_i\cdot{}z(\mathbf{x}_i)\] where \(\lambda_i\) are the kriging weights; they are constrained so that \(\sum_{i=1}^n\lambda_i = 1\). In matrix notation, the kriging weights for a given prediction location are obtained as \(\lambda=\mathbf{K}^{-1}\mathbf{k}\), where \(\mathbf{K}\) is a matrix of semivariances between each of the observations, and \(\mathbf{k}\) is a vector of the semivariances from the prediction location to the observations.

A by-product of the kriging system is the prediction variance, which is a measure of the uncertainty in the prediction:

\[\hat\sigma^2(\mathbf{x}_0)=\sum_{i=1}^n\lambda_i\gamma(\mathbf{x}_i-\mathbf{x}_0) + \Psi\] where \(\gamma(\mathbf{x}_i-\mathbf{x}_0)\) is the semivariance between observation \(i\) and the prediction location, \(\mathbf{x}_0\), and \(\Psi\) is a Lagrange parameter.

We carry out ordinary kriging with the gstat package in much the same way that we did for IDW interpolation. The major difference, of course, is that we need to specify the fitted variogram using the model argument:

# Build a gstat object
gs <- gstat(id = "ph", formula = ph ~ 1, data = as(ph_cal, "Spatial"),
            model = v_fit)

# Interpolate onto the prediction grid
z_krige <- predict(gs, nodes)
## [using ordinary kriging]
# Interpolate onto the validation samples
ph_val$z_krige <- predict(gs, as(ph_val, "Spatial"))$ph.pred
## [using ordinary kriging]
# Plot the result
spplot(z_krige)

Notice now that the kriging output contains the predictions and also the prediction variance. We can still convert z_krige to a raster, but this time we’ll create a brick. A brick is a collection of raster layers with the same dimensions and grid resolution:

gridded(z_krige) <- TRUE
z_krige <- brick(z_krige)
z_krige
## class       : RasterBrick 
## dimensions  : 256, 210, 53760, 2  (nrow, ncol, ncell, nlayers)
## resolution  : 24.94706, 24.91984  (x, y)
## extent      : 338382, 343620.8, 6364221, 6370600  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=56 +south +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## data source : in memory
## names       :   ph.pred,    ph.var 
## min values  : 4.7314890, 0.8808279 
## max values  :   8.64308,   2.18641
plot(z_krige)

Validation

A proper, detailed treatment of validation is beyond the scope of this workshop, but it’s an important step of any digital soil mapping workflow. Without proper validation of our predictions, further analysis of them may be done under a false impression of their quality (for better or for worse).

The goof function in the ithir package provides several commonly-used validation statistics. These statistics are:

Values of \(R^2\) close to 1.0 denote a strong linear relationship between predicted and observed values. Likewise, values of \(\rho_c\) close to 1.0 denote a strong degree of concordance of that relationship with the 1:1 line. Ideally we want both \(R^2\) and \(\rho_c\) to be close to 1.0—it is no use if \(R^2\) is large but the relationship does not correspond very well with the 1:1 line.

# Validation statistics for the IDW interpolation
goof(ph_val$ph, ph_val$z_idw)
##          R2 concordance      MSE     RMSE      bias
## 1 0.3761796   0.5318004 1.089724 1.043898 0.1441021
# Validation statistics for the kriging interpolation
goof(ph_val$ph, ph_val$z_krige)
##          R2 concordance       MSE      RMSE      bias
## 1 0.4428879   0.5907917 0.9681843 0.9839636 0.1008171
# Scatter plot of predicted versus observed values
library(ggplot2)
ggplot(ph_val, aes(ph, y = predicted, color = method)) +
  geom_abline(colour = "#CCCCCC", linetype = 2) +
  geom_point(aes(y = z_idw, col = "IDW")) +
  geom_point(aes(y = z_krige, col = "kriged")) +
  geom_smooth(aes(y = z_idw, col = "IDW"), method = lm, se = FALSE) +
  geom_smooth(aes(y = z_krige, col = "kriged"), method = lm, se = FALSE) +
  xlab("observed")

References

Burgess TM, Webster R 1980. Optimal interpolation and isarithmic mapping of soil properties: I. The semi-variogram and punctual kriging. Journal of Soil Science 31:315–331. doi:10.1111/j.1365-2389.1980.tb02084.x

Lin LI-K 1989. A concordance correlation coefficient to evaluate reproducibility. Biometrics 45:255–268. doi:10.2307/2532051

McBratney AB, Mendonça Santos M de L, Minasny B 2003. On digital soil mapping. Geoderma 117:3–52. doi:10.1016/S0016-7061(03)00223-4

Shepard D 1968. A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM national conference. New York. pp. 517–524. doi:10.1145/800186.810616

Tobler WR 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46:234–240. doi:10.2307/143141