The caret package

In this course we will introduce you to machine learning using the caret package.

There is a huge number of different machine learning algorithms available in R. caret makes it easier to work with this multitude of options, and facilitates the day-to-day operations of building a model:

  • it provides a consistent, unified interface to the many machine learning methods available
  • it facilitates model tuning using resampling methods
  • it provides day-to-day helper functions for inference
  • it facilitates parallel computing

We can start by loading the caret package:

library(caret)

We will also load a demonstration dataset called Sacramento. This dataset contains house sale price data for 932 homes in Sacramento, California.

# load the demo dataset
data("Sacramento", package = "caret")

# Give it a simpler name, and 
# focus on a few variables
sc <- Sacramento[, 3:7]

# Have a quick look
head(sc)
##   beds baths sqft        type price
## 1    2     1  836 Residential 59222
## 2    3     1 1167 Residential 68212
## 3    2     1  796 Residential 68880
## 4    2     1  852 Residential 69307
## 5    2     1  797 Residential 81900
## 6    3     1 1122       Condo 89921

The glance function, from the broom package, is also useful for a quick summary of the data frame:

library(broom)

glance(sc)
##   nrow ncol complete.obs na.fraction
## 1  932    5          932           0

Splitting a dataset

As discussed in the machine learning theory section, the simplest form of validation is to set aside a holdout dataset.

About randomness and reproducibility

But first, a short word about reproducibility. There is a tension between two requirements:

  • sound science ought to be reproducible,
  • yet, to avoid systematic bias, we need some form of randomness in the way we split a dataset.

The function sample takes a random sample from a given vector of values:

sample(x = 1:10, size = 3, replace = FALSE)
## [1]  5  8 10
sample(x = 1:10, size = 3, replace = FALSE)
## [1] 2 8 5

Behind the scenes, it uses a random number generator. It turns out that generating truly random numbers is a rather hard task! Often, random number generators use a seed number (such as the system time) to generate a sequence of pseudo-random numbers.

A way to achieve reproducibility is to set that seed explicitly:

set.seed(29)
sample(x = 1:10, size = 3, replace = FALSE)
## [1]  1  3 10
set.seed(29)
sample(x = 1:10, size = 3, replace = FALSE)
## [1]  1  3 10

Random sampling

As we just learnt, the sample function can randomly pick a set of elements from a vector.

set.seed(29)

sample(
  x = 1:10, # vector of row indices
  size = 3, # absolute number of elements to pick
  replace = FALSE # sampling with no replacement
)
## [1]  1  3 10

We can use it to pick row indices from our data frame. The expression 1:nrow(sc) creates a vector containing the row indices:

head(
  1:nrow(sc)
)
## [1] 1 2 3 4 5 6
set.seed(29)

# Using a percentage of sampled rows
idx_train <- sample(
  x = 1:nrow(sc), 
  size = floor(0.7 * nrow(sc)), # 70% of the number of rows
  replace = FALSE
)

set.seed(29)

# Using an absolute number of sampled rows
idx_train <- sample(
  x = 1:nrow(sc), 
  size = 100, 
  replace = FALSE 
)

# The indices of the testing set are the complement of the training samples
idx_test  <- setdiff(1:nrow(sc), idx_train)
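
These index vectors can then be used to subset the data frame into training and testing sets, for example:

# Subset the data using the index vectors
training <- sc[idx_train, ]
testing  <- sc[idx_test, ]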

Stratified random sampling (based on the outcome)

Stratified random sampling is an alternative to purely random sampling. In this case, we create strata, or bins, in the values we want to predict, and then pick random values within these strata. This ensures that the range of values we want to predict is represented in our training dataset:

set.seed(29)

idx_train <- createDataPartition(
  y = log10(sc$price), # What we want to predict 
  p = 0.7, # The percentage of rows we want to pick
  list = FALSE # Simplifies the output of the function
)

Stratified random sampling (based on the predictors)

In digital soil mapping, such stratification is more commonly done on the predictors, for example when trying to pick sampling locations.

A popular approach is the conditioned Latin Hypercube sampling approach proposed by Minasny and McBratney (2006), and implemented in the clhs package:

library(clhs)

idx_train <- clhs(sc, size = 10, iter = 100, progress = FALSE)

idx_train
##  [1]  75 741 174 773 677 171 886 369 103 212

The train function

The train function is the workhorse of the caret package. It can be used to:

  • evaluate a range of possible values for the model’s tuning parameters (if any)
  • choose the optimal set of parameters (if any)
  • fit the model using this optimal set of parameters
  • estimate the performance of that final model on the training set

There are two ways to specify a model:

The formula interface

Formulas are a special type of object in R:

my_formula <- price ~ beds + baths
my_formula
## price ~ beds + baths
class(my_formula)
## [1] "formula"

It is used in conjunction with the data option:

  • The formula specifies relationships amongst the variables
    • the outcome and predictors are separated by ~
    • the special character . is a shorthand for “all other variables”
    • you can apply functions within the formula: log(price) ~ beds + baths
  • The dataset on which this formula applies is specified in the data option

fit <- train(
  form = outcome ~ .,
  data = my_data_frame
)

Let’s try a real formula on our Sacramento dataset.

To keep computing time low, let’s select a small number of rows for training, and split our dataset into training and testing sets:

set.seed(29)

idx_train <- sample(1:nrow(sc), size = 50)

training <- sc[ idx_train, ]
testing  <- sc[ -idx_train, ]

fit <- train(
  form = log(price) ~ .,
  data = training
)

print(fit)
## Random Forest 
## 
## 50 samples
##  4 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 50, 50, 50, 50, 50, 50, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared 
##   2     0.4340780  0.5264981
##   3     0.4283400  0.5346191
##   5     0.4322348  0.5270545
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 3.

The matrix interface

The other way to specify a model is through the so-called “matrix interface”. It specifies “x” and “y” elements. In machine learning, y = f(x):

  • x is the matrix (or data frame) of predictors
  • y is the vector of outcome values

fit <- train(
  x = my_data_frame[, -idx_outcome], # we remove the outcome column from the predictors!
  y = my_data_frame[[idx_outcome]]   # ... and use that column as the outcome
)

In our example, the outcome, price, is the fifth column:

fit <- train(
  x = training[, -5],
  y = log(training$price)
)

print(fit)
## Random Forest 
## 
## 50 samples
##  4 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 50, 50, 50, 50, 50, 50, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared 
##   2     0.4083810  0.5331677
##   3     0.4103087  0.5274945
##   4     0.4131155  0.5223090
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 2.

Tweaking the train function

So far we have only used the default settings of the train function:

  • it fits a Random Forest model
  • it tests 3 different values of the mtry parameter of the Random Forest algorithm
  • it uses a resampling technique called Bootstrapping, repeated 25 times, to pick the best value of mtry
  • it uses RMSE to pick the best parameter value

All of this can obviously be changed.

Choosing a prediction method

The first thing we can change is the prediction method. The default is Random Forest (method = "rf"), but there is a great variety of models currently available.

  • The models can be listed using the modelLookup function (see the example just after this list),
  • They are also listed online.
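
For example, modelLookup can be used to check which tuning parameters a given method exposes:

# List the tuning parameter(s) of a given method,
# here the default Random Forest model ("rf")
modelLookup("rf")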

Let’s try two different models: CART (method = "rpart") and the Generalised Boosted Regression Model (GBM, method = "gbm"):

fit_rpart <- train(
  x = training[, -5],
  y = log(training$price),
  method = "rpart"
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
print(fit_rpart)
## CART 
## 
## 50 samples
##  4 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 50, 50, 50, 50, 50, 50, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared 
##   0.01658818  0.4290840  0.5105141
##   0.22664345  0.4931998  0.3627570
##   0.47838772  0.5562199  0.2740922
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was cp = 0.01658818.
fit_gbm <- train(
  x = training[, -5],
  y = log(training$price),
  method = "gbm",
  verbose = FALSE # passed to the GBM model
)

print(fit_gbm)
## Stochastic Gradient Boosting 
## 
## 50 samples
##  4 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 50, 50, 50, 50, 50, 50, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared 
##   1                   50      0.4722140  0.4450936
##   1                  100      0.4695985  0.4434160
##   1                  150      0.4730262  0.4384338
##   2                   50      0.4703908  0.4414845
##   2                  100      0.4715842  0.4364455
##   2                  150      0.4751569  0.4326251
##   3                   50      0.4676533  0.4471762
##   3                  100      0.4692567  0.4488542
##   3                  150      0.4709204  0.4428632
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 50, interaction.depth
##  = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Resampling methods

The way the train function operates is mainly controlled through a dedicated function: trainControl. Let’s set up a suite of training parameters so as to use repeated cross-validation rather than bootstrapping:

fit_control <- trainControl(
  method = "repeatedcv",
  number = 10, # number of folds
  repeats = 15 # number of repeats
)

If you print it, you’ll find that this is essentially just a list of parameters.
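
For example, we can inspect a few of the elements we have just set (the object is a plain R list, so the usual $ accessor works):

# The trainControl output is essentially a plain list of settings
class(fit_control)

# Inspect the elements we have just set
fit_control$method
fit_control$number
fit_control$repeats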

The output of trainControl can then be fed to the train function via its trControl option:

fit_rf <- train(
  x = training[, -5],
  y = log(training$price),
  method = "rf",
  trControl = fit_control
)

fit_rf
## Random Forest 
## 
## 50 samples
##  4 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 45, 44, 46, 46, 45, 46, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared 
##   2     0.3825629  0.6253238
##   3     0.3795611  0.6231092
##   4     0.3798342  0.6201493
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 3.

Estimating model parameters

The resampling strategy is used, amongst other reasons, to estimate the best set of tuning parameters. train will try to guess the number and values of the parameters it should try, but it is often better to specify them manually.

There are two ways to do this:

  • tuneLength allows you to specify the length of the parameter grid:

fit_gbm <- train(
  x = training[, -5],
  y = log(training$price),
  method = "gbm",
  trControl = fit_control,
  tuneLength = 2, # Just try 2 values for each param
  verbose = FALSE
)

fit_gbm
## Stochastic Gradient Boosting 
## 
## 50 samples
##  4 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 15 times) 
## Summary of sample sizes: 43, 45, 44, 45, 46, 46, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared 
##   1                   50      0.4280977  0.5689799
##   1                  100      0.4256632  0.5758940
##   2                   50      0.4300128  0.5714795
##   2                  100      0.4250961  0.5797746
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
plot(fit_gbm)

  • tuneGrid allows you to specify your own grid of parameters. You can create it using the expand.grid function:

cubist_tune_grid <- expand.grid(
  committees = 1:3,
  neighbors = c(1, 5)
)

cubist_tune_grid
##   committees neighbors
## 1          1         1
## 2          2         1
## 3          3         1
## 4          1         5
## 5          2         5
## 6          3         5
fit_cubist <- train(
  x = training[, -5],
  y = log(training$price),
  method = "cubist",
  trControl = fit_control,
  tuneGrid = cubist_tune_grid
)

fit_cubist
## Cubist 
## 
## 50 samples
##  4 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 43, 46, 45, 45, 45, 46, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared 
##   1           1          0.3683400  0.6716137
##   1           5          0.3593615  0.6860659
##   2           1          0.3726992  0.6630470
##   2           5          0.3478882  0.6964855
##   3           1          0.3673187  0.6707442
##   3           5          0.3540748  0.6948764
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were committees = 2 and neighbors = 5.
plot(fit_cubist)

Compare models

Since we can change the behaviour of train, we can create several models and compare them.

To do so, we will first extract the predictions of the trained models on the training and the testing sets:

library(randomForest)
library(Cubist)
library(rpart)
library(gbm)

predictions <- extractPrediction(
  list(fit_rf, fit_cubist, fit_rpart, fit_gbm),
  testX = testing[, -5],
  testY = log(testing$price)
)

summary(predictions)
##       obs             pred          model         dataType   
##  Min.   :10.31   Min.   :11.31   cubist:932   Test    :3528  
##  1st Qu.:11.96   1st Qu.:11.91   gbm   :932   Training: 200  
##  Median :12.30   Median :12.31   rf    :932                  
##  Mean   :12.28   Mean   :12.27   rpart :932                  
##  3rd Qu.:12.63   3rd Qu.:12.49                               
##  Max.   :13.69   Max.   :13.72                               
##      object   
##  Object1:932  
##  Object2:932  
##  Object3:932  
##  Object4:932  
##               
## 

There is then a built-in function to create a plot:

plotObsVsPred(predictions)

For those who enjoy ggplot2:

library(ggplot2)
library(hrbrthemes)

# Attach the dataset to a plot
ggplot(data = predictions) +
  # First layer: the points
  geom_point(
    aes(x = pred, y = obs),
    # A bit of bling: colour and transparency
    colour = "royalblue", 
    alpha = 0.5
  ) +
  # The 1:1 line
  geom_abline(
    slope = 1, 
    intercept = 0, 
    linetype = 2, 
    colour = "darkred"
  ) +
  # 2-D faceting based on training/testing and model
  facet_grid(dataType ~ model) +
  # Labels
  labs(
    x = "Predicted", 
    y = "Observed", 
    title = "Comparison of four prediction models", 
    subtitle = "Results in log-transformed space"
  ) +
  # A nice theme from the hrbrthemes package
  theme_ipsum()

Calculate the performance of the trained prediction models

The resampling results can be directly accessed from the output of the train function:

fit_rf$results
##   mtry      RMSE  Rsquared    RMSESD RsquaredSD
## 1    2 0.3825629 0.6253238 0.1435205  0.2846817
## 2    3 0.3795611 0.6231092 0.1575815  0.2958324
## 3    4 0.3798342 0.6201493 0.1661327  0.2960415

The predictions extracted previously can be used to compute the performance of the final models on both the training and the testing datasets.

The caret package provides built-in functions to compute R-squared and RMSE (named R2 and RMSE):

library(dplyr)

predictions %>% 
  group_by(model, dataType) %>% 
  dplyr::summarise(
    rmse = RMSE(pred = pred, obs = obs),
    rsq = R2(pred = pred, obs = obs)
  )
## # A tibble: 8 x 4
## # Groups:   model [?]
##    model dataType      rmse       rsq
##   <fctr>   <fctr>     <dbl>     <dbl>
## 1 cubist     Test 0.3622657 0.5440174
## 2 cubist Training 0.2725183 0.7852327
## 3    gbm     Test 0.3748376 0.4944235
## 4    gbm Training 0.3713505 0.5984901
## 5     rf     Test 0.3809308 0.5221704
## 6     rf Training 0.2209736 0.8651770
## 7  rpart     Test 0.3997982 0.4937117
## 8  rpart Training 0.3154243 0.7050312

Alternatively, we can use the goof function from the ithir package, which returns values for the most commonly used performance indicators.

Here it is used in conjunction with the do verb, which applies a given function to a group of data:

library(ithir)

predictions %>% 
  group_by(model, dataType) %>% 
  do(
    goof(
      observed = .$obs, 
      predicted = .$pred
    )
  )
## # A tibble: 8 x 7
## # Groups:   model, dataType [8]
##    model dataType        R2 concordance        MSE      RMSE          bias
##   <fctr>   <fctr>     <dbl>       <dbl>      <dbl>     <dbl>         <dbl>
## 1 cubist     Test 0.5434992   0.7293982 0.13123642 0.3622657 -3.593922e-02
## 2 cubist Training 0.7807584   0.8500353 0.07426622 0.2725183 -3.378735e-04
## 3    gbm     Test 0.4938490   0.6738946 0.14050321 0.3748376  4.575392e-02
## 4    gbm Training 0.5901254   0.7080679 0.13790120 0.3713505  1.328794e-03
## 5     rf     Test 0.5216274   0.7179621 0.14510824 0.3809308 -4.338545e-02
## 6     rf Training 0.8623682   0.8961210 0.04882932 0.2209736 -7.848823e-03
## 7  rpart     Test 0.4931363   0.7008918 0.15983857 0.3997982 -2.572342e-02
## 8  rpart Training 0.6988860   0.8104609 0.09949251 0.3154243 -1.776357e-15

Predict values

Predicted values can be generated from the final model using the predict function:

preds <- predict(
  fit_rf,
  testing
)

head(preds)
##        1        3        4        5        6        7 
## 11.39979 11.48428 11.39979 11.48428 11.52607 11.59391
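
Note that, because the model was trained on log(price), these predictions are on the log scale. As a quick sketch, they could be back-transformed with exp and compared to the observed prices using caret’s RMSE helper:

# Back-transform the predictions from the log scale to prices
preds_price <- exp(preds)

# RMSE in the original price units, on the testing set
RMSE(pred = preds_price, obs = testing$price)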

For more information