The caret package

In this course we will introduce you to machine learning using the caret package. There is a huge number of different machine learning algorithms available in R. caret makes it easier to work with this multitude of options, and facilitates the day-to-day operations of building a model, such as data splitting, pre-processing, model tuning using resampling, and estimating variable importance.
We can start by loading the caret package:
library(caret)
We will also load a demonstration dataset called Sacramento. This dataset contains house sale price data for 932 homes in Sacramento, California.
# load the demo dataset
data("Sacramento", package = "caret")
# Give it a simpler name, and
# focus on a few variables
sc <- Sacramento[, 3:7]
# Have a quick look
head(sc)
## beds baths sqft type price
## 1 2 1 836 Residential 59222
## 2 3 1 1167 Residential 68212
## 3 2 1 796 Residential 68880
## 4 2 1 852 Residential 69307
## 5 2 1 797 Residential 81900
## 6 3 1 1122 Condo 89921
The function glance, from the broom package, is also useful:
library(broom)
glance(sc)
## nrow ncol complete.obs na.fraction
## 1 932 5 932 0
As discussed in the machine learning theory section, the simplest form of validation is to set up a holdout dataset.
But first, a short word about reproducibility: there is a tension between the randomness required by sampling methods and the reproducibility of our analysis.
The function sample takes a random sample from a given vector of values:
sample(x = 1:10, size = 3, replace = FALSE)
## [1] 5 8 10
sample(x = 1:10, size = 3, replace = FALSE)
## [1] 2 8 5
Behind the scenes, it uses a random number generator. It turns out that generating truly random numbers is a rather hard task! Often, random number generators use a seed number (such as the system time) to generate a sequence of pseudo-random numbers.
A way to achieve reproducibility is to set that seed explicitly:
set.seed(29)
sample(x = 1:10, size = 3, replace = FALSE)
## [1] 1 3 10
set.seed(29)
sample(x = 1:10, size = 3, replace = FALSE)
## [1] 1 3 10
As we just learnt, the sample function can randomly pick a set of elements from a vector.
set.seed(29)
sample(
x = 1:10, # vector of values to sample from
size = 3, # absolute number of elements to pick
replace = FALSE # sampling with no replacement
)
## [1] 1 3 10
We can use it to pick the indices of rows from our data frame. The expression 1:nrow(sc) creates a vector containing the indices of its rows:
head(
1:nrow(sc)
)
## [1] 1 2 3 4 5 6
set.seed(29)
# Using a percentage of sampled rows
idx_train <- sample(
x = 1:nrow(sc),
size = 0.7 * nrow(sc), # 70% of the number of rows
replace = FALSE
)
set.seed(29)
# Using an absolute number of sampled rows
idx_train <- sample(
x = 1:nrow(sc),
size = 100,
replace = FALSE
)
# The indices of the testing set are the complement of the training samples
idx_test <- setdiff(1:nrow(sc), idx_train)
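These index vectors can then be used to split the data frame into the two sets. A minimal sketch (the names training and testing are just illustrative):
# Build the two sets by subsetting rows
training <- sc[idx_train, ]
testing <- sc[idx_test, ]
# Sanity check: the two sets partition the data
nrow(training) + nrow(testing) == nrow(sc)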
Stratified random sampling is an alternative to purely random sampling. In this case, we create strata, or bins, in the values we want to predict, and then pick random values within these strata. This ensures that the range of values we want to predict is represented in our training dataset:
set.seed(29)
idx_train <- createDataPartition(
y = log10(sc$price), # What we want to predict
p = 0.7, # The percentage of rows we want to pick
list = FALSE # Simplifies the output of the function
)
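We can quickly check that the stratified sample spans the range of the outcome, for instance by comparing quantiles (a small check, not part of the original example):
# Compare the outcome distribution in the full dataset
# with its distribution in the stratified training set
quantile(log10(sc$price))
quantile(log10(sc$price[idx_train]))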
More commonly in digital soil mapping, such stratification is done on the predictors, when trying to pick sampling locations. A popular approach is the conditioned Latin Hypercube sampling approach proposed by Minasny and McBratney (2006), and implemented in the clhs package:
library(clhs)
idx_train <- clhs(sc, size = 10, iter = 100, progress = FALSE)
idx_train
## [1] 75 741 174 773 677 171 886 369 103 212
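As before, the returned indices can be used to subset the data frame (a trivial sketch):
# The clhs-selected rows form the training set
head(sc[idx_train, ])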
The train function

The train function is the workhorse of the caret package. It can be used to:

- evaluate, using resampling, the effect of model tuning parameters on performance,
- choose the "optimal" model across these parameters,
- estimate model performance from a training set.

There are two ways to specify a model:

- using a formula,
- using the "matrix interface", with separate x and y elements.
Formulas are a special type of object in R:
my_formula <- price ~ beds + baths
my_formula
## price ~ beds + baths
class(my_formula)
## [1] "formula"
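Formulas are understood by most modelling functions in R, not just train. For instance, here is a quick illustration using base R's lm (not part of the caret workflow):
# The same formula object works with a base R linear model
lm(my_formula, data = sc)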
Within train, a formula is used in conjunction with the data option:

- the ~ symbol separates the outcome (on its left) from the predictors (on its right),
- . is a shorthand for "all other variables",
- functions can be applied to variables within the formula, e.g. log(price) ~ beds + baths,
- the data frame in which these variables live is passed through the data option:

fit <- train(
form = outcome ~ .,
data = my_data_frame
)
Let’s try a real formula on our Sacramento dataset.
To keep computing time low, let’s select a small number of rows for training, and split our dataset into training and testing sets:
set.seed(29)
idx_train <- sample(1:nrow(sc), size = 50)
training <- sc[ idx_train, ]
testing <- sc[ -idx_train, ]
fit <- train(
form = log(price) ~ .,
data = training
)
print(fit)
## Random Forest
##
## 50 samples
## 4 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 50, 50, 50, 50, 50, 50, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.4340780 0.5264981
## 3 0.4283400 0.5346191
## 5 0.4322348 0.5270545
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
The other way to specify a model is through the so-called "matrix interface". It specifies "x" and "y" elements. In machine learning, y = f(x):
fit <- train(
x = my_data_frame[, -idx_outcome], # we remove the outcome variable from the predictors!
y = my_data_frame[[idx_outcome]] # we extract the outcome column by its index
)
In our example, the outcome, price, is the fifth column:
fit <- train(
x = training[, -5],
y = log(training$price)
)
print(fit)
## Random Forest
##
## 50 samples
## 4 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 50, 50, 50, 50, 50, 50, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.4083810 0.5331677
## 3 0.4103087 0.5274945
## 4 0.4131155 0.5223090
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
Customising the train function

So far we have only used the default settings of the train function:

- the default model is the Random Forest algorithm,
- resampling is done by bootstrapping (25 repetitions),
- a default grid of three values is tried for the mtry parameter of the Random Forest algorithm,
- the value of mtry giving the lowest RMSE is retained for the final model.

All of this can obviously be changed.
The first thing we can change is the prediction method. The default is Random Forest (method = "rf"), but a great variety of models is currently available. The available models and their tuning parameters can be looked up using the modelLookup function.
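For example, we can inspect the tuning parameters registered for the CART model (a quick sketch; modelLookup is part of caret):
# Look up the tuning parameter(s) of the CART model
modelLookup("rpart")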
function,Let’s try two different models, the CART (method = "rpart"
), and the Generalised Boosted Regression model (GBM, `method = “gbm”):
fit_rpart <- train(
x = training[, -5],
y = log(training$price),
method = "rpart"
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
print(fit_rpart)
## CART
##
## 50 samples
## 4 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 50, 50, 50, 50, 50, 50, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared
## 0.01658818 0.4290840 0.5105141
## 0.22664345 0.4931998 0.3627570
## 0.47838772 0.5562199 0.2740922
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01658818.
fit_gbm <- train(
x = training[, -5],
y = log(training$price),
method = "gbm",
verbose = FALSE # passed to the GBM model
)
print(fit_gbm)
## Stochastic Gradient Boosting
##
## 50 samples
## 4 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 50, 50, 50, 50, 50, 50, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared
## 1 50 0.4722140 0.4450936
## 1 100 0.4695985 0.4434160
## 1 150 0.4730262 0.4384338
## 2 50 0.4703908 0.4414845
## 2 100 0.4715842 0.4364455
## 2 150 0.4751569 0.4326251
## 3 50 0.4676533 0.4471762
## 3 100 0.4692567 0.4488542
## 3 150 0.4709204 0.4428632
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 50, interaction.depth
## = 3, shrinkage = 0.1 and n.minobsinnode = 10.
Finer control over the way the train function operates is available through a dedicated function: trainControl. Let's set up a suite of training parameters so as to use repeated cross-validation, rather than bootstrapping:
fit_control <- trainControl(
method = "repeatedcv",
number = 10, # number of folds
repeats = 15 # number of repeats
)
If you print it, you’ll find that this is essentially just a list of parameters.
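For example (a quick check; the elements of the list mirror the arguments we passed):
# Peek at a few elements of the control list
fit_control$method # "repeatedcv"
fit_control$number # 10
fit_control$repeats # 15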
The output of trainControl can then be fed to the train function via its trControl option:
fit_rf <- train(
x = training[, -5],
y = log(training$price),
method = "rf",
trControl = fit_control
)
fit_rf
## Random Forest
##
## 50 samples
## 4 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 45, 44, 46, 46, 45, 46, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.3825629 0.6253238
## 3 0.3795611 0.6231092
## 4 0.3798342 0.6201493
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
The resampling strategy is used, amongst other things, to estimate the best set of tuning parameters. train will try to guess the number and values of the parameters it should try, but it is always better to specify them manually. There are two ways to do this:
First, tuneLength allows you to specify the length of the parameter grid:

fit_gbm <- train(
x = training[, -5],
y = log(training$price),
method = "gbm",
trControl = fit_control,
tuneLength = 2, # Just try 2 values for each param
verbose = FALSE
)
fit_gbm
## Stochastic Gradient Boosting
##
## 50 samples
## 4 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 15 times)
## Summary of sample sizes: 43, 45, 44, 45, 46, 46, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared
## 1 50 0.4280977 0.5689799
## 1 100 0.4256632 0.5758940
## 2 50 0.4300128 0.5714795
## 2 100 0.4250961 0.5797746
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
plot(fit_gbm)
Second, tuneGrid allows you to specify your own grid of parameters. You can create it using the expand.grid function:

cubist_tune_grid <- expand.grid(
committees = 1:3,
neighbors = c(1, 5)
)
cubist_tune_grid
## committees neighbors
## 1 1 1
## 2 2 1
## 3 3 1
## 4 1 5
## 5 2 5
## 6 3 5
fit_cubist <- train(
x = training[, -5],
y = log(training$price),
method = "cubist",
trControl = fit_control,
tuneGrid = cubist_tune_grid
)
fit_cubist
## Cubist
##
## 50 samples
## 4 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 43, 46, 45, 45, 45, 46, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared
## 1 1 0.3683400 0.6716137
## 1 5 0.3593615 0.6860659
## 2 1 0.3726992 0.6630470
## 2 5 0.3478882 0.6964855
## 3 1 0.3673187 0.6707442
## 3 5 0.3540748 0.6948764
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 2 and neighbors = 5.
plot(fit_cubist)
Since we can change the behaviour of train, we can create several models and compare them.
To do so, we will first extract the predictions of the trained models on both the training and the testing sets:
library(randomForest)
library(Cubist)
library(rpart)
library(gbm)
predictions <- extractPrediction(
list(fit_rf, fit_cubist, fit_rpart, fit_gbm),
testX = testing[, -5],
testY = log(testing$price)
)
summary(predictions)
## obs pred model dataType
## Min. :10.31 Min. :11.31 cubist:932 Test :3528
## 1st Qu.:11.96 1st Qu.:11.91 gbm :932 Training: 200
## Median :12.30 Median :12.31 rf :932
## Mean :12.28 Mean :12.27 rpart :932
## 3rd Qu.:12.63 3rd Qu.:12.49
## Max. :13.69 Max. :13.72
## object
## Object1:932
## Object2:932
## Object3:932
## Object4:932
##
##
There is then a built-in function to create a plot:
plotObsVsPred(predictions)
For those who prefer ggplot2, the same plot can be re-created manually:
library(ggplot2)
library(hrbrthemes)
# Attach the dataset to a plot
ggplot(data = predictions) +
# First layer: the points
geom_point(
aes(x = pred, y = obs),
# A bit of bling: colour and transparency
colour = "royalblue",
alpha = 0.5
) +
# The 1:1 line
geom_abline(
slope = 1,
intercept = 0,
linetype = 2,
colour = "darkred"
) +
# 2-D faceting based on training/testing and model
facet_grid(dataType ~ model) +
# Labels
labs(
x = "Predicted",
y = "Observed",
title = "Comparison of four prediction models",
subtitle = "Results in log-transformed space"
) +
# A nice theme from the hrbrthemes package
theme_ipsum()
The resampling results can be directly accessed from the output of the train function:
fit_rf$results
## mtry RMSE Rsquared RMSESD RsquaredSD
## 1 2 0.3825629 0.6253238 0.1435205 0.2846817
## 2 3 0.3795611 0.6231092 0.1575815 0.2958324
## 3 4 0.3798342 0.6201493 0.1661327 0.2960415
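The tuning parameter values retained for the final model are stored in the object as well (bestTune is a standard element of objects returned by train):
# The parameter set selected during resampling
fit_rf$bestTune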
The predictions extracted previously can be used to compute the performance of the final models on both the training and the testing datasets.
The caret package provides built-in functions to compute the R-squared and the RMSE (named R2 and RMSE):
library(dplyr)
predictions %>%
group_by(model, dataType) %>%
dplyr::summarise(
rmse = RMSE(pred = pred, obs = obs),
rsq = R2(pred = pred, obs = obs)
)
## # A tibble: 8 x 4
## # Groups: model [?]
## model dataType rmse rsq
## <fctr> <fctr> <dbl> <dbl>
## 1 cubist Test 0.3622657 0.5440174
## 2 cubist Training 0.2725183 0.7852327
## 3 gbm Test 0.3748376 0.4944235
## 4 gbm Training 0.3713505 0.5984901
## 5 rf Test 0.3809308 0.5221704
## 6 rf Training 0.2209736 0.8651770
## 7 rpart Test 0.3997982 0.4937117
## 8 rpart Training 0.3154243 0.7050312
Alternatively, we can use the goof function from the ithir package. It returns values for the most commonly used performance indicators.
Here it is used in conjunction with the do verb, which applies a given function to each group of data:
library(ithir)
predictions %>%
group_by(model, dataType) %>%
do(
goof(
observed = .$obs,
predicted = .$pred
)
)
## # A tibble: 8 x 7
## # Groups: model, dataType [8]
## model dataType R2 concordance MSE RMSE bias
## <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 cubist Test 0.5434992 0.7293982 0.13123642 0.3622657 -3.593922e-02
## 2 cubist Training 0.7807584 0.8500353 0.07426622 0.2725183 -3.378735e-04
## 3 gbm Test 0.4938490 0.6738946 0.14050321 0.3748376 4.575392e-02
## 4 gbm Training 0.5901254 0.7080679 0.13790120 0.3713505 1.328794e-03
## 5 rf Test 0.5216274 0.7179621 0.14510824 0.3809308 -4.338545e-02
## 6 rf Training 0.8623682 0.8961210 0.04882932 0.2209736 -7.848823e-03
## 7 rpart Test 0.4931363 0.7008918 0.15983857 0.3997982 -2.572342e-02
## 8 rpart Training 0.6988860 0.8104609 0.09949251 0.3154243 -1.776357e-15
Predicted values can be generated from the final model using the predict function:
preds <- predict(
fit_rf,
testing
)
head(preds)
## 1 3 4 5 6 7
## 11.39979 11.48428 11.39979 11.48428 11.52607 11.59391
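These predictions can then be compared against the observed values of the holdout set, for example using the RMSE and R2 helpers introduced earlier (a minimal sketch):
# Performance of the final Random Forest model on the testing set
RMSE(pred = preds, obs = log(testing$price))
R2(pred = preds, obs = log(testing$price))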