Predictive modeling and machine learning in R with the caret package

The intention of this post is to highlight some of the great core features of caret for machine learning and point out some subtleties and tweaks that can help you take full advantage of the package. Along the way we also point out some common misunderstandings about modeling in general, and about caret specifically, based on our experience teaching a predictive modeling course with caret to hundreds of students.

Powerful and simplified modeling with caret

The R caret package will make your modeling life easier – guaranteed. caret allows you to test out different models with very little change to your code and throws in near-automatic resampling (cross validation or bootstrapping) and parameter tuning for free.

For example, below we show two nearly identical lines of code. Yet they run entirely different models. In the first, method = "lm" tells caret to run a traditional linear regression model. In the second line method = "rf" tells caret to run a random forest model using the same data. One tiny syntax change and you run an entirely new type of model.

library(caret)
# In R, the annual_pm~. means use annual_pm as the model response
# and use all other variables as predictors
lm1 <- train(annual_pm~., data = air, method = "lm")
rf1 <- train(annual_pm~., data = air, method = "rf")

But this is only a taste of the power of the caret package, created by Max Kuhn who now works for RStudio. Behind the scenes caret takes these lines of code and automatically resamples the models and conducts parameter tuning (more on this later). This enables you to build and compare models with very little overhead.

Streamlined and consistent syntax for more than 200 different models

As mentioned above, one of the most powerful aspects of the caret package is the consistent modeling syntax. By simply changing the method argument, you can easily cycle between, for example, running a linear model, a gradient boosting machine model and a LASSO model. In total, there are 233 different models available in caret. This blog post will focus on regression-type models (those with a continuous outcome), but classification models are also easily applied in caret using the same basic syntax.

It’s important to note that behind the scenes, caret is not actually performing the statistics/modeling – this job is left to individual R packages. For example, when you run the two lines of code above caret uses the lm() function from the stats package to compute the linear regression model and the randomForest() function from the randomForest package for the random forest model.
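If you’re curious which underlying package a given method relies on, you can ask caret directly. Below is a quick sketch using caret’s getModelInfo() helper (the methods queried are just examples):

# getModelInfo() returns caret's internal definition of a method;
# the library element lists the package(s) caret loads to fit it
library(caret)
getModelInfo("rf", regex = FALSE)[[1]]$library
getModelInfo("glmnet", regex = FALSE)[[1]]$library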

Run all models with the train() function

The beauty of having caret provide the vehicle to run these models is that you can use exactly the same function, train(), to run all of the models. The train() function accepts several caret-specific arguments and you can also provide arguments that get fed to the underlying modeling package/function.

Running the train() function creates a train object, essentially a list, with a ton of useful results. Here you see that the object caret creates has class train and that it has 23 attributes:

lm1 <- train(annual_pm~., data = air, method = "lm")
class(lm1)
## [1] "train"         "train.formula"
attributes(lm1)
## $names
##  [1] "method"       "modelInfo"    "modelType"    "results"     
##  [5] "pred"         "bestTune"     "call"         "dots"        
##  [9] "metric"       "control"      "finalModel"   "preProcess"  
## [13] "trainingData" "resample"     "resampledCM"  "perfNames"   
## [17] "maximize"     "yLimits"      "times"        "levels"      
## [21] "terms"        "coefnames"    "xlevels"     
## 
## $class
## [1] "train"         "train.formula"

Among these attributes is the finalModel which is the final model from the underlying package used by caret. So for the lm1 object we created above the final model is simply a linear regression using the full input dataset. For rf1 it’s the final random forest model.

lm1$finalModel
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Coefficients:
##    (Intercept)        latitude       longitude         Kvol100  
## -255.691658494     1.702674197    -2.612063511     0.001096599  
##        pop1000        oil6_500        truck300         road100  
##    0.000000143     0.008935559     0.458994434     0.084027878  
##       p_ind300      p_bldg1000        imperv50  
##   32.966423769     0.291760773     0.016474590
rf1$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 2.042014
##                     % Var explained: 58.5

Easy data splitting

A common approach in machine learning, assuming you have enough data, is to split your data into a training dataset for model building and testing dataset for final model testing. The model building, with the help of resampling, would be conducted only on the training dataset. Once a final model has been chosen you would do an additional test of performance by predicting the response in the testing set.

For simple random splitting (or stratified random splitting) you can make use of the createDataPartition() function. You feed the function your outcome, a proportion and specify whether you want the output as a list:

samp <- createDataPartition(air$annual_pm, p = 0.8, list = FALSE)
training <- air[samp,]
testing <- air[-samp,]

In this example, our outcome is annual air pollution levels (pm refers to particulate matter), a continuous variable, so the function randomly splits the data within quartiles of the outcome. If the first argument to createDataPartition() is categorical, caret will instead perform stratified random sampling on the variable levels.

The 0.8 specifies that we want the training dataset to contain 80% of the records, and list = FALSE says that we don’t want list output, we want a matrix.

The createDataPartition() function returns the index number of the observations to be included in the training dataset and this index number can be used to create your training and testing datasets as we’ve done above.

In this post we’re using a relatively small dataset so we will skip the data splitting in favor of resampling only but data splitting would normally be an important part of the process.
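For completeness, had we kept the split, the final check on the testing dataset would look something like the sketch below (the object names are ours; postResample() is caret’s helper for computing RMSE and R-squared from predictions):

# Fit the chosen model on the training data only
final_model <- train(annual_pm~., data = training, method = "lm")
# Predict the held-out testing data and summarize performance
test_pred <- predict(final_model, newdata = testing)
postResample(pred = test_pred, obs = testing$annual_pm)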

Realistic model estimates through built-in resampling

The ultimate goal of a predictive model is to predict data you’ve never seen before. Your modeling dataset should be representative of other datasets you’re likely to see but the goal is not to “predict” the data you have in hand, but to develop a model that will predict new datasets.

The example data we’re using in this post is an air pollution dataset we assembled from a variety of sources in NYC including the amazing New York City Community Air Survey data from the NYC Department of Health. In this dataset we have actual air quality measurements as well as candidate predictor variables on, for example, traffic or industrial land use nearby. We can use the relationship between air pollution and, for example, traffic to develop a model. But, as we mentioned before, we don’t need to predict the data we have in hand – we don’t need to predict at locations where we have actual measurements – we need a model that can predict at unmonitored locations. A model R-squared of 99% is not helpful if we can’t predict at new locations.

An R-squared from a model based on the full dataset is unrealistic

One approach to the development of a predictive model might be to simply identify good predictors and an appropriate model type based on your full dataset. You might then use R-squared or root mean square error to evaluate model performance and you might assume that the R-squared you see in your full model would be what you would see when you predict a new dataset.

Unfortunately, an R-squared based on modeling your full dataset is not a realistic measure of how well you’re likely to perform on a new dataset. Your R-squared from predictions on a new dataset is almost certainly going to be lower. This is due to the fact that with the R-squared based on the full model you are essentially re-predicting the same data you used to build the model in the first place.

An R-squared based on resampling is more realistic

To get a more realistic sense of how well you will predict a new dataset you can “pretend” some of your data is a new dataset and set that data aside. Then you can develop a model with the rest of the data, predict the withheld data, and see how you did. This would be similar to splitting into training and testing datasets.

Doing this once, though, will only test one possible data split. What if the data you set aside has all of the outliers or none of the outliers? Instead, you can do this over and over again. This process, known as resampling, will give a more realistic sense of how well your model will perform on new data.

Manually programming this in R would definitely be a chore. Fortunately, the caret package has built-in resampling. Remember this code?

library(caret)
lm1 <- train(annual_pm~., data = air, method = "lm")
rf1 <- train(annual_pm~., data = air, method = "rf")

If you type lm1$finalModel as we did above you will get the results from a model based on the full dataset. And if you look at the R-squared from this final model with:

# Unrealistic R-squared based on final model
summary(lm1$finalModel)$r.squared
## [1] 0.8325875

you’ll see that the overall R-squared (based on all the data) is 0.83. This is the unrealistic R-squared. It’s worth repeating: this is an R-squared from a model that used all the data, when what you really want is an R-squared based on applying your model to new data, and that is what resampling gives you.

To see the “realistic” R-squared based on resampling – based on repeatedly building the model and applying to new data – you can look directly at the train object (not the finalModel attribute). By typing lm1 you get a brief summary of the resampling in the train object:

# Realistic R-squared based on resampling
lm1
## Linear Regression 
## 
## 63 samples
## 10 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 63, 63, 63, 63, 63, 63, ... 
## Resampling results:
## 
##   RMSE      Rsquared 
##   1.343919  0.6863549
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

You can see that in this case the resampling (realistic) R-squared is lower than the “unrealistic” R-squared we saw from the final model (lm1$finalModel). Keep in mind two things:

  1. This dataset is small so the difference between unrealistic and realistic R-squared is higher than you would likely experience with a larger dataset.
  2. Although a lower R-squared can be disappointing, it is a more defensible and realistic measure of your model’s likely performance on new data.

The caret package does automatic resampling — but what is it doing exactly?

The output from lm1 above tells you that to compute the realistic R-squared and RMSE caret used bootstrap resampling with 25 repetitions – this is the default resampling approach in caret. The process caret used was:

  1. Randomly sample the data with replacement. This means that a single observation could be chosen more than once. The total size of the modeling dataset will be the same as the original dataset but some sites will not be included and some will be included more than once.
  2. Develop a model based on the randomly sampled sites only.
  3. Predict the withheld sites (also known as “out-of-bag” (OOB) observations) and compute an R-squared based on the predictions.
  4. Repeat the process (25 models are run) and average the R-squared values.
  5. When the resampling is done, caret runs a final model on the full dataset and stores it in finalModel. So, in this example, 25 + 1 models are run: 25 bootstrap models plus one final model fit to the full dataset.

You can extract the resample-based R-squared and RMSE with:

# This table has 1 row
lm1$results
##   intercept     RMSE  Rsquared    RMSESD RsquaredSD
## 1      TRUE 1.343919 0.6863549 0.2600146 0.07564842
# This table has 3 rows, we explain later
rf1$results
##   mtry     RMSE  Rsquared    RMSESD RsquaredSD
## 1    2 1.411025 0.6409591 0.2864182 0.09225469
## 2    6 1.384191 0.6364332 0.2601817 0.09386143
## 3   10 1.423573 0.6112362 0.2553959 0.11529844
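The tables above are averages over the resamples. If you want to see the individual resample estimates that went into those averages, they are stored in the resample element of the train object (shown here as a sketch using caret’s defaults):

# Performance on each individual bootstrap resample
# (for the chosen tuning parameter values)
head(lm1$resample)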

Bootstrap is the default resampling approach but you can easily use cross validation instead

By default caret will bootstrap your data 25 times. But this can be computationally expensive if you have a large dataset (25 + 1 model runs in this case). Although caret can run the resampling in parallel across multiple cores if you register a parallel backend (as sketched below), with a large dataset you might consider a 10-fold cross validation instead. Essentially this means splitting your data into 10 approximately equal chunks. You then develop a model based on 9 chunks and predict the 10th, then choose a different 9 and predict the held-out chunk, and so on.
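Here is a minimal sketch of registering a parallel backend with the doParallel package (we’re assuming doParallel is installed; the number of workers is arbitrary):

# Register a parallel backend; caret's resampling then runs across the workers
library(doParallel)
cl <- makePSOCKcluster(4)   # 4 worker processes, adjust for your machine
registerDoParallel(cl)

rf_par <- train(annual_pm~., data = air, method = "rf")

stopCluster(cl)             # release the workers when finished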

caret can take care of the dirty work of setting up the resampling with the help of the trainControl() function and the trControl argument. Careful here, the argument is trControl not trainControl. The trainControl() function allows you to set up several aspects of your model including the resampling method.

# Set up a 10-fold cross validation
tc <- trainControl(method = "cv", number = 10)
# Include the setup in your model
lm1_cv <- train(annual_pm~., data = air, method = "lm",
             trControl = tc) # here
lm1_cv
## Linear Regression 
## 
## 63 samples
## 10 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 55, 58, 55, 57, 56, 58, ... 
## Resampling results:
## 
##   RMSE     Rsquared 
##   1.06019  0.7767689
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
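Other resampling schemes are available through the same method argument of trainControl(). For example, here is a quick sketch of leave-one-out cross validation (feasible here only because the dataset is tiny):

# Leave-one-out cross validation: one model per observation,
# each predicting the single site that was held out
tc_loo <- trainControl(method = "LOOCV")
lm1_loo <- train(annual_pm~., data = air, method = "lm", trControl = tc_loo)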

Automated and semi-automated parameter tuning

If your statistical experience is limited to linear regression the concept of parameter tuning may be new to you. In this context a “parameter” refers to a mathematical term or other aspect of the model that has an influence on the model results but the “best” value for this parameter cannot be derived analytically (which is not the same as, for example, linear regression coefficients which can be computed with linear algebra). Examples might include:

  • In a support vector machine model there is a parameter called “cost”. This parameter essentially sets the penalty (the cost) of misclassifying an observation (in a classification setting). Setting a higher cost parameter in your model means you’re forcing the model to be more flexible and fit the observations in hand better. But this may come at the expense of poor predictions on new observations. The “best” value for cost cannot be determined analytically and will be different with different data so tuning is used to identify an appropriate value.
  • A random forest model creates many, many decision trees and averages them to create predictions. If each tree were computed using all possible predictor variables you would end up with a lot of nearly identical trees. The beauty of a random forest model, though, is that for each new tree the algorithm randomly selects a subset of your predictors which helps to de-correlate the trees. The number of randomly selected predictors in this process affects the performance of the model but the best value cannot be derived analytically and will be different with different data. In this case the number of randomly selected predictors (called mtry in caret) is a tuning parameter.

So, for parameters like cost and mtry, the way to determine the best value is to try several different possibilities and compare the results. To do this manually, though, would take a lot of coding. You would need to identify several different possibilities for the parameter, then you would need to run separate resampling on each and then compare the results. But again, caret to the rescue.
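If you’re not sure which tuning parameters a given method exposes, caret can tell you. A quick sketch using modelLookup() (the two methods below are just examples):

# List the tuning parameter(s) caret will tune for a method
modelLookup("rf")        # mtry for random forest
modelLookup("svmRadial") # sigma and C (the cost parameter) for a radial SVM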

We discussed above that when you run this line of code caret automatically performs resampling to estimate a realistic R-squared. But what we didn’t mention was that it is also performing parameter tuning automatically.

rf1 <- train(annual_pm~., data = air, method = "rf")

For linear regression when we typed lm1$results the table had just one line. This is because linear regression models do not have a tuning parameter and so a single round of resampling is all that is needed (and thus one row of results). But in the output below you can see three rows for the results table for random forest.

rf1$results
##   mtry     RMSE  Rsquared    RMSESD RsquaredSD
## 1    2 1.420308 0.6360171 0.3091350  0.1170776
## 2    6 1.422272 0.6077977 0.3106837  0.1193877
## 3   10 1.478885 0.5696262 0.3124973  0.1411958

For each different value of mtry, the tuning parameter, caret performs a separate bootstrap (or cross validation if this is what you set up). Since three values were tested and the default is a 25 rep bootstrap, this means that caret initially ran the model 75 times.

Once caret determines the best value of the tuning parameter, in this case mtry = 2 (lowest RMSE, the default selection metric for regression, and here also the highest R-squared), it runs the model one more time using this value and all of the observations, and this final fit is stored in finalModel.

A final note on the parameter tuning. For mtry above caret made an educated guess as to which values to test. There will be situations where you want to test more or different values. If this is the case you have two options:

  • You can use the tuneLength argument in train() which tells caret how many possible values to test. This is the lazy modeler’s choice: tell caret how many values to test but not which ones (see the quick sketch after this list).
  • You can use the tuneGrid argument in train() to tell caret exactly which values to test.
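For example, a minimal tuneLength sketch (the object name is ours; caret chooses which candidate mtry values to test):

# Ask caret to test 5 candidate values of mtry, chosen by caret
rf_tl <- train(annual_pm~., data = air, method = "rf", tuneLength = 5)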

So with mtry you might want to test 2, 4, 6, 8, 10 in which case you would use:

tg <- data.frame(mtry = seq(2, 10, by = 2))
rf1 <- train(annual_pm~., data = air, method = "rf", tuneGrid = tg)
rf1$results
##   mtry     RMSE  Rsquared    RMSESD RsquaredSD
## 1    2 1.476867 0.6582157 0.4241288  0.1136340
## 2    4 1.467160 0.6377244 0.4285745  0.1124856
## 3    6 1.470359 0.6252342 0.4171333  0.1114522
## 4    8 1.489850 0.6048688 0.4053431  0.1132557
## 5   10 1.521484 0.5832912 0.4071158  0.1178970

Easy comparison of models

The final amazing feature of caret we want to point out is how easy it is to compare different types of models. As we discussed above, caret makes it easy to test multiple different models with the use of the method argument. Once you’ve created the different models, you will want to compare their performance.

In caret you can compare models in a couple of ways. In the first, you can stack the models together in a list and then compare the results qualitatively using the function resamples(). You’ll see below that the default for regression-type models is to provide a comparison using R-squared and RMSE and that it gives you not just the average, but the min, max and IQR. If you want to use performance statistics other than R-squared or RMSE you can set caret to do this.
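For example, to rank models on a statistic such as mean absolute error you can supply a custom summary function through trainControl(). Below is a minimal sketch (the function and object names are ours; the signature is the one caret expects):

# A custom summary function; caret passes a data frame with
# columns obs (observed) and pred (predicted)
mae_summary <- function(data, lev = NULL, model = NULL) {
  c(MAE = mean(abs(data$obs - data$pred)))
}

tc_mae <- trainControl(method = "boot", number = 25,
                       summaryFunction = mae_summary)
lm1_mae <- train(annual_pm~., data = air, method = "lm",
                 trControl = tc_mae, metric = "MAE", maximize = FALSE)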

Below you can see that the lm model has the lowest RMSE and highest R-squared (it is the better of the two models).

model_list <- list(lm = lm1, rf = rf1)
res <- resamples(model_list)
summary(res)
## 
## Call:
## summary.resamples(object = res)
## 
## Models: lm, rf 
## Number of resamples: 25 
## 
## RMSE 
##         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## lm 1.0298477 1.169275 1.255261 1.343919 1.514009 1.946573    0
## rf 0.7969272 1.161100 1.394194 1.467160 1.757357 2.483570    0
## 
## Rsquared 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lm 0.5403744 0.6370496 0.6910634 0.6863549 0.7499508 0.7947196    0
## rf 0.4539398 0.5441761 0.6308888 0.6377244 0.6967119 0.8659140    0

Alternatively, if you want to quantitatively test if one model is better than another you can use compare_models(). In this case we have limited data so they are not statistically different.

compare_models(lm1, rf1)
## 
##  One Sample t-test
## 
## data:  x
## t = -1.306, df = 24, p-value = 0.2039
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.31799933  0.07151791
## sample estimates:
##  mean of x 
## -0.1232407

A “real-world” example: Air quality data from NYC

For nearly a decade we have consulted for the New York City Department of Health on their New York City Community Air Survey. We initially worked with them to develop and implement the air monitoring network, the biggest urban air monitoring network in the world, and have subsequently helped to analyze and model the data.

In this example, we will use the air quality monitoring data from the NYC DOH as the response – the variable we’re trying to predict. The raw data is made available by NYCDOH on the NYC open data website. For this post we computed annual averages of fine particulate matter for a single year.

The candidate predictor variables are traffic, population and other variables measured at the monitoring locations. These variables were computed outside of our work on NYCCAS using a variety of sources.

We will use three different model types: an automated stepwise regression, an Elastic Net model and a gradient boosting machine model. Note that automated stepwise regression is frowned on by statisticians these days in favor of shrinkage models like LASSO, Ridge and Elastic Net, but it is a commonly used approach so we’re including it for comparison. Also note that this modeling is overkill given that we only have 63 observations.

Here is what the data look like:

## Observations: 63
## Variables: 12
## $ latitude   <dbl> 40.51055, 40.54816, 40.58056, 40.58262, 40.58573, 4...
## $ longitude  <dbl> -74.23717, -74.18042, -74.15177, -73.82596, -73.968...
## $ site_id    <chr> "228", "952", "2269", "2496", "2596", "2818", "3156...
## $ annual_pm  <dbl> 7.691888, 8.043857, 7.511332, 7.281281, 8.067978, 8...
## $ Kvol100    <dbl> 14.35538, 234.78818, 48.79180, 67.05621, 25.96147, ...
## $ pop1000    <dbl> 7696.2348, 14797.4436, 11016.4623, 6590.6164, 49063...
## $ oil6_500   <int> 0, 0, 0, 2, 15, 0, 1, 0, 5, 0, 0, 1, 7, 0, 3, 0, 0,...
## $ truck300   <dbl> 0.2876064, 1.7033017, 0.5729620, 0.3187777, 0.00000...
## $ road100    <dbl> 0.5144213, 0.9322903, 0.3499070, 0.6696431, 0.55825...
## $ p_ind300   <dbl> 0.0000000000, 0.0000000000, 0.0000000000, 0.0000000...
## $ p_bldg1000 <dbl> 0.6452096, 0.7444151, 0.5804166, 0.5030070, 2.25981...
## $ imperv50   <dbl> 47.500000, 32.636364, 5.615385, 94.583333, 46.41666...

As an example, air quality vs truck traffic near the monitor

As an example, you can see that annual particulate matter is strongly related to the truck traffic within 300 meters of air monitoring locations. Keep in mind that high truck traffic would likely occur in areas with high total traffic and potentially a high density of other emission sources, so truck traffic is likely responsible for some of the extra pollution and a stand-in for other sources.

library(ggplot2)
ggplot(lur, aes(truck300, annual_pm)) + geom_point() + geom_smooth()

Set up the resampling approach

Here we will use 10-fold cross validation 5 times (repeated CV). Note again that this is overkill for a dataset with 63 observations.

library(caret)
# Set seed for reproducibility
set.seed(100)

# Set up the resampling, here repeated CV
tr <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

Do the modeling

Stepwise regression with caret

Here we run automated stepwise regression and look at the resampling results. You’ll note that there is no tuning parameter so the final results table has just one row.

# Note that trace is a parameter sent to the underlying modeling function
step_model <- train(annual_pm ~ ., data = dplyr::select(lur, -site_id), 
                    method = "lmStepAIC", trControl = tr, trace = FALSE)
step_model$results
##   parameter     RMSE  Rsquared    RMSESD RsquaredSD
## 1      none 1.082198 0.7814964 0.2918867  0.1714797

We can also review the variables and coefficients that the stepwise process included in the final model. In this case truck traffic, industrial land use, building area and impervious surfaces.

step_model$finalModel
## 
## Call:
## lm(formula = .outcome ~ truck300 + p_ind300 + p_bldg1000 + imperv50, 
##     data = dat)
## 
## Coefficients:
## (Intercept)     truck300     p_ind300   p_bldg1000     imperv50  
##     7.05077      0.69308     35.27333      0.31811      0.01227

Shrinkage/regularization models with caret

Here we run a shrinkage/regularization model (method = "glmnet") which has two tuning parameters alpha and lambda. If alpha is set to 0 this process runs a ridge model, if it’s set to 1 it runs a LASSO model and an alpha between 0 and 1 results in an elastic net model. For more details you can see the vignette on glmnet written by the creator Trevor Hastie.

In this case caret tests a range of possible alpha and lambda values and selects, for the final model, a blended model (elastic net).

Take note of the preProcess argument below: caret makes it super-easy to center/scale your data or apply other transformations like BoxCox.

# Note that we are scaling the predictors
glmnet_model <- train(annual_pm ~ ., data = dplyr::select(lur, -site_id), 
                    preProcess = c("center", "scale"),
                    method = "glmnet", trControl = tr)

# Caret tested many values of the tuning parameters so we limit output
library(dplyr) # needed for arrange() and the %>% pipe
arrange(glmnet_model$results, RMSE) %>% head
##   alpha      lambda     RMSE  Rsquared    RMSESD RsquaredSD
## 1  0.10 0.330925285 1.046882 0.8213086 0.3711204  0.1662474
## 2  1.00 0.033092528 1.057797 0.8151413 0.3165820  0.1661203
## 3  0.55 0.033092528 1.058651 0.8152392 0.3179481  0.1677805
## 4  0.10 0.033092528 1.067397 0.8131885 0.3243109  0.1708488
## 5  1.00 0.003309253 1.073726 0.8113261 0.3224757  0.1711788
## 6  0.55 0.003309253 1.073969 0.8109472 0.3231762  0.1722758

You can identify the best tuning values from caret using the bestTune attribute:

glmnet_model$bestTune
##   alpha    lambda
## 3   0.1 0.3309253
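If you wanted a pure LASSO (or ridge) model instead of letting caret blend, you could pin alpha with tuneGrid and tune only lambda. A quick sketch (the lambda sequence and object names are arbitrary):

# Pin alpha at 1 (LASSO) and tune only lambda
lasso_grid <- expand.grid(alpha = 1,
                          lambda = 10^seq(-3, 0, length.out = 20))
lasso_model <- train(annual_pm ~ ., data = dplyr::select(lur, -site_id),
                     preProcess = c("center", "scale"),
                     method = "glmnet", trControl = tr,
                     tuneGrid = lasso_grid)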

Gradient boosting machine model with caret

A gradient boosting model is similar to random forest in that it runs a ton of decision trees and uses these to compute an average. The value added by gradient boosting models is that the modeling is sequential and each step learns from the previous step. Meaning, with a random forest model all trees are run completely separately and they don’t learn from each other. With a GBM model, on the other hand, high residuals from one step are upweighted when they get fed into the next step and, as a result, each tree can “learn” from previous trees.

A GBM model, though, can be particularly computationally intensive. There are four different parameters to tune (random forest has just one) and, since the trees are created sequentially, some pieces of the computation cannot be parallelized.

Here we start by setting the possible values for the tuning parameters we want caret to test.

# We use expand grid to create a table of all possible combinations
tg <- expand.grid(shrinkage = seq(0.1, 1, by = 0.2), 
                  interaction.depth = c(1, 3, 7, 10),
                  n.minobsinnode = c(2, 5, 10),
                  n.trees = c(100, 300, 500, 1000))

# Verbose is a parameter sent to the underlying modeling function
gbm_model <- train(annual_pm ~ ., data = dplyr::select(lur, -site_id), 
                    method = "gbm", trControl = tr, tuneGrid =tg, verbose = FALSE)
arrange(gbm_model$results, RMSE) %>% head
##   shrinkage interaction.depth n.minobsinnode n.trees     RMSE  Rsquared
## 1       0.1                10              2     100 1.179316 0.7998527
## 2       0.1                 1              2     100 1.180998 0.7746926
## 3       0.1                10              2     300 1.201564 0.7959675
## 4       0.1                 7              2     100 1.201816 0.7846065
## 5       0.1                10              2     500 1.202396 0.7955054
## 6       0.1                10              2    1000 1.202401 0.7955336
##      RMSESD RsquaredSD
## 1 0.4519298  0.1700561
## 2 0.4731045  0.2003828
## 3 0.4530614  0.1703326
## 4 0.4132786  0.1822677
## 5 0.4532850  0.1707380
## 6 0.4533294  0.1707287
gbm_model$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 37     100                10       0.1              2
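As an aside, caret also provides a consistent interface to variable importance across model types through varImp(). A quick sketch for the GBM model (the plot is optional):

# Variable importance on caret's common 0-100 scale
gbm_imp <- varImp(gbm_model)
gbm_imp
plot(gbm_imp, top = 5)   # show the five most influential predictors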

Compare the models

Visually inspect the results table with resamples(). Here we see that the elastic net has the lowest RMSE and the highest R-squared.

mods <- resamples(list(lm = step_model, elasticnet = glmnet_model , gbm = gbm_model))
summary(mods)
## 
## Call:
## summary.resamples(object = mods)
## 
## Models: lm, elasticnet, gbm 
## Number of resamples: 50 
## 
## RMSE 
##                 Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## lm         0.4021561 0.8564585 1.137697 1.082198 1.290742 1.561227    0
## elasticnet 0.3536700 0.8391356 0.957866 1.046882 1.254815 1.864322    0
## gbm        0.4461168 0.8513675 1.079652 1.179316 1.438047 2.393900    0
## 
## Rsquared 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## lm         0.2977450 0.7101422 0.8248998 0.7814964 0.8964441 0.9855474
## elasticnet 0.2063195 0.7597394 0.8628661 0.8213086 0.9442554 0.9881593
## gbm        0.2696347 0.7187364 0.8644578 0.7998527 0.9286043 0.9863320
##            NA's
## lm            0
## elasticnet    0
## gbm           0

Quantitatively compare the elastic net and GBM models to check if the difference is statistically significant.

compare_models(glmnet_model, gbm_model)
## 
##  One Sample t-test
## 
## data:  x
## t = -1.756, df = 49, p-value = 0.08534
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.28399574  0.01912693
## sample estimates:
##  mean of x 
## -0.1324344

We can see that the difference is not statistically significant which may be due to the small sample size in the data.

Summary: the caret package is an incredibly useful modeling tool

Even if you only model using linear regression, caret can improve your workflow by simplifying data splitting, automating your resampling and providing a vehicle for comparing models. If you need to compare different model types, and particularly if you run models with tuning parameters, caret will save you an incredible amount of time by automating resampling over different settings of your tuning parameters and letting you use a consistent syntax across hundreds of model types. It will take some time to get up to speed with caret, but this time will pay serious dividends in future modeling workflow efficiency.

To learn more, much more, I highly recommend you purchase the book (yes, an actual book) by the creator of caret, Max Kuhn, Applied Predictive Modeling. This is not an idle recommendation — the book is excellent. It provides background on a wide range of modeling methods and sample code. I also highly recommend reading Introduction to Statistical Learning which I honestly consider one of the best books of any type I’ve ever read. For both books, the authors went out of their way to make the examples relevant and the writing clear and understandable.
