Why does caret::predict() use parallel processing

I understand why parallel processing can be used during training only for XGB and cannot be used for other models. However, surprisingly I noticed that predict with xgb uses parallel processing too.

I noticed this by accident when I split my large 10M + data frame into pieces to predict on using foreach %dopar%. This caused some errors so to try to get around them I switched to sequential looping with %do% but noticed in the terminal that all processors where being used.

After some trial and error I found that caret::train() appears to use parallel processing where the model is XGBtree only (possibly others) but not on other models.

Surely predict could be done on parallel with any model, not just xgb?

Is it the default or expected behaviour of caret::predict() to use all available processors and is there a way to control this by e.g. switching it on or off?

Reproducible example:

library(tidyverse)
library(caret)
library(foreach)

# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree", 
                trControl = trainControl(method = "cv", classProbs = TRUE))

iris_big <- do.call(rbind, replicate(1000, iris, simplify = F))

nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 20
pieces <- split(iris_big, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)

# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop

  # get prediction
  preds <- pieces[[i]] %>% 
    mutate(xgb_prediction = predict(xgbFit, newdata = .))

  return(preds)
}

If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.

So predict seems to use parallel processing automatically depending on the type of model.

Is this correct? Is it controllable?

In this issue you can find the information you need:

https://github.com/dmlc/xgboost/issues/1345

As a summary, if you trained your model with parallelism, the predict method will also run with parallel processing. If you want to change the latter behaviour you must change a setting:

xgb.parameters(bst) <- list(nthread = 1)

An alternative, is to change an environment variable:

OMP_NUM_THREADS

And as you explain, this only happens for xgbTree

Why does caret::predict() use parallel processing

问题:

回答1:

收藏的人(0)

Why does caret::predict() use parallel processing

问题:

回答1:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮