How do you generate a prediction interval from a regression tree that is fit using rpart?
It is my understanding that a regression tree models the response conditional on the mean of the leaf nodes. I don't know how to get the variance for a leaf node from the model, but what I would like to do is simulate using the mean and variance for a leaf node to obtain a prediction interval.
Predict.rpart() doesn't give an option for interval.
Example: I fit a tree with iris data, but predict doesn't have an option, "interval"
> r1 <- rpart(Sepal.Length ~ ., cp = 0.001, data = iris[1:nrow(iris)-1,])
> predict(r1,newdata=iris[nrow(iris),],type = "interval")
Error in match.arg(type) :
'arg' should be one of “vector”, “prob”, “class”, “matrix”
It is not clear to me what confidence intervals would mean for regression trees as those are not classical statistical models like linear models. And I see mainly two uses: characterising the certainty of your tree or characterizing the precision of the prediction for each leaf of the tree. Hereafter an answer for each of these possibilities.
Characterizing the certainty of your tree
If you are looking for a confidence value for a split node, then party
provides that directly as it uses permutation tests and statistically determine which variables are most important and the p-value attached to each split. A significant superiority of party
's ctree
function over rpart
as explained here.
Confidence intervals for set leafs of the regression tree
Third, if you are looking for a confidence of interval for the value in each leaf, then the [0.025,0.975] quantiles interval for the observations in the leaf is most likely what you are looking for. The default plots in party
takes a similar approach when displaying boxplots for the output value for each leaf:
library("party")
r2 <- ctree(Sepal.Length ~ .,data=iris)
plot(r2)
Retrieving the corresponding intervals can simply be done by:
iris$leaf <- predict(r2, type="node")
CIleaf <- aggregate(iris$Sepal.Length,
by=list(leaf=iris$leaf),
quantile,
prob=c(0.025, 0.25, 0.75, 0.975))
And it's easy to visualize:
plot(as.factor(CIleaf$leaf), CIleaf[, 2],
ylab="Sepal length", xlab="Regression tree leaf")
legend("bottomright",
c(" 0.975 quantile", " 0.75 quantile", " mean",
" 0.25 quantile", " 0.025 quantile"),
pch=c("-", "_", "_", "_", "-"),
pt.lwd=0.5, pt.cex=c(1, 1, 2, 1, 1), xjust=1)
Perhaps one option is a simple bootstrap of your training data?
library(rpart)
library(boot)
trainData <- iris[-150L, ]
predictData <- iris[150L, ]
rboot <- boot(trainData, function(data, idx) {
bootstrapData <- data[idx, ]
r1 <- rpart(Sepal.Length ~ ., bootstrapData, cp = 0.001)
predict(r1, newdata = predictData)
}, 1000L)
quantile(rboot$t, c(0.025, 0.975))
2.5% 97.5%
5.871393 6.766842