可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

How do you generate a prediction interval from a regression tree that is fit using rpart?

It is my understanding that a regression tree models the response conditional on the mean of the leaf nodes. I don't know how to get the variance for a leaf node from the model, but what I would like to do is simulate using the mean and variance for a leaf node to obtain a prediction interval.

Predict.rpart() doesn't give an option for interval.

Example: I fit a tree with iris data, but predict doesn't have an option, "interval"

> r1 <- rpart(Sepal.Length ~ ., cp = 0.001, data = iris[1:nrow(iris)-1,])
> predict(r1,newdata=iris[nrow(iris),],type = "interval")
Error in match.arg(type) : 
  'arg' should be one of “vector”, “prob”, “class”, “matrix”

回答1:

It is not clear to me what confidence intervals would mean for regression trees as those are not classical statistical models like linear models. And I see mainly two uses: characterising the certainty of your tree or characterizing the precision of the prediction for each leaf of the tree. Hereafter an answer for each of these possibilities.

Characterizing the certainty of your tree

If you are looking for a confidence value for a split node, then party provides that directly as it uses permutation tests and statistically determine which variables are most important and the p-value attached to each split. A significant superiority of party's ctree function over rpart as explained here.

Confidence intervals for set leafs of the regression tree

Third, if you are looking for a confidence of interval for the value in each leaf, then the [0.025,0.975] quantiles interval for the observations in the leaf is most likely what you are looking for. The default plots in party takes a similar approach when displaying boxplots for the output value for each leaf:

library("party")
r2 <- ctree(Sepal.Length ~ .,data=iris)
plot(r2)

Retrieving the corresponding intervals can simply be done by:

iris$leaf <- predict(r2, type="node")
CIleaf <- aggregate(iris$Sepal.Length,
                    by=list(leaf=iris$leaf),
                    quantile,
                    prob=c(0.025, 0.25, 0.75, 0.975))

And it's easy to visualize:

plot(as.factor(CIleaf$leaf), CIleaf[, 2],
     ylab="Sepal length", xlab="Regression tree leaf")
legend("bottomright",
       c(" 0.975 quantile", " 0.75 quantile", " mean", 
         " 0.25 quantile", " 0.025 quantile"),
       pch=c("-", "_", "_", "_", "-"),
       pt.lwd=0.5, pt.cex=c(1, 1, 2, 1, 1), xjust=1)

回答2:

Perhaps one option is a simple bootstrap of your training data?

library(rpart)
library(boot)

trainData <- iris[-150L, ]
predictData <- iris[150L, ]

rboot <- boot(trainData, function(data, idx) {
            bootstrapData <- data[idx, ]
            r1 <- rpart(Sepal.Length ~ ., bootstrapData, cp = 0.001)
            predict(r1, newdata = predictData)
        }, 1000L)

quantile(rboot$t, c(0.025, 0.975))
    2.5%    97.5% 
5.871393 6.766842