How to count the observations falling in each node

Posted 2019-02-05 16:19

Question:

I am currently working with the wine data in the MMST package. I have split the dataset into training and test sets and built a tree with the following code:

library("rpart")
library("gbm")
library("randomForest")
library("MMST")

data(wine)
aux <- c(1:178)
train_indis <- sample(aux, 142, replace = FALSE)
test_indis <- setdiff(aux, train_indis)

train <- wine[train_indis,]
test <- wine[test_indis,]    #### divide the dataset into training and testing

model.control <- rpart.control(minsplit = 5, xval = 10, cp = 0)
fit_wine <- rpart(class ~ MalicAcid + Ash + AlcAsh + Mg + Phenols + Proa + Color + Hue + OD + Proline, data = train, method = "class", control = model.control)

windows()    # Windows-only graphics device; dev.new() works on other platforms
plot(fit_wine, branch = 0.5, uniform = TRUE, compress = TRUE, main = "Full Tree: without pruning")
text(fit_wine, use.n = TRUE, all = TRUE, cex = .6)

And I get an image like this:

What do the numbers under each node (for example, 0/1/48 under Grignolino) mean? If I want to know how many training and test samples fall into each node, what code should I write?

Answer 1:

The numbers are the counts of each class among the training observations in that node. So, the label "0 / 1 / 48" tells us that there are 0 cases of category 1 (Barbera, I infer), only one case of category 2 (Barolo), and 48 of category 3 (Grignolino).

You can get detailed information about the tree and each node using summary(fit_wine).
See ?summary.rpart for more details.
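
For a quicker look than the full summary() printout, the frame component of the fitted object holds one row per node, and its n column is the number of training observations that reached that node (internal nodes included). A minimal sketch, assuming the built-in iris data as a stand-in for wine so that it runs without MMST; fit_iris, iris_train and node_info are illustrative names, not part of the question:

```r
library(rpart)

set.seed(1)                                  # arbitrary seed, only for reproducibility
train_idx  <- sample(nrow(iris), 120)        # assumption: iris stands in for the wine data
iris_train <- iris[train_idx, ]

fit_iris <- rpart(Species ~ ., data = iris_train, method = "class",
                  control = rpart.control(minsplit = 5, cp = 0))

# One row per node; the row names are the node numbers shown by print() and summary()
node_info <- fit_iris$frame[, c("var", "n")]
node_info$is_leaf <- node_info$var == "<leaf>"   # terminal nodes are marked "<leaf>"
print(node_info)
```

The first row is the root, so its n equals the size of the training set, and the n values of the leaf rows add up to the same total.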

You can additionally use predict() (which will call predict.rpart()) to see how the tree classifies a dataset, for example predict(fit_wine, train, type = "class"). Or wrap it in a table for easy viewing: table(predict(fit_wine, train, type = "class"), train[,"class"])

If you specifically want to know which leaf node an observation falls in, this information is stored in fit_wine$where. For each case in the training set, fit_wine$where contains the row number of fit_wine$frame that represents the leaf node where the case falls. So we can get the leaf information for each case with:

trainingnodes <- rownames(fit_wine$frame)[fit_wine$where]
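
Once each training case has a node label, counting is just a matter of tabulating those labels. A rough sketch of that step, again assuming iris as a stand-in for wine (the object names are illustrative):

```r
library(rpart)

set.seed(1)                                  # arbitrary seed, only for reproducibility
train_idx  <- sample(nrow(iris), 120)        # assumption: iris stands in for the wine data
iris_train <- iris[train_idx, ]

fit_iris <- rpart(Species ~ ., data = iris_train, method = "class",
                  control = rpart.control(minsplit = 5, cp = 0))

# Leaf node number for every training row
train_nodes <- rownames(fit_iris$frame)[fit_iris$where]

# Number of training observations in each leaf
leaf_counts <- table(node = train_nodes)

# Per-class counts in each leaf; each row corresponds to the "a/b/c" label
# that text(..., use.n = TRUE) prints under that leaf
leaf_by_class <- table(node = train_nodes, class = iris_train$Species)

print(leaf_counts)
print(leaf_by_class)
```

Note that this covers leaves only, because where records the terminal node of each case.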

In order to get the leaf info for test data, I used to run predict() with type="matrix" and infer it. This returns, confusingly, a matrix produced by concatenating the predicted class, the class counts at that node in the fitted tree, and the class probabilities. So for this example:

testresults <- predict(fit_wine, test, type = "matrix")
testresults <- data.frame(testresults)
# Name the first seven columns; some rpart versions append a node-probability column
names(testresults)[1:7] <- c("ClassGuess","NofClass1onNode", "NofClass2onNode",
     "NofClass3onNode", "PClass1", "PClass2", "PClass3")

From this, we can infer the different nodes, e.g., from unique(testresults[,2:4]), but it is inelegant, and two leaves with identical class counts cannot be told apart.

However, Yuji has a clever hack for this in a previous question. He copies the rpart object and substitutes the node numbers for the classes, so running predict returns the node, not the class:

nodes_wine <- fit_wine
nodes_wine$frame$yval <- as.numeric(rownames(nodes_wine$frame))
testnodes <- predict(nodes_wine, test, type="vector")
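
To turn those node numbers into counts, tabulate them against the full list of leaves so that leaves receiving no test cases still show up as 0. A sketch under the same assumption as before (iris standing in for wine, illustrative object names), putting training and test counts side by side:

```r
library(rpart)

set.seed(1)                                  # arbitrary seed, only for reproducibility
train_idx  <- sample(nrow(iris), 120)        # assumption: iris stands in for the wine data
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

fit_iris <- rpart(Species ~ ., data = iris_train, method = "class",
                  control = rpart.control(minsplit = 5, cp = 0))

# Copy the tree and replace each node's fitted value with its node number
nodes_iris <- fit_iris
nodes_iris$frame$yval <- as.numeric(rownames(nodes_iris$frame))

test_nodes  <- predict(nodes_iris, iris_test, type = "vector")
train_nodes <- rownames(fit_iris$frame)[fit_iris$where]

# Tabulate both sets over the same leaves so that empty leaves appear as 0
leaves <- rownames(fit_iris$frame)[fit_iris$frame$var == "<leaf>"]
counts <- data.frame(
  node  = leaves,
  train = as.vector(table(factor(train_nodes, levels = leaves))),
  test  = as.vector(table(factor(test_nodes,  levels = leaves)))
)
print(counts)
```

The train column sums to the training-set size and the test column to the test-set size, which is a handy sanity check that every case was assigned to a leaf.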

I've included the solution here, but people should go upvote him.