I'm taking part in the Coursera Practical Machine Learning course, and the coursework requires building predictive models using this dataset. After splitting the data into training and testing datasets, based on the outcome of interest (herewith labelled y, but in fact the classe variable in the dataset):
inTrain <- createDataPartition(y = data$y, p = 0.75, list = F)
training <- data[inTrain, ]
testing <- data[-inTrain, ]
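(For reproducibility, the partition above can be seeded; here is a minimal self-contained sketch using iris as a stand-in dataset, since the real data isn't shown, with Species playing the role of the classe/y outcome:)

```r
library(caret)

set.seed(123)  # make the random partition reproducible
# iris stands in for the real dataset; Species plays the role of classe/y
inTrain  <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]
```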
I have tried two different methods:
modFit <- caret::train(y ~ ., method = "rpart", data = training)
pred <- predict(modFit, newdata = testing)
confusionMatrix(pred, testing$y)
vs.
modFit <- rpart::rpart(y ~ ., data = training)
pred <- predict(modFit, newdata = testing, type = "class")
confusionMatrix(pred, testing$y)
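(For what it's worth, I also wondered whether the gap disappears if caret is told to skip resampling and to use rpart's default complexity parameter cp = 0.01; a sketch of that call, reusing the training/testing frames from above, which I haven't benchmarked:)

```r
library(caret)

# Fit a single tree with no resampling, pinning cp at rpart's default of 0.01,
# so that train() should mirror a plain rpart() fit (untested assumption)
modFit <- caret::train(y ~ ., method = "rpart", data = training,
                       tuneGrid  = data.frame(cp = 0.01),
                       trControl = trainControl(method = "none"))
pred <- predict(modFit, newdata = testing)
confusionMatrix(pred, testing$y)
```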
I would have assumed they would give identical or very similar results, as the first method loads the rpart package (suggesting to me that it uses this package under the hood). However, both the timings (caret is much slower) and the results are very different:
Method 1 (caret):
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1264 374 403 357 118
B 25 324 28 146 124
C 105 251 424 301 241
D 0 0 0 0 0
E 1 0 0 0 418
Method 2 (rpart):
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1288 176 14 79 25
B 36 569 79 32 68
C 31 88 690 121 113
D 14 66 52 523 44
E 26 50 20 49 651
As you can see, the second approach is the better classifier; the first method never predicts class D at all and is very poor for class E.
I realise this may not be the most appropriate place to ask this question, but I would really appreciate a greater understanding of this and related issues. caret seems like a great package for unifying model methods and call syntax, but I'm now hesitant to use it.