This question is a continuation of the same thread here. Below is a minimal working example taken from this book and its accompanying package ChemometricsWithR (page 250):

Wehrens R. Chemometrics with R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. 1st edition. Heidelberg; New York: Springer, 2011.

The example highlights some pitfalls when modeling using cross-validation techniques.
The Aim:
A cross-validated methodology, using the same set of repeated CVs, to perform a known strategy: PLS typically followed by LDA, or by cousins like logistic regression, SVM, C5.0, or CART, in the spirit of the caret package. So PLS would be needed every time before calling the waiting classifier, in order to classify the PLS score space instead of the observations themselves. The nearest approach in the caret package is doing PCA as a pre-processing step before modeling with any classifier. Below is a PLS-LDA procedure with only a single cross-validation to test the performance of the classifier; there was no 10-fold CV or any repetition. The code below was taken from the mentioned book, but with some corrections, as it otherwise throws an error:
library(ChemometricsWithR)
library(pls)   # plsr(), scores()
library(MASS)  # lda()
data(prostate)

prostate.clmat <- classvec2classmat(prostate.type)  # convert Y to a dummy variable matrix
odd  <- seq(1, length(prostate.type), by = 2)  # training set
even <- seq(2, length(prostate.type), by = 2)  # held-out test set

prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset = odd)

# center and scale the test set with the training-set statistics
Xtst <- scale(prostate[even, ], center = colMeans(prostate[odd, ]), scale = apply(prostate[odd, ], 2, sd))
tst.scores <- Xtst %*% prostate.pls$projection  # test-set scores for the waiting trained LDA

prostate.ldapls <- lda(scores(prostate.pls)[, 1:16], prostate.type[odd])  # LDA on the training scores
table(predict(prostate.ldapls, new = tst.scores[, 1:16])$class, prostate.type[even])

predictionTest <- predict(prostate.ldapls, new = tst.scores[, 1:16])$class

library(caret)
confusionMatrix(data = predictionTest, reference = prostate.type[even])  # from caret
Output:
Confusion Matrix and Statistics
Reference
Prediction bph control pca
bph 4 1 9
control 1 35 7
pca 34 4 68
Overall Statistics
Accuracy : 0.6564
95% CI : (0.5781, 0.7289)
No Information Rate : 0.5153
P-Value [Acc > NIR] : 0.0001874
Kappa : 0.4072
Mcnemar's Test P-Value : 0.0015385
Statistics by Class:
Class: bph Class: control Class: pca
Sensitivity 0.10256 0.8750 0.8095
Specificity 0.91935 0.9350 0.5190
Pos Pred Value 0.28571 0.8140 0.6415
Neg Pred Value 0.76510 0.9583 0.7193
Prevalence 0.23926 0.2454 0.5153
Detection Rate 0.02454 0.2147 0.4172
Detection Prevalence 0.08589 0.2638 0.6503
Balanced Accuracy 0.51096 0.9050 0.6643
However, the confusion matrix did not match the one in the book. In any case, the code in the book breaks as printed, whereas the corrected version here ran for me.
Notes:
Although this was only a single CV, the intention is to agree on this methodology first. The sd and mean of the training set were applied to the test set, which was additionally transformed into PLS scores based on a specific number of components (ncomp). I want this to occur in every round of the CV in caret (a sketch of one such repeated round is given just below). If the methodology in this code is correct, then it can perhaps serve as a good starting point for a minimal working example when modifying the code of the caret package.
Side Notes:
Scaling and centering can get very messy here. I think some of the PLS functions in R do scaling internally, with or without centering, but I am not sure which. So building a custom model in caret should be handled with care, to avoid both missing and repeated scalings or centerings (I am on my guard with these things).
Perils of multiple centering/scaling
The code below just shows how multiple centering/scaling can change the data. Only centering is shown here, but the same problem applies to scaling too.
set.seed(1)
x <- rnorm(200, 2, 1)
xCentered1 <- scale(x, center = TRUE, scale = FALSE)
xCentered2 <- scale(xCentered1, center = TRUE, scale = FALSE)
xCentered3 <- scale(xCentered2, center = TRUE, scale = FALSE)
sapply(list(xNotCentered = x, xCentered1 = xCentered1,
            xCentered2 = xCentered2, xCentered3 = xCentered3), mean)
Output:
xNotCentered xCentered1 xCentered2 xCentered3
2.035540e+00 1.897798e-16 -5.603699e-18 -5.332377e-18
Please drop a comment if I am missing something somewhere along the way. Thanks.
If you want to fit these types of models with caret, you would need to use the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.

My approach below is to jointly fit the PLS and the other model (I used random forest in the example below) and tune them at the same time. So for each fold, a 2D grid of ncomp and mtry is used.

The "trick" is to attach the PLS loadings to the random forest object so that they can be used during prediction time. Here is the code that defines the model (classification only):
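A sketch of such a definition, assuming caret's list-based custom-model interface (type, parameters, grid, fit, predict, prob, sort); the object name plsRf and the explicit centering via the stored Xmeans are choices of this sketch:

plsRf <- list(
  label = "PLS-RF",
  library = c("pls", "randomForest"),
  type = "Classification",
  # tune the number of PLS components and mtry jointly
  parameters = data.frame(parameter = c("ncomp", "mtry"),
                          class = c("numeric", "numeric"),
                          label = c("#Components",
                                    "#Randomly Selected Predictors")),
  grid = function(x, y, len = NULL, search = "grid") {
    g <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
                     mtry = 1:len)
    g[g$mtry <= g$ncomp, ]  # mtry cannot exceed the number of score columns
  },
  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
    # fit the PLS discriminant model, then a random forest on its scores
    pre <- caret::plsda(x, y, ncomp = param$ncomp)
    scores <- sweep(as.matrix(x), 2, pre$Xmeans) %*% pre$projection
    mod <- randomForest::randomForest(scores, y, mtry = param$mtry, ...)
    # attach the projection (and column means) so prediction can reuse them
    mod$projection <- pre$projection
    mod$Xmeans <- pre$Xmeans
    mod
  },
  predict = function(modelFit, newdata, submodels = NULL) {
    scores <- sweep(as.matrix(newdata), 2, modelFit$Xmeans) %*% modelFit$projection
    predict(modelFit, scores)
  },
  prob = NULL,
  sort = function(x) x[order(x$ncomp, x$mtry), ],
  levels = function(x) x$classes
)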
and here is the call to train:
Just for kicks, I got:

and
Max
Based on Max's valuable comments, I felt the need to have the iris data as a referee: it is famous for classification and, more importantly, its Species outcome has more than two classes, which makes it a good data set for testing the PLS-LDA custom model in caret and then comparing the two models (a sketch of both fits and the comparison is given below):
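A sketch of that test, built the same way as Max's model above but with MASS::lda() fitted on the PLS scores; the names plsLda, plslda_iris, and plsda_iris are placeholders of this sketch:

library(caret)
library(MASS)

plsLda <- list(
  label = "PLS-LDA",
  library = c("pls", "MASS"),
  type = "Classification",
  parameters = data.frame(parameter = "ncomp", class = "numeric",
                          label = "#Components"),
  grid = function(x, y, len = NULL, search = "grid")
    data.frame(ncomp = seq(1, min(ncol(x) - 1, len), by = 1)),
  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
    # PLS first, then LDA on the component scores
    pre <- caret::plsda(x, y, ncomp = param$ncomp)
    scores <- sweep(as.matrix(x), 2, pre$Xmeans) %*% pre$projection
    mod <- MASS::lda(scores, y, ...)
    mod$projection <- pre$projection  # keep the projection for prediction time
    mod$Xmeans <- pre$Xmeans
    mod
  },
  predict = function(modelFit, newdata, submodels = NULL) {
    scores <- sweep(as.matrix(newdata), 2, modelFit$Xmeans) %*% modelFit$projection
    predict(modelFit, scores)$class
  },
  prob = function(modelFit, newdata, submodels = NULL) {
    scores <- sweep(as.matrix(newdata), 2, modelFit$Xmeans) %*% modelFit$projection
    predict(modelFit, scores)$posterior
  },
  sort = function(x) x[order(x$ncomp), ],
  levels = function(x) x$lev
)

ctrl <- trainControl(method = "repeatedcv", repeats = 5)

set.seed(2)
plslda_iris <- train(Species ~ ., data = iris,
                     method = plsLda, tuneLength = 3, trControl = ctrl)

set.seed(2)  # same seed, so both models see identical folds
plsda_iris <- train(Species ~ ., data = iris,
                    method = "pls", tuneLength = 3, trControl = ctrl)

# compare the cross-validated performance of the two models
summary(resamples(list(plsLda = plslda_iris, plsDa = plsda_iris)))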
So, finally, there was the EXPECTED difference, and an improvement in the metrics. This supports Max's notion that, in two-class problems, because of the Bayes probabilistic approach of the plsda function, both models lead to the same results.

If the scenario were to build a custom model of the PLS-LDA type according to the code kindly provided by Max (the maintainer of caret), something was not correct in that code, but I could not figure out what. I used the Sonar data set, the same one as in the caret vignette, and tried to reproduce the result once using method="pls" and once using the custom PLS-LDA model below; the results were exactly identical, even to the last digit, which was nonsensical. For benchmarking, one needs a known data set (I think a cross-validated PLS-LDA of the iris data set would fit here, as it is famous for this type of analysis and there should be a cross-validated treatment of it somewhere), and everything should be the same (the set.seed(xxx) and the number of K-CV repetitions) except the code in question, so as to compare rightly and to judge the code below.

Based on Zach's request, the code below is for method="pls" in caret, exactly the same concrete example as in the caret vignette on CRAN:
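A sketch following the vignette's outline (a 75% partition of Sonar, repeated CV with ROC as the metric); the seeds are placeholders of this sketch:

library(caret)
library(mlbench)
data(Sonar)

set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[inTrain, ]
testing  <- Sonar[-inTrain, ]

ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

set.seed(108)
plsFit <- train(Class ~ ., data = training,
                method = "pls", tuneLength = 15,
                preProc = c("center", "scale"),
                metric = "ROC", trControl = ctrl)

plsClasses <- predict(plsFit, newdata = testing)
confusionMatrix(data = plsClasses, testing$Class)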
Now, the code below is a pilot run classifying the Sonar data with the custom PLS-LDA model which is under question; it was expected to come up with numbers different from those using PLS only. The results of the two models are then compared (see the sketch below):
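A sketch of that pilot run, reusing the plsLda list from the iris sketch above together with the training/testing split and the ctrl object from the previous sketch; the seed matches the method="pls" fit so that the resampling folds are identical:

set.seed(108)  # same seed as the method = "pls" fit above
plsLdaFit <- train(Class ~ ., data = training,
                   method = plsLda,   # the custom PLS-LDA model list
                   tuneLength = 15,
                   preProc = c("center", "scale"),
                   metric = "ROC", trControl = ctrl)

# compare the two models on the held-out samples
plsLdaClasses <- predict(plsLdaFit, newdata = testing)
confusionMatrix(data = plsLdaClasses, testing$Class)
confusionMatrix(data = plsClasses, testing$Class)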
So, the results are exactly the same, which cannot be. It is as if the lda model were not added at all.

plsr and lda center the data their own way

I had a closer look at caret::preProcess(): as it is defined now, you will not be able to use PLS as a preprocessing method, because PLS is supervised but caret::preProcess() uses unsupervised methods only (there is no way to hand over the dependent variable). This would probably make patching rather difficult. So inside the caret framework, you'll need to go for a custom model.
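To illustrate the point, a minimal sketch: preProcess() is fitted on the predictors alone, so there is nowhere to pass the outcome that a supervised projection such as PLS would need (PCA serves here only as the unsupervised stand-in):

library(caret)
data(iris)

# preProcess() sees only the predictor matrix; the outcome never enters
pp <- preProcess(iris[, 1:4], method = c("center", "scale", "pca"))
head(predict(pp, iris[, 1:4]))  # unsupervised scores; Species played no role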