Calculate cross validation for Generalized Linear Model


Question:

I am doing a regression using a Generalized Linear Model. I was caught off guard by the crossval function. My implementation so far:

% x: a dataset whose first seven columns are the predictors
% and whose eighth column is the response
X = x(:,1:7);
Y = x(:,8);

% 70/30 holdout partition
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);

% fit a Poisson GLM on the training set
mdl = GeneralizedLinearModel.fit(Xtrain,Ytrain,'linear','distr','poisson');

% RMSE on the held-out test set
Ypred  = predict(mdl,Xtest);
res = (Ypred - Ytest);
RMSE_test = sqrt(mean(res.^2));

The code below calculates cross validation for multiple regression, as obtained from this link. I want something similar for a Generalized Linear Model.

c = cvpartition(Y,'k',10);
regf=@(Xtrain,Ytrain,Xtest)(Xtest*regress(Ytrain,Xtrain));
cvMse = crossval('mse',X,Y,'predfun',regf)

Answer 1:

You can either perform the cross-validation process manually (train a model for each fold, predict the outcome, compute the error, then report the average across all folds), or you can use the CROSSVAL function, which wraps this whole procedure in a single call.

To give an example, I will first load and prepare a dataset (a subset of the cars dataset which ships with the Statistics Toolbox):

% load regression dataset
load carsmall
X = [Acceleration Cylinders Displacement Horsepower Weight];
Y = MPG;

% remove instances with missing values
missIdx = isnan(Y) | any(isnan(X),2);
X(missIdx,:) = [];
Y(missIdx) = [];

clearvars -except X Y

Option 1

Here we manually partition the data into K folds using cvpartition (non-stratified). For each fold, we train a GLM model on the training data, then use it to predict the output of the testing data. Next we compute and store the mean squared error for that fold. At the end, we report the average RMSE across all folds.

% partition data into 10 folds
K = 10;
cv = cvpartition(numel(Y), 'kfold',K);

mse = zeros(K,1);
for k=1:K
    % training/testing indices for this fold
    trainIdx = cv.training(k);
    testIdx = cv.test(k);

    % train GLM model
    mdl = GeneralizedLinearModel.fit(X(trainIdx,:), Y(trainIdx), ...
        'linear', 'Distribution','poisson');

    % predict regression output
    Y_hat = predict(mdl, X(testIdx,:));

    % compute mean squared error
    mse(k) = mean((Y(testIdx) - Y_hat).^2);
end

% average RMSE across k-folds
avrg_rmse = mean(sqrt(mse))
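
As an aside, note that avrg_rmse averages the per-fold RMSE values; pooling the fold MSEs first and then taking the square root is an equally common convention (and is roughly what Option 2 below reports), but the two are not identical:

% two ways to aggregate the per-fold errors; since sqrt is concave,
% mean(sqrt(mse)) <= sqrt(mean(mse)) by Jensen's inequality
rmse_avg_of_folds = mean(sqrt(mse))   % Option 1's convention
rmse_pooled       = sqrt(mean(mse))   % roughly what Option 2 reports

Either number is fine to report, as long as you use the same convention when comparing models.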

Option 2

Here we simply call CROSSVAL with a function handle that computes the regression output given a set of train/test instances. See the doc page to understand the parameters.

% prediction function given training/testing instances
fcn = @(Xtr, Ytr, Xte) predict(...
    GeneralizedLinearModel.fit(Xtr,Ytr,'linear','distr','poisson'), ...
    Xte);

% perform cross-validation, and return average MSE across folds
mse = crossval('mse', X, Y, 'Predfun',fcn, 'kfold',10);

% compute root mean squared error
avrg_rmse = sqrt(mse)

You should get a similar result to before (slightly different, of course, on account of the randomness involved in cross-validation).
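
If you want the two options to be directly comparable, you can remove that randomness by seeding the generator and sharing a single cvpartition object between them (CROSSVAL accepts one through its 'partition' parameter in place of 'kfold'). A minimal sketch, assuming the X/Y variables prepared above are still in the workspace:

% seed the RNG and build one shared (non-stratified) 10-fold partition
rng(1);
c = cvpartition(numel(Y), 'kfold',10);

% Option 2 on the shared folds: pass 'partition' instead of 'kfold'
fcn = @(Xtr, Ytr, Xte) predict(...
    GeneralizedLinearModel.fit(Xtr,Ytr,'linear','distr','poisson'), ...
    Xte);
mse = crossval('mse', X, Y, 'Predfun',fcn, 'partition',c)

Reusing c in place of cv inside the Option 1 loop then makes both computations run on identical train/test splits. As a side note, newer MATLAB releases (R2013b and later) provide fitglm as the documented shorthand for GeneralizedLinearModel.fit.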