PCA dimension reduction for classification

Posted 2019-01-28 10:22

Question:

I am using Principal Component Analysis (PCA) on features extracted from different layers of a CNN. I downloaded the dimension reduction toolbox from here.

I have 11232 training images with 6532 features each, so the training feature matrix is 11232x6532. Keeping the top 90% of the features works without trouble, and an SVM trained on the reduced data reaches a training accuracy of 81.73%, which is fair. However, the testing data has 2408 images with 6532 features each, so its feature matrix is 2408x6532. When I run the same reduction on it, the output for the top 90% of the features is wrong: it comes out as 2408x2408, and the testing accuracy is 25%. Without dimension reduction, the training accuracy is 82.17% and the testing accuracy is 79%.
Update: Here is the PCA function, where X is the data and no_dims is the required number of output dimensions. The function returns the variable mappedX and the structure mapping.

    % Make sure data is zero mean
    mapping.mean = mean(X, 1);
    X = bsxfun(@minus, X, mapping.mean);

    % Compute covariance matrix
    if size(X, 2) < size(X, 1)
        C = cov(X);
    else
        C = (1 / size(X, 1)) * (X * X');        % if D >= N, the smaller N x N matrix is cheaper to eigendecompose
    end

    % Perform eigendecomposition of C
    C(isnan(C)) = 0;
    C(isinf(C)) = 0;
    [M, lambda] = eig(C);

    % Sort eigenvectors in descending order
    [lambda, ind] = sort(diag(lambda), 'descend');
    if no_dims < 1
        no_dims = find(cumsum(lambda ./ sum(lambda)) >= no_dims, 1, 'first');
        disp(['Embedding into ' num2str(no_dims) ' dimensions.']);
    end
    if no_dims > size(M, 2)
        no_dims = size(M, 2);
        warning(['Target dimensionality reduced to ' num2str(no_dims) '.']);
    end
    M = M(:,ind(1:no_dims));
    lambda = lambda(1:no_dims);

    % Apply mapping on the data
    if ~(size(X, 2) < size(X, 1))
        M = bsxfun(@times, X' * M, (1 ./ sqrt(size(X, 1) .* lambda))');     % normalize in order to get eigenvectors of covariance matrix
    end
    mappedX = X * M;

    % Store information for out-of-sample extension
    mapping.M = M;
    mapping.lambda = lambda;

Based on your suggestion, I computed the basis vectors from the training data:

numberOfDimensions = round(0.9*size(Feature,2));
[mapped_data, mapping] = compute_mapping(Feature, 'PCA', numberOfDimensions);

Then I applied the same basis vectors to the testing data:

mappedX_test = Feature_test * mapping.M;

The testing accuracy is still only 32%.

Solved by subtracting the training mean from the test data first:

Y = bsxfun(@minus, Feature_test, mapping.mean);
mappedX_test = Y * mapping.M;

Answer 1:

It looks like you are performing dimensionality reduction on the training and testing data separately. PCA finds a new representation of your data along a set of orthogonal axes, and those axes must be derived from the training data alone, so you need to keep the basis vectors computed during training. During testing, you apply exactly the same transformation as for the training data, because the test data has to be expressed with respect to the same basis vectors. You are getting a 2408 x 2408 matrix only because you are running PCA on the test examples themselves: it is impossible to produce more basis vectors than the rank of the matrix in question, which here is at most 2408.
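The rank limit can be checked on a small synthetic example. This is only a sketch: the matrix sizes and the tolerance are made up for illustration and do not come from the question's data. It counts how many directions carry non-negligible variance when there are fewer samples than features.

```matlab
% Synthetic stand-in for a test feature matrix with fewer rows than columns.
rng(0);                                  % fixed seed so the run is repeatable
X = randn(20, 50);                       % 20 samples, 50 features (hypothetical sizes)
Xc = bsxfun(@minus, X, mean(X, 1));      % centre each feature column

% Singular values of the centred data; each non-zero value is one usable direction.
s = svd(Xc);
numComponents = sum(s > 1e-10 * s(1));   % tolerance chosen for illustration only

% The count cannot exceed the number of samples, whatever the feature count.
disp(numComponents);
```

In the quoted toolbox code the same cap shows up explicitly: when there are at least as many features as samples, C is an N x N matrix, so M has only N columns and no_dims is reduced to size(M, 2).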

Retain your basis vectors from the training stage and reuse them when you classify in the testing stage. Remember that PCA requires centring the data by mean subtraction before the dimensionality reduction. In your code, the basis vectors are stored in mapping.M and the associated mean vector in mapping.mean. In the testing stage, subtract the training-stage mapping.mean from your test data:

Y = bsxfun(@minus, Feature_test, mapping.mean);

Once you have this, project the centred test data onto the training basis to reduce its dimensionality:

mappedX_test = Y * mapping.M;
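The whole train-then-test workflow can be sketched without the toolbox, using a plain SVD. The random matrices, their sizes, and the names Xtrain, Xtest, mu, W, Ztrain and Ztest are all hypothetical stand-ins for Feature, Feature_test, mapping.mean and mapping.M. The sketch also picks the number of components by retained variance. That differs from round(0.9*size(Feature,2)), which keeps 90% of the dimension count; the PCA code quoted in the question applies a variance criterion only when no_dims is below 1.

```matlab
% Synthetic stand-ins for the training and testing features (hypothetical sizes).
rng(0);
Xtrain = randn(200, 30);
Xtest  = randn(50, 30);

% Fit PCA on the training data only: mean and principal directions.
mu = mean(Xtrain, 1);
Xc = bsxfun(@minus, Xtrain, mu);
[~, S, V] = svd(Xc, 'econ');                  % columns of V are the principal directions

% Keep the smallest number of components that retains 90% of the variance.
varExplained = diag(S).^2 / sum(diag(S).^2);
k = find(cumsum(varExplained) >= 0.9, 1, 'first');
W = V(:, 1:k);

% Project both sets using the training mean and the training basis.
Ztrain = Xc * W;
Ztest  = bsxfun(@minus, Xtest, mu) * W;

disp(size(Ztrain));
disp(size(Ztest));
```

Both projected matrices end up with the same number of columns, which is what the classifier needs: the test set keeps its own row count but lives in the feature space learned from the training set.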