MATLAB: 10-fold cross-validation without using existing functions

Posted 2020-02-26 02:36

Question:

I have a matrix, or rather a data structure (I guess in MATLAB you call it a struct):

  data: [150x4 double]
labels: [150x1 double]

Here is what my matrix.data looks like, assuming I load my file into a variable named matrix:

5.1000    3.5000    1.4000    0.2000
4.9000    3.0000    1.4000    0.2000
4.7000    3.2000    1.3000    0.2000
4.6000    3.1000    1.5000    0.2000
5.0000    3.6000    1.4000    0.2000
5.4000    3.9000    1.7000    0.4000
4.6000    3.4000    1.4000    0.3000
5.0000    3.4000    1.5000    0.2000
4.4000    2.9000    1.4000    0.2000
4.9000    3.1000    1.5000    0.1000
5.4000    3.7000    1.5000    0.2000
4.8000    3.4000    1.6000    0.2000
4.8000    3.0000    1.4000    0.1000
4.3000    3.0000    1.1000    0.1000
5.8000    4.0000    1.2000    0.2000
5.7000    4.4000    1.5000    0.4000
5.4000    3.9000    1.3000    0.4000
5.1000    3.5000    1.4000    0.3000
5.7000    3.8000    1.7000    0.3000
5.1000    3.8000    1.5000    0.3000

And here is what my matrix.labels looks like:

 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

I am trying to implement 10-fold cross-validation without using any of the existing functions in MATLAB, and due to my very limited MATLAB knowledge I am having trouble moving forward from what I have. Any help would be great.

This is what I have so far. I am sure this is probably not the MATLAB way, but I am very new to MATLAB.

function[output] = fisher(dataFile, number_of_folds)
    data = load(dataFile);
    %create random permutation indx
    idx = randperm(150);
    output = data.data(idx(1:15),:);
end

Answer 1:

Here is my take on this cross-validation. I create dummy data using magic(10) and generate the labels randomly. The idea is as follows: we take our data and labels and combine them with a random column. Consider the following dummy code, which uses magic(4) to keep the output small.

>> data = magic(4)

data =

    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1

>> dataRowNumber = size(data,1)

dataRowNumber =

     4

>> randomColumn = rand(dataRowNumber,1)

randomColumn =

    0.8147
    0.9058
    0.1270
    0.9134


>> X = [ randomColumn data]

X =

    0.8147   16.0000    2.0000    3.0000   13.0000
    0.9058    5.0000   11.0000   10.0000    8.0000
    0.1270    9.0000    7.0000    6.0000   12.0000
    0.9134    4.0000   14.0000   15.0000    1.0000

If we sort the rows of X by column 1, the data ends up in random order. This gives us the randomness needed for cross-validation. The next step is to divide X according to the cross-validation percentage. Doing this for a single split is easy enough. Say 75% of the rows are for training and 25% for testing. Our size here is 4, so 3 rows (3/4 = 75%) go to training and 1 row (1/4 = 25%) goes to testing.

testDataset = X(1,:)
trainDataset = X(2:4,:)
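
A minimal sketch of the shuffle-and-split step described above. It assumes sortrows is acceptable for ordering the rows (it keeps each row intact, whereas sort would reorder every column independently), and it drops the random key column before using the data. The variable name shuffledX is introduced here for illustration only.

```matlab
data = magic(4);
dataRowNumber = size(data,1);
randomColumn = rand(dataRowNumber,1);
X = [randomColumn data];

% order whole rows by the random key in column 1
shuffledX = sortrows(X,1);

% first row (25%) for testing, remaining three rows (75%) for training;
% column 1 is only the shuffle key, so it is dropped here
testDataset  = shuffledX(1,2:end);
trainDataset = shuffledX(2:4,2:end);
```

Every original row ends up in exactly one of the two sets, only in a different order.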

Doing this for N folds is a bit harder, since we need to repeat the split N times, so a for loop is necessary. For 5 folds over 10 rows, I get the following splits:

  1. 1st fold : 1 2 for test, 3:10 for train
  2. 2nd fold : 3 4 for test, 1 2 5:10 for train
  3. 3rd fold : 5 6 for test, 1:4 7:10 for train
  4. 4th fold : 7 8 for test, 1:6 9:10 for train
  5. 5th fold : 9 10 for test, 1:8 for train
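
The fold boundaries listed above can be generated with a short loop. This is only a sketch: it assumes the number of rows divides evenly by the number of folds, and the cell arrays testRowsPerFold and trainRowsPerFold are names made up for this illustration.

```matlab
dataRowNumber = 10;
crossValidationFolds = 5;
% assumes the row count divides evenly by the number of folds
numberOfRowsPerFold = dataRowNumber / crossValidationFolds;

testRowsPerFold  = cell(crossValidationFolds,1);
trainRowsPerFold = cell(crossValidationFolds,1);
for fold = 1:crossValidationFolds
    firstTestRow = (fold-1)*numberOfRowsPerFold + 1;
    testRows = firstTestRow:(firstTestRow + numberOfRowsPerFold - 1);
    % every row that is not a test row is a training row
    trainRows = [1:(firstTestRow-1), (testRows(end)+1):dataRowNumber];
    testRowsPerFold{fold}  = testRows;
    trainRowsPerFold{fold} = trainRows;
    fprintf('fold %d: test %s, train %s\n', fold, mat2str(testRows), mat2str(trainRows));
end
```

Note that 1:0 and 11:10 evaluate to empty ranges, so the first and last folds need no special case.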

The following code is an example of this process:

data = magic(10);
dataRowNumber = size(data,1);
labels = rand(dataRowNumber,1) > 0.5;
randomColumn = rand(dataRowNumber,1);

% column 1 is the random sort key, the last column is the label
X = [ randomColumn data labels];

% sortrows keeps each row intact; sort(X,1) would sort every column
% independently and break the link between features and labels
SortedData = sortrows(X,1);

crossValidationFolds = 5;
numberOfRowsPerFold = dataRowNumber / crossValidationFolds;

% the train and test rows of all folds are stacked one after another
crossValidationTrainData = [];
crossValidationTestData = [];
for startOfRow = 1:numberOfRowsPerFold:dataRowNumber
    testRows = startOfRow:startOfRow+numberOfRowsPerFold-1;
    if (startOfRow == 1)
        trainRows = [max(testRows)+1:dataRowNumber];
    else
        trainRows = [1:startOfRow-1 max(testRows)+1:dataRowNumber];
    end
    crossValidationTrainData = [crossValidationTrainData ; SortedData(trainRows ,:)];
    crossValidationTestData = [crossValidationTestData ;SortedData(testRows ,:)];

end
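
Because the loop stacks the rows of all folds into two large matrices, a single fold has to be sliced back out before it can be used. The sketch below rebuilds the stacked matrices on dummy data (using sortrows so that rows stay intact) and then extracts fold k. The names numberOfTrainRows, testBlock, trainBlock and the feature/label variables are assumptions made for this illustration, not part of the answer.

```matlab
% rebuild the stacked matrices on dummy data
data = magic(10);
dataRowNumber = size(data,1);
labels = rand(dataRowNumber,1) > 0.5;
SortedData = sortrows([rand(dataRowNumber,1) data labels],1);

crossValidationFolds = 5;
numberOfRowsPerFold = dataRowNumber / crossValidationFolds;
numberOfTrainRows = dataRowNumber - numberOfRowsPerFold;

crossValidationTrainData = [];
crossValidationTestData = [];
for startOfRow = 1:numberOfRowsPerFold:dataRowNumber
    testRows = startOfRow:startOfRow+numberOfRowsPerFold-1;
    trainRows = [1:startOfRow-1 max(testRows)+1:dataRowNumber];
    crossValidationTrainData = [crossValidationTrainData; SortedData(trainRows,:)];
    crossValidationTestData = [crossValidationTestData; SortedData(testRows,:)];
end

% slice fold k back out of the stacked matrices
k = 3;
testBlock  = crossValidationTestData((k-1)*numberOfRowsPerFold+1 : k*numberOfRowsPerFold, :);
trainBlock = crossValidationTrainData((k-1)*numberOfTrainRows+1 : k*numberOfTrainRows, :);

% column 1 is the random sort key, the last column is the label
testFeatures  = testBlock(:,2:end-1);
testLabels    = testBlock(:,end);
trainFeatures = trainBlock(:,2:end-1);
trainLabels   = trainBlock(:,end);
```

Storing each fold in a cell array instead of stacking would avoid the index arithmetic, at the cost of departing from the structure of the code above.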


Answer 2:

Hahaha sorry, no solution. Don't have MATLAB on me right now so can't check code for errors. But here's the general idea:

  1. Generate k (in your case 10) subsamples
    1. Start two counters at 1 and preallocate a new matrix: index = 1; subsample = 1; newmat = zeros(150,6), where 150 is the number of samples and 6 = 4 columns of data + 1 column of labels + 1 column we will use later
    2. While you still have data: while ( length(labels) > 0 )
    3. Generate a random number within the amount of data left: randNum = randi(length(labels)). randi(n) returns a random integer from 1 to n, so the result is always a valid row index
    4. Add that row to a new data set with its label: newmat(index,:) = [data(randNum,:) labels(randNum) subsample], where the last column is the subsample number from 1 to 10
    5. Delete the row from data and labels: data(randNum,:) = []; labels(randNum) = []. Note that this physically removes a row from the matrices, which is why we have to use a while loop and check for length > 0 rather than a for loop and simple indices
    6. Increment the counters: index = index + 1; subsample = subsample + 1;
    7. If subsample == 11, make it 1 again.

At the end of this, you should have a large data matrix that looks almost exactly like your original, but has randomly assigned "fold labels".

  2. Loop over all of this and your executing code k (10) times.

EDIT: code placed in more accessible manner. NOTE it's still pseudo-y code and is not complete! Also, you should note that this is NOT AT ALL the most efficient way, but shouldn't be too bad if you can't use matlab functions.

for k = 1:10

    % work on copies so that every repetition starts from the full data set
    remData = data; remLabels = labels;
    index = 1; subsample = 1; newmat = zeros(150,6);
    while ( ~isempty(remLabels) )
        randNum = randi(length(remLabels));
        newmat(index,:) = [remData(randNum,:) remLabels(randNum) subsample];
        remData(randNum,:) = []; remLabels(randNum) = [];
        index = index + 1; subsample = subsample + 1;
        if ( subsample == 11 )
            subsample = 1;
        end
    end

    % newmat is complete, now run code here using the sampled data
    % (i.e. pick a random number from 1:10 and use that as your validation fold, the rest for training)

end
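
Once newmat carries a fold number in its last column, a fold can be held out with logical indexing. The sketch below is self-contained and therefore builds newmat on dummy stand-ins for the asker's 150x4 data and 150x1 labels (the values and the three-class layout are assumptions). It selects fold k directly instead of drawing the validation fold at random.

```matlab
% dummy stand-ins for the real data and labels (assumption)
data = rand(150,4);
labels = [ones(50,1); 2*ones(50,1); 3*ones(50,1)];

% build newmat as described above: 4 feature columns, label, fold number
newmat = zeros(150,6);
index = 1; subsample = 1;
while ~isempty(labels)
    randNum = randi(length(labels));
    newmat(index,:) = [data(randNum,:) labels(randNum) subsample];
    data(randNum,:) = []; labels(randNum) = [];
    index = index + 1;
    subsample = mod(subsample,10) + 1;   % cycles 1,2,...,10,1,...
end

% hold out fold k for validation and train on the other nine folds
k = 1;
isValidation = (newmat(:,6) == k);
validationData   = newmat(isValidation,1:4);
validationLabels = newmat(isValidation,5);
trainData   = newmat(~isValidation,1:4);
trainLabels = newmat(~isValidation,5);
```

Looping k over 1:10 on the same newmat uses every row for validation exactly once, which is what 10-fold cross-validation requires.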

EDIT FOR ANSWER #2:

OK, another way is to create a vector that is as long as your data set:

foldLabels = zeros(150,1);

Then, looping for that long (150), assign labels to random indices!

foldL = 1;
numAssigned = 0;
while ( numAssigned < 150 )
    idx = randi(150);
    % no need to reassign a given label, so check if it is still 0
    if ( foldLabels(idx) == 0 )
        foldLabels(idx) = foldL;
        numAssigned = numAssigned + 1;
        foldL = foldL + 1;
        if ( foldL > 10 )
            foldL = 1;
        end
    end
end

EDIT FOR ANSWER #2.5

foldLabels = zeros(150,1);
notChosenLabels = 1:150;   % indices that have no fold label yet
foldL = 1;
while ( ~isempty(notChosenLabels) )
    labIdx = randi(length(notChosenLabels));
    idx = notChosenLabels(labIdx);
    foldLabels(idx) = foldL;
    foldL = foldL + 1;
    if ( foldL > 10 )
        foldL = 1;
    end
    notChosenLabels(labIdx) = [];
end

EDIT FOR RANDPERM

Generate the indices with randperm:

idxs = randperm(150);

Now just assign:

foldLabels = zeros(150,1);
sampleLabel = 1;
for i = 1:150
    foldLabels(idxs(i)) = sampleLabel;
    sampleLabel = sampleLabel + 1;
    if ( sampleLabel > 10 )
       sampleLabel = 1;
    end
end
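
To show how the fold labels might drive the full 10-fold loop, here is a hedged end-to-end sketch. The data is a dummy stand-in for matrix.data and matrix.labels, and the nearest-class-mean classifier is only a placeholder chosen for brevity. It is not from the answers above and is not the Fisher discriminant that the asker's function name suggests. The fold labels are assigned in one vectorized step that is equivalent to the loop above.

```matlab
% dummy stand-ins for matrix.data (150x4) and matrix.labels (150x1): assumption
data = [randn(50,4); randn(50,4)+5; randn(50,4)+10];
labels = [ones(50,1); 2*ones(50,1); 3*ones(50,1)];
numSamples = size(data,1);
numFolds = 10;

% deal fold labels 1..10 out over a random permutation of the rows
idxs = randperm(numSamples);
foldLabels = zeros(numSamples,1);
foldLabels(idxs) = mod(0:numSamples-1, numFolds) + 1;

classes = unique(labels);
accuracy = zeros(numFolds,1);
for k = 1:numFolds
    isTest = (foldLabels == k);
    trainData = data(~isTest,:);  trainLabels = labels(~isTest);
    testData  = data(isTest,:);   testLabels  = labels(isTest);

    % placeholder classifier (assumption): nearest class mean
    classMeans = zeros(numel(classes), size(data,2));
    for c = 1:numel(classes)
        classMeans(c,:) = mean(trainData(trainLabels == classes(c),:), 1);
    end
    predicted = zeros(size(testLabels));
    for r = 1:size(testData,1)
        diffs = classMeans - repmat(testData(r,:), numel(classes), 1);
        [~, nearest] = min(sum(diffs.^2, 2));
        predicted(r) = classes(nearest);
    end
    accuracy(k) = mean(predicted == testLabels);
end
meanAccuracy = mean(accuracy);
```

Replacing the placeholder block with the real training and prediction code leaves the fold handling unchanged.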