I am trying to create a binary CNN classifier for an unbalanced dataset (class 0 = 4000 images, class 1 = around 250 images), which I want to perform 5-fold cross validation on. Currently I am loading my training set into an ImageLoader that applies my transformations/augmentations(?) and loads it into a DataLoader. However, this results in both my training splits and validation splits containing the augmented data.
I originally applied transformations offline (offline augmentation?) to balance my dataset, but from this thread (https://stats.stackexchange.com/questions/175504/how-to-do-data-augmentation-and-train-validate-split), it seems it would be ideal to only augment the training set. I would also prefer to train my model on solely augmented training data and then validate it on non-augmented data in a 5-fold cross validation
My data is organized as root/label/images, where there are 2 label folders (0 and 1) and images sorted into the respective labels.
My Code So Far
total_set = datasets.ImageFolder(ROOT, transform = data_transforms['my_transforms'])
//Eventually I plan to run cross-validation as such:
splits = KFold(cv = 5, shuffle = True, random_state = 42)
for train_idx, valid_idx in splits.split(total_set):
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)
train_loader = torch.utils.data.DataLoader(total_set, batch_size=32, sampler=train_sampler)
val_loader = torch.utils.data.DataLoader(total_set, batch_size=32, sampler=valid_sampler)
model.train()
//Model train/eval works but may be overpredict
I'm sure I'm doing something sub-optimally or wrong in this code, but I can't seem to find any documentation on specifically augmenting only the training splits in cross-validation!
Any help would be appreciated!