import pandas as pd
import numpy as np
import cv2
from torch.utils.data import Dataset


class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        # as_matrix() was removed from pandas; to_numpy() is the replacement.
        self.labels = pd.get_dummies(self.data['emotion']).to_numpy().astype('float32')
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        # Parse only the requested row instead of rebuilding every face on
        # each call, and return a single (face, label) pair.
        pixel_sequence = self.data['pixels'].iloc[index]
        face = [int(pixel) for pixel in pixel_sequence.split(' ')]
        face = np.asarray(face).reshape(self.height, self.width)
        face = cv2.resize(face.astype('uint8'), (self.width, self.height))
        face = face.astype('float32')
        face = np.expand_dims(face, -1)  # add a trailing channel dimension
        label = self.labels[index]
        if self.transform is not None:
            face = self.transform(face)
        return face, label

    def __len__(self):
        return len(self.data)
This is what I could manage to put together using references from other repositories. However, I want to split this dataset into train and test sets.
How can I do that inside this class? Or do I need to make a separate class for it?
This is the PyTorch Subset class in play: the random_split method returns Subset objects. Note that Subset is also the basis for the SubsetRandomSampler. For MNIST, if we use random_split:
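A minimal sketch of that split, assuming torchvision's MNIST (the root path 'data' and the 50,000/10,000 lengths are my choices, not necessarily the original answer's):

import torch
from torchvision import datasets, transforms

# Download the 60,000-sample MNIST training set.
mnist_train = datasets.MNIST('data', train=True, download=True,
                             transform=transforms.ToTensor())

# random_split shuffles the indices and returns two Subset objects.
test_ds, valid_ds = torch.utils.data.random_split(mnist_train, (50000, 10000))
print(test_ds.indices[:5], valid_ds.indices[:5])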
We get: our test_ds.indices and valid_ds.indices will be random indices drawn from range(0, 60000). But if I would like to get a sequence of indices from (0, 49999) and from (50000, 59999), I cannot do that at the moment, unfortunately, except this way (see the sketch below). Handy in case you run the MNIST benchmark, where it is predefined what should be the test and what should be the validation dataset.
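A sketch of that workaround, building Subset objects with explicit index ranges (it reuses the mnist_train dataset from the sketch above):

from torch.utils.data import Subset

# Sequential, non-random splits: first 50,000 samples for test,
# last 10,000 for validation.
test_ds = Subset(mnist_train, list(range(0, 50000)))
valid_ds = Subset(mnist_train, list(range(50000, 60000)))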
Custom dataset has a special meaning in PyTorch, but I think you meant any dataset. Let's check out the MNIST dataset (this is probably the most famous dataset for beginners):
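A self-contained sketch of that check, again with my own paths and split lengths; it assumes a torchvision version whose MNIST exposes the raw uint8 image tensor as .data:

import torch
from torchvision import datasets, transforms

train_ds = datasets.MNIST('data', train=True, download=True,
                          transform=transforms.ToTensor())
test_split, valid_split = torch.utils.data.random_split(train_ds, (50000, 10000))

# Index the underlying tensor with each Subset's indices to see the sizes.
print(train_ds.data.shape)                       # torch.Size([60000, 28, 28])
print(train_ds.data[test_split.indices].shape)   # torch.Size([50000, 28, 28])
print(train_ds.data[valid_split.indices].shape)  # torch.Size([10000, 28, 28])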
What this will output is the size of the original, [60000, 28, 28], and then the splits: [50000, 28, 28] for test and [10000, 28, 28] for validation. Additional info, in case you actually plan to pair images and labels (targets) together:
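One hedged way to do that pairing, continuing the sketch above with TensorDataset (the use of TensorDataset here is my choice, and it assumes a torchvision version that exposes both .data and .targets):

from torch.utils.data import TensorDataset

# Pair normalized images with their digit labels for the test split.
images = train_ds.data[test_split.indices].unsqueeze(1).float() / 255.0
targets = train_ds.targets[test_split.indices]
paired_test_ds = TensorDataset(images, targets)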
Starting in PyTorch 0.4.1 you can use random_split:
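A minimal sketch, assuming full_dataset is any Dataset instance (for example the CustomDatasetFromCSV from the question) and an 80/20 split of my choosing:

import torch

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
# random_split shuffles indices and returns two Subset objects.
train_dataset, test_dataset = torch.utils.data.random_split(
    full_dataset, [train_size, test_size])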
Current answers do random splits, which has the disadvantage that the number of samples per class is not guaranteed to be balanced. This is especially problematic when you want a small number of samples per class. For example, MNIST has 60,000 examples, i.e. 6,000 per digit. Assume that you want only 30 examples per digit in your training set. In this case, a random split may produce an imbalance between classes (one digit with more training data than the others). So you want to make sure each digit has precisely 30 labels. This is called stratified sampling.
One way to do this is using the sampler interface in PyTorch, and sample code is here.
Another way to do this is just to hack your way through :). For example, below is a simple implementation for MNIST, where ds is the MNIST dataset and k is the number of samples needed for each class.
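A hedged sketch of such a hack (the function name sampleFromClass and its return type, two TensorDatasets, are my choices; it assumes ds yields (tensor, int) pairs, as MNIST with ToTensor does):

import torch
from torch.utils.data import TensorDataset

def sampleFromClass(ds, k):
    # Keep the first k occurrences of every class for training;
    # everything after that goes to the test split.
    class_counts = {}
    train_data, train_label = [], []
    test_data, test_label = [], []
    for data, label in ds:
        c = int(label)
        class_counts[c] = class_counts.get(c, 0) + 1
        if class_counts[c] <= k:
            train_data.append(data)
            train_label.append(c)
        else:
            test_data.append(data)
            test_label.append(c)
    return (TensorDataset(torch.stack(train_data), torch.tensor(train_label)),
            TensorDataset(torch.stack(test_data), torch.tensor(test_label)))

You can use this function like this:

from torchvision import datasets, transforms

ds = datasets.MNIST('data', train=True, download=True,
                    transform=transforms.ToTensor())
train_ds, test_ds = sampleFromClass(ds, 30)  # exactly 30 samples per digit in train_ds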
Using PyTorch's SubsetRandomSampler:
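A sketch of the usual pattern, where dataset stands for any Dataset instance and the 80/20 split and batch size are my placeholders:

import numpy as np
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

validation_split = 0.2
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Each sampler draws only from its own index subset; shuffling is handled
# by the sampler, so the DataLoader must not shuffle again.
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_loader = DataLoader(dataset, batch_size=64, sampler=train_sampler)
validation_loader = DataLoader(dataset, batch_size=64, sampler=valid_sampler)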