I am trying to implement a Siamese network that takes in two images. I load these images and create two separate dataloaders.
In my loop I want to go through both dataloaders simultaneously so that I can train the network on both images.
for i, data in enumerate(zip(dataloaders1, dataloaders2)):
    # get the inputs from both loaders
    inputs1 = data[0][0].cuda(async=True)
    labels1 = data[0][1].cuda(async=True)
    inputs2 = data[1][0].cuda(async=True)
    labels2 = data[1][1].cuda(async=True)
    labels1 = labels1.view(batchSize, 1)
    labels2 = labels2.view(batchSize, 1)
    # zero the parameter gradients
    optimizer.zero_grad()
    # forward + backward + optimize
    outputs1 = alexnet(inputs1)
    outputs2 = alexnet(inputs2)
The return value of each dataloader is a tuple. However, when I try to use zip to iterate over them, I get the following error:
OSError: [Errno 24] Too many open files
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f2d3c00c190>> ignored
Shouldn't zip work on any iterable? It seems that here I can't use it on dataloaders.
Is there any other way to pursue this? Or am I approaching the implementation of a Siamese network incorrectly?
To complete @ManojAcharya's answer:

The error you are getting comes neither from zip() nor from DataLoader() directly. Python is trying to tell you that it couldn't find one of the data files you are asking for (cf. the FileNotFoundError in the exception trace), probably in your Dataset.

Find below a working example using DataLoader and zip together. Note that if you want to shuffle your data, it becomes difficult to keep the correspondence between the two datasets. This justifies @ManojAcharya's solution.
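A minimal sketch of such an example; DummyDataset here is just a stand-in that returns an index and a dummy label, so substitute your own Dataset:

from torch.utils.data import Dataset, DataLoader

class DummyDataset(Dataset):
    # toy dataset: sample i is the pair (i, i % 2)
    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __len__(self):
        return self.end - self.start

    def __getitem__(self, index):
        item = self.start + index
        return item, item % 2

dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)

for i, (data1, data2) in enumerate(zip(dataloaders1, dataloaders2)):
    inputs1, labels1 = data1
    inputs2, labels2 = data2
    print(i, inputs1.shape, inputs2.shape)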
Adding on @Aldream's solution: for the case when the datasets have varying lengths and we want to pass through all of them in the same epoch, we can use cycle() from itertools, a Python standard-library module. Using @Aldream's code snippet, the updated code will look like:
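A sketch of the updated loop, reusing the DummyDataset from the example above; do_cool_things() stands in for the actual training step:

from itertools import cycle
from torch.utils.data import DataLoader

dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)  # 100 samples
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)  # 200 samples

num_epochs = 10
for epoch in range(num_epochs):
    # cycle() restarts the shorter loader, so zip() runs until the longer one is exhausted
    for i, (data1, data2) in enumerate(zip(cycle(dataloaders1), dataloaders2)):
        do_cool_things(data1, data2)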
With only zip(), the iterator is exhausted when its length equals that of the smallest dataset (here 100). But with cycle(), the smallest dataset is repeated until the iterator has seen all the samples from the largest dataset (here 200).

P.S. One can always argue that this approach may not be required to achieve convergence as long as one samples randomly, but with this approach the evaluation might be easier.
I see you are struggling to write a proper dataloader function. I would do:
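One way to do that is a single wrapper Dataset that returns a pair of samples. The SiameseDataset class below is illustrative and assumes the two underlying datasets (dataset1, dataset2) are index-aligned and of equal length; batchSize and alexnet are taken from the question:

from torch.utils.data import Dataset, DataLoader

class SiameseDataset(Dataset):
    # pairs two index-aligned datasets of equal length
    def __init__(self, dataset1, dataset2):
        assert len(dataset1) == len(dataset2)
        self.dataset1 = dataset1
        self.dataset2 = dataset2

    def __len__(self):
        return len(self.dataset1)

    def __getitem__(self, index):
        img1, label1 = self.dataset1[index]
        img2, label2 = self.dataset2[index]
        return img1, label1, img2, label2

loader = DataLoader(SiameseDataset(dataset1, dataset2), batch_size=batchSize, shuffle=True)
for inputs1, labels1, inputs2, labels2 in loader:
    outputs1 = alexnet(inputs1.cuda())
    outputs2 = alexnet(inputs2.cuda())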
Further to what has already been mentioned, cycle() and zip() might create a memory-leak problem, especially when using image datasets! To solve that, instead of iterating like this:
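(Both snippets below are a sketch of the pattern, reusing dataloaders1/dataloaders2 and the do_cool_things() placeholder from the answers above.)

from itertools import cycle

for epoch in range(num_epochs):
    # the zip(cycle(...)) pattern from the previous answer, which may leak memory with image datasets
    for i, (data1, data2) in enumerate(zip(cycle(dataloaders1), dataloaders2)):
        do_cool_things(data1, data2)

you could use:

for epoch in range(num_epochs):
    # re-create the iterator of the shorter loader by hand whenever it runs out
    dataloader_iterator = iter(dataloaders1)
    for i, data2 in enumerate(dataloaders2):
        try:
            data1 = next(dataloader_iterator)
        except StopIteration:
            dataloader_iterator = iter(dataloaders1)
            data1 = next(dataloader_iterator)
        do_cool_things(data1, data2)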
Bear in mind that if you use labels as well, you should replace data1 in this example with (inputs1, targets1) and data2 with (inputs2, targets2), as @Sajad Norouzi said.

KUDOS to this one: https://github.com/pytorch/pytorch/issues/1917#issuecomment-433698337
If you want to iterate over two datasets simultaneously, there is no need to define your own dataset class; just use TensorDataset, like below:
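A minimal sketch, assuming dataset1 and dataset2 are tensors (or have been stacked into tensors) with the same size along the first dimension:

import torch
from torch.utils.data import TensorDataset, DataLoader

dataset1 = torch.randn(100, 3, 224, 224)  # stand-in for the first set of images
dataset2 = torch.randn(100, 3, 224, 224)  # stand-in for the second set of images

paired_dataset = TensorDataset(dataset1, dataset2)
paired_loader = DataLoader(paired_dataset, batch_size=10, shuffle=True)

for inputs1, inputs2 in paired_loader:
    pass  # forward both batches through the two Siamese branches here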
If you want the labels, or to iterate over more than two datasets, just feed them as additional arguments to the TensorDataset after dataset2.