I have about 0.8 million 256x256 RGB images, which amount to over 7 GB.
I want to use them as training data for a Convolutional Neural Network and put them, along with their labels, into a cPickle file.
This is taking a lot of memory, to the point that the process starts swapping to disk and nearly fills it.
Is this a bad idea?
What would be a smarter, more practical way to load the images into the CNN, or to pickle them, without running into memory problems?
This is what the code looks like:
import numpy as np
import cPickle
from PIL import Image
import sys, os

pixels = []
labels = []
traindata = []
data = []

# walk the image directory and collect every .jpg as a flat pixel array
for subdir, dirs, files in os.walk('images'):
    for file in files:
        if file.endswith(".jpg"):
            floc = os.path.join(subdir, file)
            im = Image.open(floc)
            pix = np.array(im.getdata())
            pixels.append(pix)
            labels.append(1)

pixels = np.array(pixels)
labels = np.array(labels)
traindata.append(pixels)
traindata.append(labels)
traindata = np.array(traindata)
# ... do the same for validation and test data
# ... put all data and labels into the 'data' array
with open('data.pkl', 'wb') as f:
    cPickle.dump(data, f)
Yes, indeed.
You are trying to load 7 GB of compressed image data into memory all at once. Decompressed, that is roughly 157 GB of raw pixels (800k images * 256 * 256 * 3 bytes), and several times more if NumPy stores them as 64-bit integers, as np.array(im.getdata()) typically does. This will not work. You have to find a way to update your CNN image-by-image (or batch-by-batch), saving its state as you go along.
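One way to do that is to read the images lazily with a generator and feed the network in small batches. Here is a minimal sketch: the directory walk and the .jpg filter follow your code, but the batch size, the constant label, and the commented-out train/checkpoint calls at the bottom are placeholders, not real API calls.

import os
import numpy as np
from PIL import Image

def iter_image_batches(root, batch_size=128):
    # Yield (pixels, labels) one batch at a time instead of holding
    # the whole dataset in memory.
    batch_pixels, batch_labels = [], []
    for subdir, dirs, files in os.walk(root):
        for fname in files:
            if not fname.endswith(".jpg"):
                continue
            im = Image.open(os.path.join(subdir, fname))
            # uint8 keeps each 256x256 RGB image at ~196 KB instead of ~1.5 MB as int64
            batch_pixels.append(np.asarray(im, dtype=np.uint8))
            batch_labels.append(1)  # placeholder label, as in your code
            if len(batch_pixels) == batch_size:
                yield np.stack(batch_pixels), np.array(batch_labels)
                batch_pixels, batch_labels = [], []
    if batch_pixels:
        # last, possibly smaller batch
        yield np.stack(batch_pixels), np.array(batch_labels)

# sketch of the training loop: update the CNN batch-by-batch and checkpoint as you go
# for pixels, labels in iter_image_batches('images'):
#     model.train_on_batch(pixels, labels)   # hypothetical training call
#     save_checkpoint(model)                 # hypothetical checkpoint helper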
Also consider how large your CNN parameter set will be. Pickle is not designed for large amounts of data: if you need to store gigabytes' worth of neural-net data, you are much better off using a database. If the parameter set is only a few MB, though, pickle will be fine.
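As a rough illustration of the database direction, one minimal standard-library option is shelve, a dbm-backed key-value store that pickles each value individually. The key names and arrays below are purely illustrative, not your actual parameters.

import shelve
import numpy as np

# illustrative parameter arrays only
W1 = np.random.randn(32, 3, 5, 5).astype(np.float32)
b1 = np.zeros(32, dtype=np.float32)

store = shelve.open('cnn_params.db')  # file-backed store on disk
store['W1'] = W1                      # each value is pickled on assignment
store['b1'] = b1
store.close()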
You might also want to take a look at the documentation for pickle.HIGHEST_PROTOCOL, so you are not stuck with an old and unoptimized pickle file format.
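For the small-parameter-set case, passing the protocol explicitly looks like this; the params dict is a made-up few-MB example, not your real weights.

import cPickle
import numpy as np

# made-up, few-MB parameter set; this size is fine for pickle
params = {'W1': np.random.randn(32, 3, 5, 5).astype(np.float32),
          'b1': np.zeros(32, dtype=np.float32)}

with open('cnn_params.pkl', 'wb') as f:
    # use the newest protocol instead of the old default text format
    cPickle.dump(params, f, cPickle.HIGHEST_PROTOCOL)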