I am very new to ML on big data. I have played with the generic Keras convolutional examples for dog/cat classification before; however, when applying a similar approach to my own set of images, I run into memory issues.
My dataset consists of very large images, 10048 x 1687 pixels each. To circumvent the memory issues, I am using a batch size of 1, feeding one image at a time into the model.
The model has two convolutional layers, each followed by max pooling, which together bring the flattened layer to roughly 290,000 inputs right before the fully connected layer.
Immediately after running, however, memory usage chokes at its limit (8 GB).
So my questions are the following:
1) What is the best approach for processing computations of this size in Python locally (no cloud utilization)? Are there additional Python libraries that I need to utilize?
Check out what yield does in Python and the idea of generators. You do not need to load all of your data at the beginning. You should make your batch_size just small enough that you do not get memory errors.
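A minimal sketch of the mechanism with toy numbers (nothing to do with your data): a generator computes values lazily, so only the current batch ever has to exist in memory.

def index_batches(n_samples, batch_size):
    # Yields (start, end) index pairs one at a time; nothing is
    # materialized until the caller asks for the next pair.
    start = 0
    while start < n_samples:
        yield start, min(start + batch_size, n_samples)
        start += batch_size

for start, end in index_batches(10, 4):
    print(start, end)  # prints 0 4, then 4 8, then 8 10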
Your generator can look like this:
def generator(fileobj, labels, batch_size, memory_one_pic=1024):
    # Parameters with defaults must come last, so memory_one_pic
    # goes after batch_size.
    amount_of_datasets = len(labels)
    start = 0
    end = start + batch_size
    while True:
        # Read only the bytes for one batch of pictures.
        X_batch = fileobj.read(memory_one_pic * batch_size)
        y_batch = labels[start:end]
        start += batch_size
        end += batch_size
        if not X_batch:
            break
        if start >= amount_of_datasets:
            # Wrap around so the generator can serve multiple epochs.
            start = 0
            end = batch_size
        yield (X_batch, y_batch)
...later, when you already have your architecture ready...

train_generator = generator(open('traindata.csv', 'rb'), labels, batch_size)
train_steps = len(labels) // batch_size + 1

model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_steps,
                    epochs=epochs)
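For context, here is a minimal sketch of how such a model could be wired up before the fit_generator call; the filter counts, channel count, and input shape are assumptions based on your description, not your actual architecture:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Assumed input shape: (height, width, channels).
    Conv2D(32, (3, 3), activation='relu', input_shape=(1687, 10048, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    # With images this large, the Flatten output (and therefore the
    # weight matrix of the next Dense layer) is where memory goes.
    Flatten(),
    Dense(1, activation='sigmoid'),  # binary classification head
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])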
You should also read about batch_normalization, which normalizes the activations of each layer and in practice helps the network train faster and often reach better accuracy.
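In Keras this is a single layer; a sketch of the usual placement (the surrounding layers here are just placeholders):

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Activation

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(256, 256, 3)))  # placeholder shape
model.add(BatchNormalization())  # normalize the conv activations
model.add(Activation('relu'))    # activation applied after normalization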
While using fit_generator(), you should also set the max_q_size parameter. It is 10 by default, which means you are loading 10 batches while using only 1 (fit_generator was designed to stream data from outside sources that can be delayed, like a network, not to save memory). I'd recommend setting max_q_size=1 for your purposes.
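Concretely, in Keras versions whose fit_generator still takes max_q_size (newer versions renamed it to max_queue_size), the call above becomes:

model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_steps,
                    epochs=epochs,
                    max_q_size=1)  # keep only one batch queued in memory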