I am very new to ML on big data. I have played with the generic Keras convolutional examples for dog/cat classification before; however, when applying a similar approach to my own set of images, I run into memory issues.
My dataset consists of very large images, 10048 x 1687 pixels each. To circumvent the memory issues, I am using a batch size of 1, feeding one image at a time into the model.
The model has two convolutional layers, each followed by max pooling, which together bring the flattened layer to roughly 290,000 inputs right before the fully connected layer.
Immediately after running, however, memory usage chokes at its limit (8 GB).
So my questions are the following:
1) What is the best approach for processing computations of this size in Python locally (no cloud utilization)? Are there additional Python libraries that I need to utilize?
Check out what yield does in Python and the idea of generators. You do not need to load all of your data at the beginning. You should make your batch_size just small enough that you do not get memory errors.
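A minimal sketch of the mechanism with toy numbers (nothing to do with your data): a generator computes values lazily, so only the current batch ever has to exist in memory.

def index_batches(n_samples, batch_size):
    # Yields (start, end) index pairs one at a time; nothing is
    # materialized until the caller asks for the next pair.
    start = 0
    while start < n_samples:
        yield start, min(start + batch_size, n_samples)
        start += batch_size

for start, end in index_batches(10, 4):
    print(start, end)  # prints 0 4, then 4 8, then 8 10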
Your generator can look like this:
def generator(fileobj, labels, batch_size, memory_one_pic=1024):
    # Parameters with defaults must come last, so memory_one_pic
    # goes after batch_size.
    amount_of_datasets = len(labels)
    start = 0
    end = start + batch_size
    while True:
        # Read only the bytes for one batch of pictures.
        X_batch = fileobj.read(memory_one_pic * batch_size)
        y_batch = labels[start:end]
        start += batch_size
        end += batch_size
        if not X_batch:
            break
        if start >= amount_of_datasets:
            # Wrap around so the generator can serve multiple epochs.
            start = 0
            end = batch_size
        yield (X_batch, y_batch)
...later, when you already have your architecture ready...

train_generator = generator(open('traindata.csv', 'rb'), labels, batch_size)
train_steps = len(labels) // batch_size + 1

model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_steps,
                    epochs=epochs)
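For context, here is a minimal sketch of how such a model could be wired up before the fit_generator call; the filter counts, channel count, and input shape are assumptions based on your description, not your actual architecture:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Assumed input shape: (height, width, channels).
    Conv2D(32, (3, 3), activation='relu', input_shape=(1687, 10048, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    # With images this large, the Flatten output (and therefore the
    # weight matrix of the next Dense layer) is where memory goes.
    Flatten(),
    Dense(1, activation='sigmoid'),  # binary classification head
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])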
You should also read about batch_normalization, which normalizes the activations of each layer and in practice helps the network train faster and often reach better accuracy.
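In Keras this is a single layer; a sketch of the usual placement (the surrounding layers here are just placeholders):

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Activation

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(256, 256, 3)))  # placeholder shape
model.add(BatchNormalization())  # normalize the conv activations
model.add(Activation('relu'))    # activation applied after normalization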
While using fit_generator(), you should also set the max_q_size parameter. It is 10 by default, which means you are loading 10 batches while using only 1 (fit_generator was designed to stream data from outside sources that can be delayed, like a network, not to save memory). I'd recommend setting max_q_size=1 for your purposes.
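Concretely, in Keras versions whose fit_generator still takes max_q_size (newer versions renamed it to max_queue_size), the call above becomes:

model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_steps,
                    epochs=epochs,
                    max_q_size=1)  # keep only one batch queued in memory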