I have a dataset of 65668 files.
I am using Keras for a CNN, and these are my layers:
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(256, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)
The embedding layer is initialized with pre-trained GloVe (glove.6B.100d) vectors.
Fitting the data:
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=20, batch_size=128)
The MAX_SEQUENCE_LENGTH is 500.
I am training on the GPU, an Nvidia GeForce 940MX, and I get the following error as part of the stack trace:
Resource exhausted: OOM when allocating tensor with shape[15318793,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
I tried reducing the batch size to 16, even 8, and I still get the same error. What could the issue be?
The problem lies in your Embedding layer. It needs to allocate a matrix of size 15318793 * 100 * 4 bytes ≈ 5.7 GB, which is definitely more than the GeForce 940MX's memory. There are a few ways you could overcome this issue:
Decrease the vocabulary/corpus size: try taking, e.g., the 1M most frequent words instead of the full word set. This will drastically decrease the embedding matrix size.
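For instance, a minimal sketch using the standard Keras Tokenizer workflow (here texts stands for your raw corpus and embeddings_index for a word-to-GloVe-vector dict; both are assumed from your preprocessing, not shown in the question):

import numpy as np
from keras.preprocessing.text import Tokenizer

MAX_NB_WORDS = 1000000  # keep only the 1M most frequent words
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)  # rarer words are simply dropped

# the embedding matrix shrinks from 15318793 rows to at most MAX_NB_WORDS + 1
num_words = min(MAX_NB_WORDS, len(tokenizer.word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    if i >= num_words:
        continue
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # words not in GloVe stay all-zeros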
Use a generator instead of the Embedding layer: rather than an Embedding layer, you could use a generator that transforms your index sequences into sequences of word vectors on the fly.
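A hedged sketch of what that could look like (the name embedded_batches and the batching logic are illustrative; it assumes x_train holds padded index sequences and embedding_matrix stays in CPU memory as a NumPy array):

import numpy as np

def embedded_batches(index_sequences, labels, batch_size=128):
    while True:  # Keras expects fit generators to loop indefinitely
        for i in range(0, len(index_sequences), batch_size):
            batch = index_sequences[i:i + batch_size]
            # NumPy fancy indexing maps (batch, MAX_SEQUENCE_LENGTH) indices
            # to (batch, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM) word vectors
            yield embedding_matrix[batch], labels[i:i + batch_size]

# the model then starts from pre-embedded input, with no Embedding layer:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM))
x = Conv1D(128, 5, activation='relu')(sequence_input)
# ... rest of the network unchanged ...
model.fit_generator(embedded_batches(x_train, y_train, 128),
                    steps_per_epoch=len(x_train) // 128,
                    epochs=20)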
Use a linear transformation of the Embedding instead of retraining your embedding: since you mentioned that setting trainable=False made your algorithm work, you can keep trainable=False and add:
Dense(new_embedding_size, activation='linear')(embedding)
to train a new embedding on top of the existing one.
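Put together, that option might look like this (a sketch; new_embedding_size is a value of your choosing, and on older Keras versions the Dense may need to be wrapped in TimeDistributed, as in the answer further below):

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)  # frozen pre-trained GloVe weights
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding = embedding_layer(sequence_input)
# trainable linear projection on top of the frozen embeddings;
# Dense applied to a 3D tensor acts on the last axis
x = Dense(new_embedding_size, activation='linear')(embedding)
x = Conv1D(128, 5, activation='relu')(x)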
Change device: if you have plenty of RAM, you can try the following strategy:
with tf.device('/cpu:0'):
    embedding_layer = Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
In this design, the computations of the Embedding layer are carried out on the CPU using main RAM. The downside is that the transfer between RAM and GPU memory might be really slow.
It is very unlikely that your dataset is large enough to cover all the words in the GloVe embeddings. If you make the embedding trainable, training will only update the fraction of embeddings your data actually touches: those will move to a slightly different space, while the untouched ones remain in the original GloVe space. Try setting trainable=False and fix the problem by performing a linear transformation instead, such as:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = TimeDistributed(Dense(EMBEDDING_DIM))(embedded_sequences)
x = Conv1D(128, 5, activation='relu')(x)
as the other answer suggested.
This is important because if you use the model for inference in production, one of the untouched embeddings could make it behave quite erratically. The linear transformation shifts the whole embedding space uniformly, so it should ideally generalize acceptably to unseen data.