I am currently developing a text classification tool using Keras. It works fine (I got up to 98.7% validation accuracy), but I can't wrap my head around how exactly a 1D convolution layer works with text data.
What hyper-parameters should I use?
I have the following input data (sentences):
- Maximum words in the sentence: 951 (shorter sentences are padded to this length; see the sketch after this list)
- Vocabulary size: ~32000
- Number of sentences (for training): 9800
- embedding_vecor_length: 32 (the dimensionality of each word's embedding vector)
- batch_size: 37 (it doesn't matter for this question)
- Number of labels (classes): 4
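As a minimal sketch of the padding step mentioned above (assuming the sentences are already integer-encoded; encoded_sentences is a hypothetical name):

from keras.preprocessing.sequence import pad_sequences

# By default this prepends zeros to every sentence shorter than 951 words,
# so all inputs share the same length.
X = pad_sequences(encoded_sentences, maxlen=951)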
It's a very simple model (I have tried more complicated architectures but, strangely, this one works better, even without an LSTM):
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(labels_count, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
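For reference, with the numbers above, model.summary() should report shapes roughly like this (a sketch assuming top_words=32000, embedding_vecor_length=32, max_review_length=951 and labels_count=4; exact layer names vary):

embedding (Embedding)         (None, 951, 32)    1024000
conv1d (Conv1D)               (None, 951, 32)    2080
max_pooling1d (MaxPooling1D)  (None, 475, 32)    0
flatten (Flatten)             (None, 15200)      0
dense (Dense)                 (None, 4)          60804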
My main question is: What hyper-parameters should I use for the Conv1D layer?
model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))
If I have the following input data:
- Max word count: 951
- Word-embeddings dimension: 32
Does it mean that filters=32 will only scan the first 32 words, completely discarding the rest (with kernel_size=2)? And that I should set filters to 951 (the maximum number of words in a sentence)?
Examples as images:
So, for instance, this is the input data: http://joxi.ru/krDGDBBiEByPJA
This is the first step of a convolution layer (stride 2): http://joxi.ru/Y2LB099C9dWkOr
And this is the second step (stride 2): http://joxi.ru/brRG699iJ3Ra1m
And if filters = 32, does the layer repeat this 32 times? Am I correct?
So I won't get to, say, the 156th word in the sentence, and that information will be lost?
I will try to explain how 1D convolution is applied to sequence data. I use the example of a sentence consisting of words, but obviously this is not specific to text data; it is the same with other sequence data and time series.
Suppose we have a sentence consisting of m words, where each word is represented by a word-embedding vector:

[figure: the sentence as an m × emb_dim matrix of word-embedding vectors]

Now we would like to apply a 1D convolution layer consisting of n different filters with a kernel size of k to this data. To do so, sliding windows of length k are extracted from the data, and then each filter is applied to each of those extracted windows. Here is an illustration of what happens (here I have assumed k=3 and dropped the bias parameter of each filter for simplicity):

[figure: a filter of shape k × emb_dim applied to a window of k consecutive word vectors]

As you can see in the figure above, the response of each filter is equivalent to the result of its dot product (i.e. element-wise multiplication and then summing all the results) with the extracted window of length k (i.e. the i-th to (i+k-1)-th words of the given sentence). Further, note that each filter has the same number of channels as the number of features (i.e. the word-embedding dimension) of the training sample (hence performing the dot product is possible). Essentially, each filter detects the presence of a particular feature or pattern in a local window of the training data (e.g. whether a couple of specific words appear in that window or not).
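To convince yourself that a filter's response really is just this dot product, here is a minimal sketch you can run (toy sizes of my own choosing; the bias is disabled to match the figures above):

import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D

m, emb_dim, n, k = 10, 4, 2, 3         # toy sizes: words, embedding dim, filters, kernel
x = np.random.rand(1, m, emb_dim)      # one "sentence" of m word vectors

model = Sequential()
model.add(Conv1D(filters=n, kernel_size=k, use_bias=False, input_shape=(m, emb_dim)))
out = model.predict(x)                 # shape: (1, m-k+1, n)

W = model.layers[0].get_weights()[0]   # filter weights, shape: (k, emb_dim, n)
# Response of filter 0 on the first window: element-wise product of the first
# k word vectors with the filter weights, then summed.
manual = np.sum(x[0, :k, :] * W[:, :, 0])
print(np.allclose(out[0, 0, 0], manual))  # should print True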
After all the filters have been applied to all the windows of length k, we get an output like this, which is the result of the convolution:

[figure: the convolution output of shape (m-k+1) × n]

As you can see, there are m-k+1 windows in the figure, since we have assumed padding='valid' and stride=1 (the default behavior of the Conv1D layer in Keras). The stride argument determines how much the window should slide (i.e. shift) to extract the next window (e.g. in our example above, a stride of 2 would instead extract the windows of words (1,2,3), (3,4,5), (5,6,7), ...). The padding argument determines whether the windows should consist entirely of words from the training sample, or whether there should be padding at the beginning and at the end; with padding, the convolution response can have the same length as the training sample (i.e. m and not m-k+1). For example, in our case padding='same' would extract the windows of words (PAD,1,2), (1,2,3), (2,3,4), ..., (m-2,m-1,m), (m-1,m,PAD).
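You can check the effect of these two arguments on the output length directly (a sketch, using the same m=20, emb_dim=100 and k=3 as the verification below):

from keras.models import Sequential
from keras.layers import Conv1D

m, emb_dim, k = 20, 100, 3
for pad, stride in [('valid', 1), ('same', 1), ('valid', 2)]:
    model = Sequential()
    model.add(Conv1D(filters=32, kernel_size=k, padding=pad, strides=stride,
                     input_shape=(m, emb_dim)))
    # Output lengths: 'valid'/1 -> m-k+1 = 18, 'same'/1 -> m = 20, 'valid'/2 -> 9
    print(pad, stride, model.output_shape)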
You can verify some of the things I mentioned using Keras:
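For instance, a minimal model matching the numbers below (the vocabulary size of 10000 is an arbitrary choice; m=20 words, emb_dim=100, n=32 filters, k=3):

from keras.models import Sequential
from keras.layers import Embedding, Conv1D

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=100, input_length=20))  # emb_dim=100, m=20
model.add(Conv1D(filters=32, kernel_size=3))  # n=32, k=3, padding='valid' by default
model.summary()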
Model summary:
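With the sketch above, it should look roughly like this (exact layer names vary by Keras version):

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 20, 100)           1000000
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 18, 32)            9632
=================================================================
Total params: 1,009,632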
As you can see, the output of the convolution layer has a shape of (m-k+1, n) = (18, 32), and the number of parameters (i.e. filter weights) in the convolution layer is equal to:

num_filters * (kernel_size * n_features) + one_bias_per_filter
  = n * (k * emb_dim) + n
  = 32 * (3 * 100) + 32
  = 9632