How should we pad text sequence in keras using pad

2020-07-08 07:10发布

问题:

I have coded a sequence to sequence learning LSTM in keras myself using the knowledge gained from the web tutorials and my own intuitions. I converted my sample text to sequences and then padded using pad_sequence function in keras.

from keras.preprocessing.text import Tokenizer,base_filter
from keras.preprocessing.sequence import pad_sequences

def shift(seq, n):
    n = n % len(seq)
    return seq[n:] + seq[:n]

txt="abcdefghijklmn"*100

tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
tk.fit_on_texts(txt)
x = tk.texts_to_sequences(txt)
#shifing to left
y = shift(x,1)

#padding sequence
max_len = 100
max_features=len(tk.word_counts)
X = pad_sequences(x, maxlen=max_len)
Y = pad_sequences(y, maxlen=max_len)

After a carefully inspection I found my padded sequence looks like this

>>> X[0:6]
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7]], dtype=int32)
>>> X
array([[ 0,  0,  0, ...,  0,  0,  1],
       [ 0,  0,  0, ...,  0,  0,  3],
       [ 0,  0,  0, ...,  0,  0,  2],
       ..., 
       [ 0,  0,  0, ...,  0,  0, 13],
       [ 0,  0,  0, ...,  0,  0, 12],
       [ 0,  0,  0, ...,  0,  0, 14]], dtype=int32)

Is the padded sequence suppose to look like this? Except the last column in the array the rest are all zeros. I think I made some mistake in padding the text to sequence and if so can you tell me where I made the error?

回答1:

If you want to tokenize by char, you can do it manually, it's not too complex:

First build a vocabulary for your characters:

txt="abcdefghijklmn"*100
vocab_char = {k: (v+1) for k, v in zip(set(txt), range(len(set(txt))))}
vocab_char['<PAD>'] = 0

This will associate a distinct number for every character in your txt. The character with index 0 should be preserved for the padding.

Having the reverse vocabulary will be usefull to decode the output.

rvocab = {v: k for k, v in vocab.items()}

Once you have this, you can first split your text into sequences, say you want to have sequences of length seq_len = 13 :

[[vocab_char[char] for char in txt[i:(i+seq_len)]] for i in range(0,len(txt),seq_len)]

your output will look like :

[[9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4, 3], 
 [14, 9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4],
 ...,
 [2, 1, 5, 13, 11, 4, 3, 14, 9, 12, 6, 10, 8], 
 [7, 2, 1, 5, 13, 11, 4, 3, 14]]

Note that the last sequence doesn't have the same length, you can discard it or pad your sequence to max_len = 13, it will add 0's to it.

You can build your targets Y the same way, by shifting everything by 1. :-)

I hope this helps.



回答2:

The problem is in this line:

tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")

When you set such split (by " "), due to nature of your data, you'll get each sequence consisting of a single word. That's why your padded sequences have only one non-zero element. To change that try:

txt="a b c d e f g h i j k l m n "*100


回答3:

The argument padding controls padding either before or after each sequence. Use like this:

X = pad_sequences(x, maxlen=max_len, padding='post')
Y = pad_sequences(y, maxlen=max_len, padding='post')