>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_index
{'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11, 'other': 9, 'so': 5, 'world': 1, 'hello': 4}
I'd have expected t.word_index to have just the top 3 words. What am I doing wrong?
Limiting num_words to a small number (e.g., 3) has no effect on fit_on_texts outputs such as word_index, word_counts, and word_docs. It does affect texts_to_matrix: the resulting matrix will have num_words (3) columns.
Just an add-on to Marcin's answer ("it will keep the counter of all words - even when it's obvious that it will not use it later.").
The reason it keeps a counter for all words is that you can call fit_on_texts multiple times. Each call updates the internal counters, and when a transformation is called, it uses the top words based on the updated counters. Hope it helps.
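To make that concrete, here is a minimal pure-Python sketch of the behaviour. Note that TinyTokenizer is a hypothetical stand-in I wrote for illustration, not Keras' actual code, and the regex-based word splitting is my own simplification of the default filters. It only shows the idea: counters grow on every fit_on_texts call, and num_words is applied only at transformation time.

```python
import re
from collections import Counter


class TinyTokenizer:
    """Hypothetical, simplified stand-in for Keras' Tokenizer (not its real code)."""

    def __init__(self, num_words=None):
        self.num_words = num_words
        self.word_counts = Counter()

    @staticmethod
    def _split(text):
        # Assumption: lowercase and keep runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def fit_on_texts(self, texts):
        # Counts accumulate across calls; num_words is not consulted here.
        for text in texts:
            self.word_counts.update(self._split(text))

    @property
    def word_index(self):
        # Every word seen so far gets an index, most frequent first, starting at 1.
        ranked = sorted(self.word_counts.items(), key=lambda kv: -kv[1])
        return {word: i for i, (word, _) in enumerate(ranked, start=1)}

    def texts_to_sequences(self, texts):
        # The limit is applied only here: indices >= num_words are dropped.
        index = self.word_index
        sequences = []
        for text in texts:
            ids = [index[w] for w in self._split(text) if w in index]
            sequences.append(
                [i for i in ids if self.num_words is None or i < self.num_words]
            )
        return sequences


t = TinyTokenizer(num_words=3)
t.fit_on_texts(["Hello, World! This is so&#$ fantastic!"])
t.fit_on_texts(["There is no other world like this one"])  # second call updates the counts

seqs = t.texts_to_sequences(["There is no other world like this one"])
print(len(t.word_index))  # 11: the full vocabulary is still indexed
print(seqs)               # [[1, 2]]: only 'world' and 'this' survive the cutoff
```

The full word_index is kept because a later fit_on_texts call could change which words rank highest, so the cutoff can only be decided when a transformation is requested.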
There is nothing wrong with what you are doing. word_index is computed the same way no matter how many of the most frequent words you will use later (as you may see here). So when you call any transformative method, Tokenizer will use only the most common words (those whose index is below num_words, so the top two when num_words=3) and at the same time, it will keep the counter of all words - even when it's obvious that it will not use it later.