Python: What is the “size” parameter in Gensim Wor

2020-07-10 10:40发布

问题:

I have been struggling to understand the use of size parameter in the gensim.models.Word2Vec

From the Gensim documentation, size is the dimensionality of the vector. Now, as far as my knowledge goes, word2vec creates a vector of the probability of closeness with the other words in the sentence for each word. So, suppose if my vocab size is 30 then how does it create a vector with the dimension greater than 30? Can anyone please brief me on the optimal value of Word2Vec size?

Thank you.

回答1:

size is, as you note, the dimensionality of the vector.

Word2Vec needs large, varied text examples to create its 'dense' embedding vectors per word. (It's the competition between many contrasting examples during training which allows the word-vectors to move to positions that have interesting distances and spatial-relationships with each other.)

If you only have a vocabulary of 30 words, word2vec is unlikely an appropriate technology. And if trying to apply it, you'd want to use a vector size much lower than your vocabulary size – ideally much lower. For example, texts containing many examples of each of tens-of-thousands of words might justify 100-dimensional word-vectors.

Using a higher dimensionality than vocabulary size would more-or-less guarantee 'overfitting'. The training could tend toward an idiosyncratic vector for each word – essentially like a 'one-hot' encoding – that would perform better than any other encoding, because there's no cross-word interference forced by representing a larger number of words in a smaller number of dimensions.

That'd mean a model that does about as well as possible on the Word2Vec internal nearby-word prediction task – but then awful on other downstream tasks, because there's been no generalizable relative-relations knowledge captured. (The cross-word interference is what the algorithm needs, over many training cycles, to incrementally settle into an arrangement where similar words must be similar in learned weights, and contrasting words different.)