Is there a library that will take a list of documents and compute, en masse, the n×n matrix of distances, given a supplied word2vec model? I can see that gensim lets you compute the distance between two documents, but I need a fast comparison across all docs, like sklearn's cosine_similarity.
The "Word Mover's Distance" (earth-mover's distance applied to groups of word-vectors) is a fairly involved optimization calculation dependent on every word in each document.
I'm not aware of any tricks that would help it go faster when calculating many at once – even many distances to the same document.
So the only thing needed to calculate the pairwise distances is a pair of nested loops that considers each (order-ignoring) unique pairing.
For example, assuming your list of documents (each a list-of-words) is `docs`, your gensim word-vector model is in `model`, and `numpy` is imported as `np`, you can fill an n×n array `D` by looping over every unique pair, computing the distance once, and mirroring it across the diagonal. It may take a while, but you'll then have all pairwise distances in array `D`.
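A minimal sketch of those nested loops, written as a helper function for clarity (the function name `pairwise_wmd` is my own; `model` is assumed to be a gensim `KeyedVectors` instance, whose real `wmdistance` method performs the Word Mover's Distance calculation):

```python
import numpy as np

def pairwise_wmd(docs, model):
    """Compute the symmetric n x n matrix of Word Mover's Distances.

    docs: list of tokenized documents (each a list of words)
    model: a gensim KeyedVectors instance; its wmdistance() method
           computes the WMD between two word lists
    """
    n = len(docs)
    D = np.zeros((n, n))  # diagonal stays 0: distance from a doc to itself
    for i in range(n):
        for j in range(i + 1, n):  # only unique, order-ignoring pairs
            d = model.wmdistance(docs[i], docs[j])
            D[i, j] = d
            D[j, i] = d  # WMD is symmetric, so mirror the value
    return D
```

Because each of the n·(n-1)/2 pairs requires a separate optimization, the cost grows quadratically with the number of documents; computing each distance only once and mirroring it halves the work compared to a naive full double loop.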