I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix. I can get the document-term matrix, but I am not sure how to go about obtaining a word-word matrix of co-occurrences.
You can use the `ngram_range` parameter in `CountVectorizer` or `TfidfVectorizer`.
Code example:
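The original snippet did not survive here, so the following is only a minimal sketch of the idea. The documents in `docs` are made up for illustration; `ngram_range=(2, 2)` asks `CountVectorizer` to extract bigrams (pairs of adjacent words) only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents, chosen only for illustration.
docs = ["awesome unicorns are awesome", "batman forever and ever"]

# ngram_range=(2, 2) keeps only bigrams, i.e. pairs of adjacent words.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(docs)

# vocabulary_ maps each extracted bigram to its column index in X.
bigrams = sorted(bigram_vectorizer.vocabulary_)
print(bigrams)
print(X.toarray())  # one row per document, one column per bigram
```

Note that this counts only adjacent word pairs, not every pair of words that appears in the same document.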
In case you want to explicitly say which co-occurrences of words you want to count, use the `vocabulary` param, i.e.: `vocabulary = {'awesome unicorns': 0, 'batman forever': 1}`
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
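A sketch of how the `vocabulary` parameter might be used to count only predefined bigrams. The contents of `samples` are an assumption (the original data is not shown), and the names `totals`, `terms` and `result` are introduced here just for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Assumed sample data; the original samples are not shown in the text.
samples = [
    "awesome unicorns are awesome",
    "batman forever and ever",
    "I love batman forever",
]

# Only the bigrams listed in the vocabulary are counted; the values are column indices.
vocabulary = {"awesome unicorns": 0, "batman forever": 1}
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary=vocabulary)
co_occurrences = bigram_vectorizer.fit_transform(samples)

# Sum each column over all documents to get the total count per bigram.
totals = np.asarray(co_occurrences.sum(axis=0)).ravel()
terms = sorted(vocabulary, key=vocabulary.get)  # terms ordered by column index
result = list(zip(terms, totals.tolist()))
print(result)
```

With these assumed samples, `awesome unicorns` occurs once and `batman forever` twice.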
Self-explanatory and ready-to-use code with predefined word-word co-occurrences. In this case we are tracking co-occurrences of `awesome unicorns` and `batman forever`. The final output is `('awesome unicorns', 1), ('batman forever', 2)`, which corresponds exactly to our `samples`
provided data.
Here is my example solution using `CountVectorizer` in scikit-learn. Referring to this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix. You can also look up the dictionary of words in `count_model`. Or, if you want, you can normalize by the diagonal component (as in the answer in the previous post).
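The matrix-multiplication approach could look roughly like the sketch below. The documents in `docs` are hypothetical, and `Xc`, `Xc_norm` and `Xc_no_self` are illustrative names; `count_model` follows the name used in the answer. If `X` is the document-term matrix, then entry (i, j) of `X.T @ X` sums, over all documents, the product of the counts of word i and word j.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, for illustration only.
docs = ["this this this book", "this cat good", "cat good book"]

count_model = CountVectorizer(ngram_range=(1, 1))  # unigrams (the default)
X = count_model.fit_transform(docs)  # document-term matrix (n_docs x n_words)

# Word-word co-occurrence matrix (sparse, n_words x n_words).
Xc = X.T @ X

# Mapping from each word to its row/column index in Xc.
print(count_model.vocabulary_)

# Normalize every row by its diagonal entry.
Xc_norm = sp.diags(1.0 / Xc.diagonal()) @ Xc

# Optionally zero out the diagonal (a word co-occurring with itself).
Xc_no_self = Xc.tolil()
Xc_no_self.setdiag(0)

print(Xc.toarray())
print(Xc_norm.toarray())
print(Xc_no_self.toarray())
```

Because the raw counts are multiplied, a word repeated many times in one document inflates its co-occurrence with every other word in that document.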
As an extra note on @Federico Caccia's answer: if you don't want spurious co-occurrences that come from repetition within a single text, set every count greater than 1 to 1 before computing the product.
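One way this binarization might be done is sketched below, with hypothetical documents and illustrative names (`Y`, `Y_alt`, `Yc`). `binary=True` is an existing `CountVectorizer` option that sets all non-zero counts to 1 and should give the same matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, for illustration only.
docs = ["this this this book", "this cat good", "cat good book"]

X = CountVectorizer().fit_transform(docs)

# Replace every non-zero count with 1, so repetition inside a text is ignored.
Y = (X > 0).astype(int)

# Alternative: let the vectorizer binarize the counts itself.
Y_alt = CountVectorizer(binary=True).fit_transform(docs)

# Entry (i, j) now counts the documents that contain both word i and word j.
Yc = Y.T @ Y
print(Yc.toarray())
```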
@titipata I think your solution is not a good metric because it gives the same weight to real co-occurrences and to occurrences that are just spurious. For example, suppose I have 5 texts and the words apple and house appear with these frequencies:
text1: apple: 10, house: 1
text2: apple: 10, house: 0
text3: apple: 10, house: 0
text4: apple: 10, house: 0
text5: apple: 10, house: 0
The co-occurrence we are going to measure is 10*1+10*0+10*0+10*0+10*0=10, but it is just spurious.
And in other important cases, like the following:
text1: apple: 1, banana: 1
text2: apple: 1, banana: 1
text3: apple: 1, banana: 1
text4: apple: 1, banana: 1
text5: apple: 1, banana: 1
we are going to get a co-occurrence of just 1*1+1*1+1*1+1*1+1*1=5, when in fact that co-occurrence is really important.
@Guiem Bosch In this case co-occurrences are measured only when the two words are contiguous.
I propose using something like @titipata's solution to compute the matrix, but, instead of X, using a matrix Y with ones in the positions where X is greater than 0 and zeros everywhere else.
Using this, in the first example we get a co-occurrence of 1*1+1*0+1*0+1*0+1*0=1, and in the second example a co-occurrence of 1*1+1*1+1*1+1*1+1*1=5, which is what we are really looking for.
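The arithmetic for the two examples can be checked with a small NumPy sketch. The arrays hold the per-text counts listed above (one row per text, one column per word); the helper `cooccurrence` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Per-text counts from the two examples (rows = text1..text5).
apple_house = np.array([[10, 1], [10, 0], [10, 0], [10, 0], [10, 0]])
apple_banana = np.array([[1, 1]] * 5)

def cooccurrence(X, binarize=False):
    """Off-diagonal entry of M.T @ M, where M is X or its binarized form Y."""
    M = (X > 0).astype(int) if binarize else X
    return int((M.T @ M)[0, 1])

raw_house = cooccurrence(apple_house)                   # uses the counts X
bin_house = cooccurrence(apple_house, binarize=True)    # uses the 0/1 matrix Y
raw_banana = cooccurrence(apple_banana)
bin_banana = cooccurrence(apple_banana, binarize=True)

print(raw_house, bin_house)    # apple/house: raw counts vs. binarized
print(raw_banana, bin_banana)  # apple/banana: raw counts vs. binarized
```

With raw counts the spurious apple/house pair scores higher than apple/banana; with the binarized matrix the ordering is reversed.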