I have a list of names like:
names = ['A', 'B', 'C', 'D']
and a list of documents, each of which mentions some of these names:
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
I would like to get an output as a matrix of co-occurrences like:
  A B C D
A 0 2 1 1
B 2 0 2 1
C 1 2 0 1
D 1 1 1 0
There is a solution (Creating co-occurrence matrix) for this problem in R, but I couldn't do it in Python. I am thinking of doing it in Pandas, but I have made no progress yet.
Another option is to use the constructor
csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
of scipy.sparse.csr_matrix, where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k]. The trick is to generate row_ind and col_ind by iterating over the documents and creating a list of tuples (doc_id, word_id). data would simply be a vector of ones of the same length. Multiplying the transpose of the docs-words matrix by the docs-words matrix itself would give you the co-occurrence matrix (words by words); the diagonal then holds each word's document count and can be set to zero.
Additionally, this is efficient in terms of both run time and memory usage, so it should also handle big corpora.
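A minimal sketch of the steps described above, using the names and documents from the question. The helper names (name_to_col, docs_words, cooc_dense) are my own, and two choices are assumptions on my part: tokens outside names (K, Z) are ignored, and a name is counted at most once per document.

```python
import numpy as np
from scipy.sparse import csr_matrix

names = ['A', 'B', 'C', 'D']
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

# Map each name of interest to a column index; other tokens are skipped.
name_to_col = {name: j for j, name in enumerate(names)}

# One (doc_id, word_id) tuple per distinct known name in each document.
pairs = [(i, name_to_col[w])
         for i, doc in enumerate(document)
         for w in set(doc) if w in name_to_col]
row_ind, col_ind = zip(*pairs)
data = np.ones(len(pairs), dtype=int)

# Sparse docs-by-words indicator matrix.
docs_words = csr_matrix((data, (row_ind, col_ind)),
                        shape=(len(document), len(names)))

# words-by-words product: entry (i, j) counts documents containing both names.
cooc = docs_words.T @ docs_words

# Converted to dense only for display; keep it sparse for a large vocabulary.
cooc_dense = cooc.toarray()
np.fill_diagonal(cooc_dense, 0)  # drop each name's count with itself
print(cooc_dense)
```

For the question's data this reproduces the matrix asked for, with rows and columns in the order of names.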
Run example:
Output:
I was facing the same issue, so I came up with this code. It takes a context window into account and then builds the co-occurrence matrix.
Hope this helps you.
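The context-window idea can be sketched roughly like this. This is not the original code, only an illustration: the function name window_cooccurrence and its parameters are hypothetical, and I am assuming the window counts positions to the right of each word, with the matrix kept symmetric.

```python
def window_cooccurrence(tokens_per_doc, vocab, window=2):
    """Count pairs of vocab words that appear within `window` positions of each other."""
    vocab_set = set(vocab)
    # Nested dict acting as a symmetric matrix, initialised to zero.
    counts = {a: {b: 0 for b in vocab} for a in vocab}
    for tokens in tokens_per_doc:
        for i, w in enumerate(tokens):
            if w not in vocab_set:
                continue
            # Look only forward and update both directions, so each pair
            # of positions is counted exactly once.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                c = tokens[j]
                if c in vocab_set and c != w:
                    counts[w][c] += 1
                    counts[c][w] += 1
    return counts


names = ['A', 'B', 'C', 'D']
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
counts = window_cooccurrence(document, names, window=2)
for a in names:
    print(a, [counts[a][b] for b in names])
```

Unlike the whole-document count in the question, a window of 2 gives A and D a count of 0 here, because they are three positions apart in the last document.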
You can also use matrix tricks to find the co-occurrence matrix. Hope this works well when you have a bigger vocabulary.
Now you can find the co-occurrence matrix by simply multiplying X.T with X.
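As a rough illustration of the X.T times X trick, here is a small dense version. I am assuming X is a binary documents-by-words matrix; here it is built by hand with NumPy from the question's data, which may differ from how X was produced originally.

```python
import numpy as np

docs = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
vocab = ['A', 'B', 'C', 'D']

# X[i, j] is 1 if vocab[j] occurs in docs[i], else 0.
X = np.array([[int(w in doc) for w in vocab] for doc in docs])

# (X.T @ X)[i, j] = number of documents that contain both vocab[i] and vocab[j].
cooc = X.T @ X

# Before zeroing, the diagonal holds how many documents mention each word.
doc_freq = np.diag(cooc).copy()
np.fill_diagonal(cooc, 0)

print(doc_freq)
print(cooc)
```

For a large vocabulary a dense array like this gets expensive, so a sparse matrix would be the better fit there.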
Obviously this can be extended for your purposes, but it performs the general operation in mind:
For a window of 2, data_corpus is the series consisting of the text data, and words is the list of words for which the co-occurrence matrix is built.
co_oc is the co-occurrence matrix.
Output: