I am learning multi-label classification and trying to implement the tf-idf tutorial from scikit-learn. I am working with a text corpus and want to calculate its tf-idf scores, using the module sklearn.feature_extraction.text. Using CountVectorizer and TfidfTransformer, I now have my corpus vectorised and a tf-idf value for each vocabulary term. The problem is that I now have a sparse matrix, like:
(0, 47) 0.104275891915
(0, 383) 0.084129133023
.
.
.
.
(4, 308) 0.0285015996586
(4, 199) 0.0285015996586
I want to convert this sparse.csr.csr_matrix into a list of lists, so that I can drop the document id from the above csr_matrix and get vocabularyId:tfidf pairs like
47:0.104275891915 383:0.084129133023
.
.
.
.
308:0.0285015996586
199:0.0285015996586
Is there any way to convert it into a list of lists, or any other way to change the format to get the tfidf-vocabularyId pairs?
I don't know what tf-idf expects, but I may be able to help with the sparse end.
Make a sparse matrix:
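The original snippet is not shown here, so the following is only a sketch of what such a test matrix might look like. The shape, density and seed are assumptions chosen for illustration; the name M is also hypothetical.

```python
from scipy import sparse

# Hypothetical stand-in for the tf-idf output: 5 "documents", 10 "vocabulary ids".
# The shape, density and random_state values are illustrative assumptions.
M = sparse.random(5, 10, density=0.2, random_state=0)

print(repr(M))   # summary: shape, dtype, number of stored elements, format
print(M)         # prints "(row, col) value" entries, like the output in the question
```

Without a format argument, sparse.random returns a matrix in coo format, which is checked via M.format.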
Now convert it to coo format. This one is already in that format (I could have given random a format parameter). In any case, the values in coo format are stored in 3 arrays: Mc.row, Mc.col and Mc.data.
It looks like you want to ignore Mc.row, and somehow join the others.
For example as a dictionary:
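A minimal sketch of the conversion and the dictionary idea follows. The test matrix is an assumption (built in CSR format here so that tocoo() actually does something, mimicking the matrix in the question), and the name pairs is hypothetical.

```python
from scipy import sparse

# Hypothetical data standing in for the tf-idf matrix (CSR, as in the question).
M = sparse.random(5, 10, density=0.2, format="csr", random_state=0)

# Convert to COO; the nonzero entries are then exposed as three parallel arrays.
Mc = M.tocoo()
print(Mc.row)    # document (row) index of each stored value
print(Mc.col)    # vocabulary (column) index of each stored value
print(Mc.data)   # the stored values themselves

# Ignore Mc.row and map column index -> value.
pairs = dict(zip(Mc.col.tolist(), Mc.data.tolist()))
print(pairs)
```

Note that a dictionary keyed only on the column index keeps a single value per column, so entries from different documents that share a vocabulary id overwrite one another. That is fine for a single row, but not for the whole matrix at once.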
or as columns in a 2D array:
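One possible way to do this is np.column_stack; whether the original answer used exactly this call is an assumption. The test matrix and the name arr are likewise hypothetical.

```python
import numpy as np
from scipy import sparse

# Hypothetical data standing in for the tf-idf matrix.
Mc = sparse.random(5, 10, density=0.2, random_state=0).tocoo()

# One row per stored value: first column is the column index, second is the value.
arr = np.column_stack((Mc.col, Mc.data))
print(arr.shape)   # (nnz, 2)
print(arr)
```

Because a NumPy array has a single dtype, the integer column indices are upcast to floats in the result.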
(Also np.array((Mc.col, Mc.data)).T.)
Or as just a list of arrays, [Mc.col, Mc.data], or as a list of lists, [Mc.col.tolist(), Mc.data.tolist()], etc.
Can you take it from there?
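If the goal is one list of vocabularyId:tfidf pairs per document, as in the question, one way to take it from there is to group the col and data values by Mc.row. This is only a sketch under assumptions: the test matrix is random rather than real tf-idf output, and the names rows, pairs_per_doc and lines are hypothetical.

```python
from collections import defaultdict
from scipy import sparse

# Hypothetical data standing in for the tf-idf matrix (CSR, as in the question).
M = sparse.random(5, 10, density=0.2, format="csr", random_state=0)
Mc = M.tocoo()

# Group (column index, value) pairs by their row (document) index.
rows = defaultdict(list)
for r, c, v in zip(Mc.row.tolist(), Mc.col.tolist(), Mc.data.tolist()):
    rows[r].append((c, v))

# One list per document; documents with no stored values get an empty list.
pairs_per_doc = [rows.get(i, []) for i in range(Mc.shape[0])]

# Optional: format each document as "vocabularyId:tfidf" strings.
lines = [" ".join(f"{c}:{v}" for c, v in doc) for doc in pairs_per_doc]
for line in lines:
    print(line)
```

The document id is dropped from each pair but survives implicitly as the position in the outer list, so the pairs can still be matched back to their documents.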