I am learning multi-label classification and trying to implement the tf-idf tutorial from scikit-learn.
I am dealing with a text corpus to calculate its tf-idf score.
I am using the module sklearn.feature_extraction.text for this. Using CountVectorizer and TfidfTransformer, I now have my corpus vectorised and a tf-idf value for each vocabulary term.
The problem is that I now have a sparse matrix, like:
(0, 47) 0.104275891915
(0, 383) 0.084129133023
.
.
.
.
(4, 308) 0.0285015996586
(4, 199) 0.0285015996586
I want to convert this sparse.csr.csr_matrix into a list of lists, so that I can drop the document id from the above csr_matrix and get vocabularyId:tfidf pairs like
47:0.104275891915 383:0.084129133023
.
.
.
.
308:0.0285015996586
199:0.0285015996586
Is there any way to convert it into a list of lists, or any other way to change the format so that I get tfidf-vocabularyId pairs?
I don't know what tf-idf expects, but I may be able to help with the sparse end.
Make a sparse matrix (with numpy imported as np and sparse imported from scipy):
In [526]: M=sparse.random(4,10,.1)
In [527]: M
Out[527]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in COOrdinate format>
In [528]: print(M)
(3, 1) 0.281301619779
(2, 6) 0.830780358032
(1, 1) 0.242503399296
(2, 2) 0.190933579917
Now convert it to coo format. It is already in that format (I could have given sparse.random a format parameter). In any case, the values in coo format are stored in 3 arrays:
In [529]: Mc=M.tocoo()
In [530]: Mc.data
Out[530]: array([ 0.28130162, 0.83078036, 0.2425034 , 0.19093358])
In [532]: Mc.row
Out[532]: array([3, 2, 1, 2], dtype=int32)
In [533]: Mc.col
Out[533]: array([1, 6, 1, 2], dtype=int32)
It looks like you want to ignore Mc.row and somehow join the others.
For example as a dictionary:
In [534]: {k:v for k,v in zip(Mc.col, Mc.data)}
Out[534]: {1: 0.24250339929583264, 2: 0.19093357991697379, 6: 0.83078035803205375}
or as columns in a 2d array:
In [535]: np.column_stack((Mc.col, Mc.data))
Out[535]:
array([[ 1. , 0.28130162],
[ 6. , 0.83078036],
[ 1. , 0.2425034 ],
[ 2. , 0.19093358]])
(Also np.array((Mc.col, Mc.data)).T)
Or as just a list of arrays, [Mc.col, Mc.data], or as a list of lists, [Mc.col.tolist(), Mc.data.tolist()], etc.
Can you take it from there?
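If you want the pairs grouped per document instead of flattened, one option is to read the csr arrays directly. Note that the flat dictionary above keeps only one value when two rows share a column index (key 1 in Out[534]), so grouping by row avoids that. This is only a minimal sketch: rows_as_pairs is a name I made up, and the small dense matrix stands in for your tf-idf output. It relies on the csr attributes indptr, indices and data, where indptr[i]:indptr[i+1] marks the slice of stored entries belonging to row i.

```python
import numpy as np
from scipy import sparse

def rows_as_pairs(matrix):
    """Return one list of (column_index, value) pairs per matrix row.

    Hypothetical helper for illustration; accepts any scipy sparse matrix.
    """
    csr = matrix.tocsr()  # converts to csr if the input is in another format
    rows = []
    # indptr[i]:indptr[i + 1] is the slice of stored entries for row i
    for start, end in zip(csr.indptr[:-1], csr.indptr[1:]):
        cols = csr.indices[start:end].tolist()
        vals = csr.data[start:end].tolist()
        rows.append(list(zip(cols, vals)))
    return rows

# Small hand-made matrix standing in for the tf-idf result (assumed data)
dense = np.array([[0.0, 0.5, 0.0, 0.25],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.75, 0.5, 0.0, 0.0]])
pairs = rows_as_pairs(sparse.csr_matrix(dense))

# One "col:value col:value ..." string per row, as in the question
lines = [" ".join(f"{c}:{v}" for c, v in row) for row in pairs]
for line in lines:
    print(line)
```

Each inner list holds the pairs for one row, with the row index itself dropped; a row with no stored values gives an empty list.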