I have a text corpus with, say, 5 documents, each separated from the next by \n. I want to assign an id to every word in the corpus and calculate its respective tfidf score.
For example, suppose we have a text corpus named "corpus.txt" as follows:
"Stack over flow text vectorization scikit python scipy sparse csr"
I calculate the tfidf using:
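from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
with open("corpus.txt") as f:
    mylist = f.read().splitlines()  # one document per line
vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(mylist)
tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)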
The output is:
(0,12) 0.1234 #for 1st document
(1,8) 0.3456 #for 2nd document
(1,4) 0.8976
(2,15) 0.6754 #for third document
(2,14) 0.2389
(2,3) 0.7823
(3,11) 0.9897 #for fourth document
(3,13) 0.8213
(3,5) 0.7722
(3,6) 0.2211
(4,7) 0.1100 # for fifth document
(4,10) 0.6690
(4,2) 0.0912
(4,9) 0.2345
(4,1) 0.1234
I converted this scipy.sparse.csr matrix into a list of key:value pairs to remove the document id, keeping only the vocabulary_id and its respective tfidf score, using:
m = x_tfidf.tocoo()
mydata = {k: v for k, v in zip(m.col, m.data)}
key_val_pairs = [str(k) + ":" + str(v) for k, v in mydata.items()]
The problem is that I get an output where the vocabulary_id and its respective tfidf score are arranged in ascending order of vocabulary_id, without any reference to the document they came from.
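To illustrate where the document reference goes missing, here is a minimal sketch on a hypothetical 2x3 matrix (`toy` stands in for `x_tfidf`; the values are made up). The dict is keyed by column index only, so `m.row` is never consulted, and a column that appears in more than one document keeps only its last value.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2x3 matrix standing in for x_tfidf; column 1 is used by both rows.
toy = csr_matrix(np.array([[0.0, 0.5, 0.0],
                           [0.3, 0.7, 0.0]]))
m = toy.tocoo()

# Keyed by column only: m.row is ignored, and repeated columns overwrite each other.
by_col = {int(k): float(v) for k, v in zip(m.col, m.data)}

# Keeping the row next to the column preserves the document reference.
triples = [(int(r), int(c), float(v)) for r, c, v in zip(m.row, m.col, m.data)]
```

Here `by_col` has two entries while `triples` has three, because the two values stored in column 1 collapse into one dict key.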
For example, for the corpus above, my current output (which I dumped to a text file using json) looks like:
1:0.1234
2:0.0912
3:0.7823
4:0.8976
5:0.7722
6:0.2211
7:0.1100
8:0.3456
9:0.2345
10:0.6690
11:0.9897
12:0.1234
13:0.8213
14:0.2389
15:0.6754
whereas I would like my text file to look as follows, with one line per document:
12:0.1234
8:0.3456 4:0.8976
15:0.6754 14:0.2389 3:0.7823
11:0.9897 13:0.8213 5:0.7722 6:0.2211
7:0.1100 10:0.6690 2:0.0912 9:0.2345 1:0.1234
Any idea how to get this done?
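For reference, the layout I am after could probably be produced by walking the matrix one row at a time. This is only a rough sketch: `rows_to_lines` is a hypothetical helper name, `toy` is a made-up stand-in for `x_tfidf`, and the four-decimal formatting is an assumption. It relies on the `indptr`, `indices` and `data` attributes of a scipy CSR matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

def rows_to_lines(matrix):
    """Return one 'col:score col:score ...' string per row of a sparse matrix."""
    csr = csr_matrix(matrix)  # make sure indptr/indices/data are available
    lines = []
    for i in range(csr.shape[0]):
        # indptr[i]:indptr[i + 1] is the slice of stored entries belonging to row i
        start, end = csr.indptr[i], csr.indptr[i + 1]
        cols = csr.indices[start:end]
        vals = csr.data[start:end]
        lines.append(" ".join(f"{c}:{v:.4f}" for c, v in zip(cols, vals)))
    return lines

# Hypothetical stand-in for x_tfidf (two documents, three vocabulary ids).
toy = csr_matrix(np.array([[0.0, 0.5, 0.0],
                           [0.3, 0.7, 0.0]]))
lines = rows_to_lines(toy)
```

Writing "\n".join(lines) to a file would then give one document per line. Within a line, the entries come out in whatever order the matrix stores them, which may differ from the order in which the matrix is printed.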