I am trying to do multi-label text classification with scikit-learn in Python 3.x. My data is in libsvm format, and I am loading it with the load_svmlight_file function. The data looks like this:
- 314523,165538,76255 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:2 13:1
- 410523,230296,368303,75145 8:1 19:2 22:1 24:1 29:1 63:1 68:1 69:3 76:1 82:1 83:1 84:1
Each line corresponds to one document. The comma-separated numbers at the start of a line are its labels (three in the first line, four in the second), and the remaining entries are feature_index:value pairs. Each feature corresponds to a word.
I am loading the data using this script.
from sklearn.datasets import load_svmlight_file
X, Y = load_svmlight_file("train.csv", multilabel=True, zero_based=True)
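In case it helps to reproduce this without my file, here is a minimal sketch that loads only the two sample lines shown above from an in-memory buffer. The BytesIO buffer and the name `sample` are stand-ins I am using for illustration; my real script reads train.csv as shown.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# The two sample documents from above, as raw bytes
# (a file-like object passed to the loader must be in binary mode).
sample = (
    b"314523,165538,76255 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:2 13:1\n"
    b"410523,230296,368303,75145 8:1 19:2 22:1 24:1 29:1 63:1 68:1 69:3 76:1 82:1 83:1 84:1\n"
)

# Same arguments as in my script, only the source differs.
X, Y = load_svmlight_file(BytesIO(sample), multilabel=True, zero_based=True)

print(type(X).__name__, X.shape)  # container type and (documents, features)
print(Y[0])                       # labels of the first document
print(X[0])                       # features of the first document
```

With multilabel=True, Y comes back as a list with one tuple of labels per document, and X holds one row per document.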
My question is this: when I inspect the loaded data, for example with print(X[0]), I get this output:
(0, 1) 1.0
(0, 2) 1.0
(0, 3) 1.0
(0, 4) 1.0
(0, 5) 1.0
(0, 6) 1.0
(0, 7) 1.0
(0, 8) 1.0
(0, 9) 1.0
(0, 10) 1.0
(0, 11) 1.0
(0, 12) 2.0
(0, 13) 1.0
I don't understand the meaning of this format. Shouldn't it look something like this instead?
> 1 2 3 4 5 6 7 8 9 10 11 12 13
> 1 1 1 1 1 1 1 1 1 1 1 2 1
I am new to scikit-learn and would appreciate some help with this.