I am trying to do multi-label text classification using scikit-learn in Python 3.x. I have data in libsvm format, which I am loading with the load_svmlight_file function. The data format is like this:
- 314523,165538,76255 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:2 13:1
- 410523,230296,368303,75145 8:1 19:2 22:1 24:1 29:1 63:1 68:1 69:3 76:1 82:1 83:1 84:1
Each of these lines corresponds to one document. The comma-separated numbers at the start of a line are the labels, and the remaining entries are feature numbers with their values (feature:value). Each feature corresponds to a word.
I am loading the data using this script.
from sklearn.datasets import load_svmlight_file
X, Y = load_svmlight_file("train.csv", multilabel=True, zero_based=True)
My question is that when I inspect the data, for example with print(X[0]), I get this output:
(0, 1) 1.0
(0, 2) 1.0
(0, 3) 1.0
(0, 4) 1.0
(0, 5) 1.0
(0, 6) 1.0
(0, 7) 1.0
(0, 8) 1.0
(0, 9) 1.0
(0, 10) 1.0
(0, 11) 1.0
(0, 12) 2.0
(0, 13) 1.0
I don't understand the meaning of this format. Shouldn't the format be something like this?
> 1 2 3 4 5 6 7 8 9 10 11 12 13
> 1 1 1 1 1 1 1 1 1 1 1 2 1
I am new to scikit. I would appreciate some help in this regard.
This has nothing to do with multilabel classification per se. The feature matrix X that you get from load_svmlight_file is a SciPy CSR matrix, as explained in the docs, and those print in a rather unfortunate format:
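As a minimal sketch of what that printout means, the snippet below builds a small CSR matrix by hand instead of loading the asker's file. The values are an assumption chosen to mirror the first document in the question. Each printed "(row, column) value" line is one stored nonzero entry, and toarray() gives the dense view the question was expecting.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical one-row matrix mirroring the first document in the question:
# column 0 is empty, columns 1..13 hold the feature values.
dense = np.array([[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1]], dtype=float)
X = csr_matrix(dense)

# Printing a sparse row lists only the stored entries as "(row, col) value".
print(X[0])

# Convert the row to an ordinary dense NumPy array to see every column.
row = X[0].toarray()
print(row)

# The column indices and values can also be read directly from the CSR row.
cols = X[0].indices
vals = X[0].data
print(cols)
print(vals)
```

Converting to dense is fine for inspecting a single row, but for a large text dataset the whole matrix is usually best left sparse, since most entries are zero.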