Understanding format of data in scikit-learn

Posted 2019-07-20 16:34

Question:

I am working on multi-label text classification with scikit-learn in Python 3.x. My data is in libsvm format, and I load it with the load_svmlight_file function. The data looks like this:

  • 314523,165538,76255 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:2 13:1
  • 410523,230296,368303,75145 8:1 19:2 22:1 24:1 29:1 63:1 68:1 69:3 76:1 82:1 83:1 84:1

Each line corresponds to one document. The comma-separated numbers at the start of the line are the labels (three for the first document, four for the second), and the remaining index:value entries are feature numbers with their values. Each feature corresponds to a word.

I am loading the data using this script.

from sklearn.datasets import load_svmlight_file

X, Y = load_svmlight_file("train.csv", multilabel=True, zero_based=True)

My question is about the format of the loaded data. When I inspect it with, for example, print(X[0]), I get this output:

(0, 1) 1.0
(0, 2) 1.0
(0, 3) 1.0
(0, 4) 1.0
(0, 5) 1.0
(0, 6) 1.0
(0, 7) 1.0
(0, 8) 1.0
(0, 9) 1.0
(0, 10) 1.0
(0, 11) 1.0
(0, 12) 2.0
(0, 13) 1.0

I don't understand the meaning of this format. Shouldn't the output look something like this?

> 1  2  3  4  5  6  7  8  9  10  11  12  13

> 1  1  1  1  1  1  1  1  1   1   1   2   1  

I am new to scikit-learn and would appreciate some help in this regard.

Answer 1:

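This has nothing to do with multilabel classification per se. The feature matrix X that you get from load_svmlight_file is a SciPy CSR (Compressed Sparse Row) matrix, as explained in the docs. Printing one lists only the stored nonzero entries, one per line, as (row, column) followed by the value, which is a rather unfortunate format: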

>>> from scipy.sparse import csr_matrix
>>> X = csr_matrix([[0, 0, 1], [2, 3, 0]])
>>> X
<2x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> X.toarray()
array([[0, 0, 1],
       [2, 3, 0]])
>>> print(X)
  (0, 2)    1
  (1, 0)    2
  (1, 1)    3
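
To tie this back to the output in the question, here is a minimal sketch that rebuilds the first document's row by hand and converts it to the dense layout the asker expected. It does not read train.csv. The feature values are copied from the first sample line, and the width of 14 columns is an assumption that covers only that one line (the real matrix is as wide as the largest feature index in the file, plus one).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Feature index -> value for the first document shown in the question.
features = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0,
            8: 1.0, 9: 1.0, 10: 1.0, 11: 1.0, 12: 2.0, 13: 1.0}

cols = np.array(sorted(features), dtype=np.int64)
vals = np.array([features[c] for c in cols], dtype=np.float64)
rows = np.zeros(len(cols), dtype=np.int64)  # every entry belongs to row 0

# Assumed shape: one row, columns 0..13. With zero_based=True the indices
# are used as-is, so column 0 exists but holds no stored value.
X = csr_matrix((vals, (rows, cols)), shape=(1, 14))

# Dense view of the row: zeros are filled in for the entries not stored.
dense_row = X[0].toarray()
print(dense_row)

# The same information as print(X) gives: one (row, column) value per entry.
coo = X.tocoo()
for r, c, v in zip(coo.row, coo.col, coo.data):
    print((int(r), int(c)), v)
```

Calling toarray() on the full matrix from a real text dataset can use a lot of memory, since every zero gets materialized. Converting a single row or a small slice, as above, is usually enough for inspection.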