Understanding format of data in scikit-learn

I am trying to do multi-label text classification using scikit-learn in Python 3.x. My data is in libsvm format, and I load it with the load_svmlight_file function. The data looks like this:

314523,165538,76255 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:2 13:1

410523,230296,368303,75145 8:1 19:2 22:1 24:1 29:1 63:1 68:1 69:3 76:1 82:1 83:1 84:1

Each of these lines corresponds to one document. The comma-separated numbers at the start of a line are its labels, and the remaining entries are feature indices paired with their values (index:value). Each feature corresponds to a word.

I am loading the data using this script.

from sklearn.datasets import load_svmlight_file

X, Y = load_svmlight_file("train.csv", multilabel=True, zero_based=True)

My question is this: when I inspect the data, for example with print(X[0]), I get this output:

(0, 1) 1.0
(0, 2) 1.0
(0, 3) 1.0
(0, 4) 1.0
(0, 5) 1.0
(0, 6) 1.0
(0, 7) 1.0
(0, 8) 1.0
(0, 9) 1.0
(0, 10) 1.0
(0, 11) 1.0
(0, 12) 2.0
(0, 13) 1.0

I don't understand what this format means. Shouldn't the output look something like this?

> 1  2  3  4  5  6  7  8  9  10  11  12  13

> 1  1  1  1  1  1  1  1  1   1   1   2   1

I am new to scikit-learn. I would appreciate some help with this.

Tags: python numpy machine-learning scipy scikit-learn

1 answer

做自己的国王

#2 · 2019-07-20 16:41

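This has nothing to do with multilabel classification per se. The feature matrix X that you get from load_svmlight_file is a SciPy CSR (compressed sparse row) matrix, as explained in the docs. Printing one lists only the stored (nonzero) entries, one per line, as (row, column) value, so (0, 12) 2.0 means that row 0, column 12 holds 2.0. It is a rather unfortunate format: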

>>> from scipy.sparse import csr_matrix
>>> X = csr_matrix([[0, 0, 1], [2, 3, 0]])
>>> X
<2x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> X.toarray()
array([[0, 0, 1],
       [2, 3, 0]])
>>> print(X)
  (0, 2)    1
  (1, 0)    2
  (1, 1)    3
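
To tie this back to the question, here is a minimal sketch that loads the two sample lines and views the first row both ways. It assumes an in-memory BytesIO buffer as a stand-in for train.csv; the names sample, first, dense and pairs are made up for illustration.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# The two sample lines from the question, held in memory instead of train.csv
# (an assumption made so the example is self-contained).
sample = (
    b"314523,165538,76255 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:2 13:1\n"
    b"410523,230296,368303,75145 8:1 19:2 22:1 24:1 29:1 63:1 68:1 69:3 76:1 82:1 83:1 84:1\n"
)

# A file-like object must be opened in binary mode, which BytesIO satisfies.
X, Y = load_svmlight_file(BytesIO(sample), multilabel=True, zero_based=True)

first = X[0]                      # still sparse: only nonzero entries are stored
dense = first.toarray().ravel()   # dense 1-D view, zeros included

# Column indices and values of the stored entries, i.e. what print(X[0]) lists.
pairs = list(zip(first.indices.tolist(), first.data.tolist()))

print(X.shape)     # (number of documents, number of feature columns)
print(dense[:14])  # the "row of numbers" layout the question expected
print(pairs[:3])   # the same information as (column, value) pairs
```

Because zero_based=True is passed while the file's feature indices start at 1, column 0 of the dense row is always 0. For a real vocabulary-sized matrix, calling toarray() on all of X can use a lot of memory, so it is usually better to keep X sparse and densify only the rows you want to look at.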
