Understanding Representation of Vector Column in S

Posted 2019-01-20 04:47

Before I used VectorAssembler() to consolidate some OneHotEncoded categorical features, my data frame looked like this:

|  Numerical|     HotEncoded1|    HotEncoded2|
|    14460.0|  (44,[5],[1.0])|  (3,[0],[1.0])|
|    14460.0|  (44,[9],[1.0])|  (3,[0],[1.0])|
|    15181.0|  (44,[1],[1.0])|  (3,[0],[1.0])|

The first column is a numerical column and the other two columns represent the transformed data set for OneHotEncoded categorical features. After applying VectorAssembler(), my output becomes:

[(48,[0,1,9],[14460.0,1.0,1.0])]
[(48,[0,3,25],[12827.0,1.0,1.0])]
[(48,[0,1,18],[12828.0,1.0,1.0])]

I am unsure of what these numbers mean and cannot make sense of this transformed data set. Some clarification on what this output means would be great!

Tags: apache-spark apache-spark-sql apache-spark-mllib apache-spark-ml

1 answer

再贱就再见

#2 · 2019-01-20 04:53

This output is not specific to VectorAssembler. It is just the string representation of org.apache.spark.ml.linalg.SparseVector (org.apache.spark.mllib.linalg.SparseVector in Spark < 2.0), where:

the leading number is the length of the vector
the first set of numbers in brackets is the list of indices of the non-zero entries
the second set of numbers in brackets is the list of values at those indices

So (48,[0,1,9],[14460.0,1.0,1.0]) represents a vector of length 48, with three non-zero entries:

14460.0 at the 0th position
1.0 at the 1st position
1.0 at the 9th position
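
To make the reading above concrete, here is a minimal plain-Python sketch that expands the (length, indices, values) triple into a dense list. The helper to_dense is hypothetical and written only for illustration; it is not part of Spark.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list.

    Hypothetical helper for illustration only; not a Spark API.
    """
    dense = [0.0] * size          # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = v              # fill in only the listed positions
    return dense


# The first row of the output shown in the question.
dense = to_dense(48, [0, 1, 9], [14460.0, 1.0, 1.0])

# Collect the positions that are not zero to check the interpretation.
nonzero = [(i, v) for i, v in enumerate(dense) if v != 0.0]
print(len(dense))
print(nonzero)
```

In PySpark itself, calling toArray() on a SparseVector gives the equivalent dense array.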

Much the same description applies to HotEncoded1 and HotEncoded2; Numerical is just a scalar. Without seeing the metadata and constructors it is not possible to say much more, but the encoded variables should have either 44 and 3 levels or 45 and 4 levels, depending on the dropLast parameter (when it is enabled, the last category is dropped, so each vector is one shorter than the number of levels).
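
The length of 48 follows from concatenation: 1 slot for Numerical, 44 for HotEncoded1 and 3 for HotEncoded2. The sketch below shows the index arithmetic in plain Python, assuming the columns are assembled in the order Numerical, HotEncoded1, HotEncoded2 (the question does not show the inputCols order). The function assemble is a hypothetical stand-in for illustration, not VectorAssembler itself.

```python
def assemble(parts):
    """Concatenate sparse (size, indices, values) triples in order.

    Hypothetical illustration of concatenation; not the Spark implementation.
    """
    size, indices, values = 0, [], []
    for part_size, part_indices, part_values in parts:
        # Shift each local index by the total length of the parts before it.
        indices.extend(size + i for i in part_indices)
        values.extend(part_values)
        size += part_size
    return size, indices, values


# First input row from the question; the scalar is treated as a length-1 vector.
row = [
    (1, [0], [14460.0]),   # Numerical (assumed to come first)
    (44, [5], [1.0]),      # HotEncoded1
    (3, [0], [1.0]),       # HotEncoded2
]
assembled = assemble(row)
print(assembled)
```

Under that assumed column order, index 5 of HotEncoded1 lands at position 1 + 5 = 6, and index 0 of HotEncoded2 lands at position 1 + 44 + 0 = 45.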
