Before I used VectorAssembler() to consolidate some OneHotEncoded categorical features, my data frame looked like this:
| Numerical| HotEncoded1| HotEncoded2|
| 14460.0| (44,[5],[1.0])| (3,[0],[1.0])|
| 14460.0| (44,[9],[1.0])| (3,[0],[1.0])|
| 15181.0| (44,[1],[1.0])| (3,[0],[1.0])|
The first column is a numerical column and the other two columns represent the transformed data set for OneHotEncoded categorical features. After applying VectorAssembler(), my output becomes:
[(48,[0,1,9],[14460.0,1.0,1.0])]
[(48,[0,3,25],[12827.0,1.0,1.0])]
[(48,[0,1,18],[12828.0,1.0,1.0])]
I am unsure of what these numbers mean and cannot make sense of this transformed data set. Some clarification on what this output means would be great!
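This output is not specific to VectorAssembler. It is just the string representation of org.apache.spark.ml.linalg.SparseVector (org.apache.spark.mllib.linalg.SparseVector in Spark < 2.0), in which the leading number is the length of the vector, the first array lists the indices of the non-zero entries, and the second array lists the values at those indices.
So (48,[0,1,9],[14460.0,1.0,1.0]) represents a vector of length 48 with three non-zero entries: 14460.0 at index 0, 1.0 at index 1, and 1.0 at index 9. The length 48 is the sum of the input widths, 1 + 44 + 3.
Pretty much the same description applies to HotEncoded1 and HotEncoded2, while Numerical is just a scalar. Without seeing the metadata and constructors it is not possible to tell much more, but the encoded variables should have either 44 and 3 levels or 45 and 4 levels, depending on the dropLast parameter (when it is true, the last category is omitted from the vector).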
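To make the layout concrete, here is a minimal sketch in plain Python (no Spark required) that expands a (size, indices, values) triple and maps each assembled index back to its source column. The helper names to_dense and locate are hypothetical, not part of Spark. The column order passed to VectorAssembler is not shown in the question, so the order in COLUMNS is an assumption, chosen only because it is consistent with the first row shown.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a plain list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def locate(index, columns):
    """Map a position in the assembled vector to (column name, local index).

    `columns` is a list of (name, width) pairs in assembly order.
    """
    offset = 0
    for name, width in columns:
        if index < offset + width:
            return name, index - offset
        offset += width
    raise IndexError(f"index {index} is outside the assembled vector")


# ASSUMPTION: this input order is a guess; the question does not show
# the inputCols that were passed to VectorAssembler.
COLUMNS = [("Numerical", 1), ("HotEncoded2", 3), ("HotEncoded1", 44)]

# First assembled row from the question.
size, indices, values = 48, [0, 1, 9], [14460.0, 1.0, 1.0]

# The assembled length is the sum of the input widths.
assert sum(width for _, width in COLUMNS) == size

decoded = [(*locate(i, COLUMNS), v) for i, v in zip(indices, values)]
for name, local_index, value in decoded:
    print(f"{name}[{local_index}] = {value}")
```

Under that assumed order, index 0 is Numerical, index 1 is position 0 of HotEncoded2, and index 9 is position 5 of HotEncoded1, which agrees with the first input row (14460.0, (44,[5],[1.0]), (3,[0],[1.0])). With a different input order the offsets would shift accordingly.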