Before I used VectorAssembler() to consolidate some one-hot-encoded categorical features, my data frame looked like this:
| Numerical| HotEncoded1| HotEncoded2|
| 14460.0| (44,[5],[1.0])| (3,[0],[1.0])|
| 14460.0| (44,[9],[1.0])| (3,[0],[1.0])|
| 15181.0| (44,[1],[1.0])| (3,[0],[1.0])|
The first column is a numerical column and the other two columns represent the transformed data set for OneHotEncoded categorical features. After applying VectorAssembler(), my output becomes:
[(48,[0,1,9],[14460.0,1.0,1.0])]
[(48,[0,3,25],[12827.0,1.0,1.0])]
[(48,[0,1,18],[12828.0,1.0,1.0])]
I am unsure of what these numbers mean and cannot make sense of this transformed data set. Some clarification on what this output means would be great!
This output is not specific to VectorAssembler. It is just a string representation of o.a.s.ml.linalg.SparseVector (o.a.s.mllib.linalg.SparseVector in Spark < 2.0) with:
- leading number representing the length of a vector
- the first set of numbers in brackets is a list of non-zero indices
- the second set of numbers in brackets is a list of values corresponding to the indices
So (48,[0,1,9],[14460.0,1.0,1.0]) represents a vector of length 48, with three non-zero entries:
- 14460.0 at the 0th position
- 1.0 at the 1st position
- 1.0 at the 9th position
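To make the notation concrete, here is a minimal sketch in plain Python (no Spark required) that expands the (size, indices, values) triple into an ordinary dense list. The helper name to_dense is hypothetical and is not part of the Spark API; it only mirrors the way the string representation is read.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a plain list."""
    dense = [0.0] * size              # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = v                  # fill in only the listed positions
    return dense

# The vector from the question: (48,[0,1,9],[14460.0,1.0,1.0])
dense = to_dense(48, [0, 1, 9], [14460.0, 1.0, 1.0])

print(len(dense))    # 48
print(dense[:10])    # [14460.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

All remaining positions (10 through 47) are 0.0, which is why the sparse form only lists three entries.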
Pretty much the same description applies to HotEncoded1 and HotEncoded2, while Numerical is just a scalar. Without seeing the metadata and constructors it is not possible to tell much more, but the encoded variables should have either 44 and 3 or 45 and 4 levels (depending on the dropLast parameter).
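As a rough illustration of where the length 48 comes from, the sketch below concatenates one scalar and two sparse triples the way an assembler would: each input is placed after the previous one, so its indices are shifted by the running offset. This is plain Python, not Spark code; assemble and level_count are hypothetical helpers, and the example assumes the assembler receives exactly the three columns shown in the question, in that order. The output quoted in the question may have been produced from different inputs, so its indices need not match.

```python
def assemble(columns):
    """Concatenate scalars and (size, indices, values) triples into one triple."""
    offset = 0
    indices, values = [], []
    for col in columns:
        if isinstance(col, tuple):            # sparse vector column
            size, idx, vals = col
            indices.extend(offset + i for i in idx)
            values.extend(vals)
            offset += size
        else:                                 # scalar column takes one slot
            if col != 0.0:
                indices.append(offset)
                values.append(col)
            offset += 1
    return (offset, indices, values)

def level_count(vector_size, drop_last=True):
    """Number of category levels implied by a one-hot vector size."""
    # With dropLast enabled, the last level is encoded as an all-zero vector.
    return vector_size + 1 if drop_last else vector_size

# First row of the question's data frame: 1 + 44 + 3 = 48 slots in total.
row = [14460.0, (44, [5], [1.0]), (3, [0], [1.0])]
assembled = assemble(row)
print(assembled)               # (48, [0, 6, 45], [14460.0, 1.0, 1.0])
print(level_count(44, True))   # 45
print(level_count(44, False))  # 44
```

Under these assumptions, index 0 holds Numerical, indices 1 to 44 hold HotEncoded1, and indices 45 to 47 hold HotEncoded2.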