I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this.
My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields, and I also calculate some extra columns. My parsing function returns this:
val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))
My main function uses the parsing function like this:
val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo","joinedCounts")
I then use a VectorAssembler like this:
val assembler = new VectorAssembler()
.setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
.setOutputCol("features")
val assemblerData = assembler.transform(data)
So when I print a row of my data before it goes into the VectorAssembler, it looks like this:
[3.2,17.0,15.0,[0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]]
After the transform function of VectorAssembler I print the same row of data and get this:
[3.2,(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])]
What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?
There is nothing strange about the output. Your vector has a lot of zero elements, so Spark used its sparse representation. To explain further:
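Your vector has 18 elements (its dimension). The indices [0,1,6,9,14,17] are the positions that hold non-zero elements, and their values are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. In other words, the output holds exactly the same data as the row you printed before the transform: orderNo and pageNo sit at indices 0 and 1, followed by the 16 values of joinedCounts.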
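To make the mapping concrete, here is a minimal sketch in plain Scala, with no Spark dependency, that expands the (size, indices, values) triple back into a dense array. The object name SparseDemo and its toDense helper are hypothetical, written only for illustration; they are not part of Spark.

```scala
// Hypothetical helper for illustration only; not a Spark API.
object SparseDemo {
  // Expand a sparse (size, indices, values) triple into a dense array.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0) // every position starts at zero
    // Write each stored value at the position its index names.
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // The triple printed in the question after the transform.
    val dense = toDense(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))
    println(dense.mkString("[", ",", "]"))
  }
}
```

The resulting array starts with orderNo (17.0) and pageNo (15.0), and the remaining 16 entries are the joinedCounts values from the row printed before the transform.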
A sparse vector representation stores only the non-zero entries, which saves memory and usually makes computation faster. More on sparse representation here.
Now of course you can convert that sparse representation to a dense representation, but it comes at a cost, since every zero is then stored explicitly.
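As a rough sketch of that conversion, assuming Spark 1.4 or later and the org.apache.spark.mllib.linalg vectors that the question's sqlContext-era code would use (with the Spark 2.x DataFrame API the equivalent classes live in org.apache.spark.ml.linalg), the sparse vector can be rebuilt by hand and converted with toDense:

```scala
import org.apache.spark.mllib.linalg.Vectors

// The sparse vector printed in the question, rebuilt by hand.
val sparse = Vectors.sparse(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))

// toDense materialises every element, zeros included.
val dense = sparse.toDense

// Both representations hold the same values; only the storage differs.
assert(sparse.toArray.sameElements(dense.toArray))
```

If you only need the raw numbers, toArray on the sparse vector gives the same dense Array[Double] without creating a second vector object.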
If you are interested in getting feature importance, I advise you to take a look at this.