Given my pyspark Row object:
>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>
However, row.features fails the isinstance(row.features, Vector) test:
>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False
This strange behavior put me in big trouble: without passing the isinstance(row.features, Vector) check, I am not able to generate LabeledPoints using a map function. I would be really grateful if anyone can solve this problem.
If you just want to convert the SparseVectors from pyspark.ml to pyspark.mllib SparseVectors, you can use MLUtils. Say df is your DataFrame and the column with the SparseVectors is named "features". Then a few lines like the following accomplish the conversion:
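A minimal sketch, assuming a Spark 2.0+ session and a vector column named "features":

```python
from pyspark.mllib.util import MLUtils

# Convert the pyspark.ml vectors in the "features" column
# to their pyspark.mllib counterparts.
df = MLUtils.convertVectorColumnsFromML(df, "features")
```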
This problem occurred for me because, when using CountVectorizer from pyspark.ml.feature, I could not create LabeledPoints, due to the incompatibility with the SparseVector from pyspark.ml.

I wonder why the latest CountVectorizer documentation does not use the "new" SparseVector class. Since the classification algorithms need LabeledPoints, this makes no sense to me...
UPDATE: I had misunderstood that the ml library is designed for DataFrame objects, while the mllib library is for RDD objects. The DataFrame data structure is recommended since Spark 2.0, because SparkSession is more convenient than SparkContext (though it stores a SparkContext object internally) and delivers DataFrames instead of RDDs. I found this post that gave me the "aha" effect: mllib and ml. Thanks Alberto Bonsanto :).
To use e.g. NaiveBayes from mllib, I had to transform my DataFrame into an RDD of LabeledPoint objects, as sketched below.
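A minimal sketch of that transformation (the column names "label" and "features" are my assumption):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Build an RDD of LabeledPoints from the DataFrame, converting each
# pyspark.ml vector to its pyspark.mllib counterpart along the way.
training = df.rdd.map(
    lambda row: LabeledPoint(row.label, Vectors.fromML(row.features))
)
```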
But it is way easier to use NaiveBayes from ml, because there you don't need LabeledPoints; you can just specify the feature and label columns of your DataFrame.
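A sketch, again assuming columns named "features" and "label":

```python
from pyspark.ml.classification import NaiveBayes

# No LabeledPoints needed: the estimator reads the feature and
# label columns straight from the DataFrame.
nb = NaiveBayes(featuresCol="features", labelCol="label")
model = nb.fit(df)
```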
PS: I was struggling with this problem for hours, so I felt I had to post it here :)
It is unlikely to be an error. You didn't provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and comparing the wrong entities.
Let's illustrate that with an example, starting with some simple data:
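(A minimal reconstruction, since the original data isn't shown: it assumes a SparkSession named spark, and uses CountVectorizer, the transformer mentioned in the other answer, to produce the vector column.)

```python
from pyspark.ml.feature import CountVectorizer

# Toy data: a label and a bag of words per row.
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
    ["label", "words"],
)

# CountVectorizer lives in pyspark.ml, so its output column
# holds pyspark.ml.linalg vectors.
cv = CountVectorizer(inputCol="words", outputCol="features")
encoded = cv.fit(df).transform(df)
```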
Now let's import the different vector classes (the aliases are just to keep the two APIs apart):
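```python
from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
```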
and run the checks on the first row:
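```python
row = encoded.first()

isinstance(row.features, MLLibVector)
# False
isinstance(row.features, MLVector)
# True
```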
As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:
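For example, building a LabeledPoint from it fails (a sketch; the exact message may vary across Spark versions):

```python
LabeledPoint(0.0, row.features)
# TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'>
# into Vector
```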
You can convert the ML object to an MLlib one, rebuilding it from its size, indices, and values:
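```python
v = row.features
old_v = MLLibVectors.sparse(v.size, v.indices, v.values)

isinstance(old_v, MLLibVector)
# True
```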
or simply, with Vectors.fromML (available in pyspark.mllib.linalg since Spark 2.0):
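```python
old_v = MLLibVectors.fromML(row.features)
```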
but generally speaking, you should avoid situations where this is necessary.