Cannot convert type into Vector

Published 2019-01-25 16:11

Question:

Given my pyspark Row object:

>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>

However, row.features fails the isinstance(row.features, Vector) test.

>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False

This strange behavior has caused me a lot of trouble. Unless isinstance(row.features, Vector) passes, I am not able to generate a LabeledPoint with a map function. I would be really grateful if anyone could solve this problem.

Answer 1:

It is unlikely to be an error. You didn't provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers, and you are comparing the wrong entities.
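The key point is that isinstance compares against a specific class object, not a class name, so two classes that are both called Vector but live in different modules never match each other. A minimal pure-Python sketch of that (the stand-in class names here are hypothetical, not Spark's):

```python
# isinstance() checks the actual class object, not its name. Two classes
# that are both named "Vector" but defined in different modules (here:
# stand-ins for pyspark.ml.linalg and pyspark.mllib.linalg) do not match.

class MLVector:                  # stand-in for pyspark.ml.linalg.Vector
    pass

class MLSparseVector(MLVector):  # stand-in for pyspark.ml.linalg.SparseVector
    pass

class MLLibVector:               # stand-in for pyspark.mllib.linalg.Vector
    pass

v = MLSparseVector()
print(isinstance(v, MLVector))     # True  -- v's class derives from MLVector
print(isinstance(v, MLLibVector))  # False -- same name in Spark, unrelated class
```

This is exactly the situation with the two Spark vector hierarchies below: the names look identical, but the classes are unrelated.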

Let's illustrate that with an example. Some simple data:

from pyspark.ml.feature import OneHotEncoder

row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0, )]).toDF(["x"])
).first()

Now let's import the different vector classes:

from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint

and run the checks:

isinstance(row.features, MLLibVector)
False
isinstance(row.features, MLVector)
True

As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:

LabeledPoint(0.0, row.features)
TypeError                                 Traceback (most recent call last)
...
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
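That error comes from the old mllib API guarding its inputs with a type check along these lines (a simplified, hypothetical sketch to show the mechanism, not Spark's actual source):

```python
class MLLibVector:      # stand-in for pyspark.mllib.linalg.Vector
    pass

class MLSparseVector:   # stand-in for pyspark.ml.linalg.SparseVector
    pass

def _convert_to_vector(v):
    # Simplified guard: anything that is not an mllib Vector is rejected
    # with the same kind of message as the traceback above.
    if not isinstance(v, MLLibVector):
        raise TypeError("Cannot convert type {0} into Vector".format(type(v)))
    return v

try:
    _convert_to_vector(MLSparseVector())
except TypeError as e:
    print(e)  # Cannot convert type <class '__main__.MLSparseVector'> into Vector
```

Because the ml SparseVector is not in the mllib Vector hierarchy, the check fails no matter how similar the two classes look.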

You could convert the ML object to an MLLib one:

from pyspark.ml import linalg as ml_linalg

def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))

LabeledPoint(0, as_mllib(row.features))
LabeledPoint(0.0, (1,[],[]))

or simply:

LabeledPoint(0, MLLibVectors.fromML(row.features))
LabeledPoint(0.0, (1,[],[]))

but generally speaking you should avoid situations where this is necessary.



Answer 2:

If you just want to convert SparseVectors from pyspark.ml to pyspark.mllib SparseVectors, you can use MLUtils. Say df is your DataFrame and the column holding the SparseVectors is named "features". Then the following lines accomplish this:

from pyspark.mllib.util import MLUtils
df = MLUtils.convertVectorColumnsFromML(df, "features")

This problem occurred for me when using CountVectorizer from pyspark.ml.feature: I could not create LabeledPoints because of the incompatibility with the SparseVector from pyspark.ml.

I wonder why the latest CountVectorizer documentation does not use the "new" SparseVector class. Since the classification algorithms need LabeledPoints, this makes no sense to me...

UPDATE: I had misunderstood that the ml library is designed for DataFrame objects and the mllib library for RDD objects. The DataFrame data structure is recommended since Spark 2.0, because SparkSession is more convenient than SparkContext (though it holds a SparkContext object) and delivers DataFrames instead of RDDs. I found this post that gave me the "aha" effect: mllib and ml. Thanks Alberto Bonsanto :).

To use e.g. NaiveBayes from mllib, I had to transform my DataFrame into an RDD of LabeledPoint objects.
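That transformation is essentially one map over the rows: df.rdd.map(lambda row: LabeledPoint(row.label, Vectors.fromML(row.features))). Here is a minimal pure-Python sketch of the same shape, with namedtuples standing in for Spark's Row and LabeledPoint so it runs without Spark (the field names are from the question's example, not fixed by Spark):

```python
from collections import namedtuple

# Stand-ins for pyspark.sql.Row and pyspark.mllib.regression.LabeledPoint;
# in real code you would call df.rdd.map(...) instead of a list comprehension.
Row = namedtuple("Row", ["clicked", "features"])
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

rows = [Row(clicked=0, features=(1.0, 0.0, 0.5)),
        Row(clicked=1, features=(0.0, 1.0, 0.0))]

# One LabeledPoint per row: label first, feature vector second.
points = [LabeledPoint(float(r.clicked), r.features) for r in rows]
print(points[0])  # LabeledPoint(label=0.0, features=(1.0, 0.0, 0.5))
```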

But it is much easier to use NaiveBayes from ml, because you don't need LabeledPoints and can just specify the features and label columns of your DataFrame.

PS: I was struggling with this problem for hours, so I felt I needed to post it here :)