Given my pyspark Row object:
>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>
However, row.features fails the isinstance(row.features, Vector) test:
>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False
This strange behavior put me in big trouble: without passing the isinstance(row.features, Vector) check, I am not able to generate LabeledPoints using a map function. I would be really grateful if anyone can solve this problem.
If you just want to convert the SparseVectors from pyspark.ml to pyspark.mllib SparseVectors, you can use MLUtils. Say df is your DataFrame and the column with the SparseVectors is named "features". Then a few lines like the following accomplish the conversion:
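A minimal sketch, assuming a Spark 2.0+ session and a vector column named "features":

```python
from pyspark.mllib.util import MLUtils

# Convert the pyspark.ml vectors in the "features" column
# to their pyspark.mllib counterparts.
df = MLUtils.convertVectorColumnsFromML(df, "features")
```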
This problem occurred for me because, when using CountVectorizer from pyspark.ml.feature, I could not create LabeledPoints, due to the incompatibility with the SparseVector from pyspark.ml.

I wonder why the latest CountVectorizer documentation does not use the "new" SparseVector class. Since the classification algorithms need LabeledPoints, this makes no sense to me...
UPDATE: I had misunderstood that the ml library is designed for DataFrame objects, while the mllib library is for RDD objects. The DataFrame data structure is recommended since Spark 2.0, because SparkSession is more convenient than SparkContext (though it stores a SparkContext object internally) and delivers DataFrames instead of RDDs. I found this post that gave me the "aha" effect: mllib and ml. Thanks Alberto Bonsanto :).
To use e.g. NaiveBayes from mllib, I had to transform my DataFrame into an RDD of LabeledPoint objects, as sketched below.
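A minimal sketch of that transformation (the column names "label" and "features" are my assumption):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Build an RDD of LabeledPoints from the DataFrame, converting each
# pyspark.ml vector to its pyspark.mllib counterpart along the way.
training = df.rdd.map(
    lambda row: LabeledPoint(row.label, Vectors.fromML(row.features))
)
```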
But it is way easier to use NaiveBayes from ml, because there you don't need LabeledPoints; you can just specify the feature and label columns of your DataFrame.
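A sketch, again assuming columns named "features" and "label":

```python
from pyspark.ml.classification import NaiveBayes

# No LabeledPoints needed: the estimator reads the feature and
# label columns straight from the DataFrame.
nb = NaiveBayes(featuresCol="features", labelCol="label")
model = nb.fit(df)
```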
PS: I was struggling with this problem for hours, so I felt I had to post it here :)
It is unlikely to be an error. You didn't provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and comparing the wrong entities.
Let's illustrate that with an example, starting with some simple data:
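(A minimal reconstruction, since the original data isn't shown: it assumes a SparkSession named spark, and uses CountVectorizer, the transformer mentioned in the other answer, to produce the vector column.)

```python
from pyspark.ml.feature import CountVectorizer

# Toy data: a label and a bag of words per row.
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
    ["label", "words"],
)

# CountVectorizer lives in pyspark.ml, so its output column
# holds pyspark.ml.linalg vectors.
cv = CountVectorizer(inputCol="words", outputCol="features")
encoded = cv.fit(df).transform(df)
```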
Now let's import the different vector classes (the aliases are just to keep the two APIs apart):
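```python
from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
```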
and run the checks on the first row:
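```python
row = encoded.first()

isinstance(row.features, MLLibVector)
# False
isinstance(row.features, MLVector)
# True
```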
As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:
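For example, building a LabeledPoint from it fails (a sketch; the exact message may vary across Spark versions):

```python
LabeledPoint(0.0, row.features)
# TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'>
# into Vector
```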
You can convert the ML object to an MLlib one, rebuilding it from its size, indices, and values:
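```python
v = row.features
old_v = MLLibVectors.sparse(v.size, v.indices, v.values)

isinstance(old_v, MLLibVector)
# True
```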
or simply, with Vectors.fromML (available in pyspark.mllib.linalg since Spark 2.0):
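```python
old_v = MLLibVectors.fromML(row.features)
```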
but generally speaking, you should avoid situations where this is necessary.