How to split Vector into columns - using PySpark

Posted 2019-01-01 07:57

Question:

Context: I have a DataFrame with 2 columns: word and vector, where the type of the "vector" column is VectorUDT.

An Example:

word    |  vector

assert  | [435,323,324,212...]

And I want to get this:

word   |  v1  | v2  | v3  | v4  | v5 | v6 ......

assert | 435 | 323 | 324 | 212 |....

Question:

How can I split a column of vectors into several columns, one per dimension, using PySpark?

Thanks in advance

Answer 1:

One possible approach is to convert to and from RDD:

from pyspark.ml.linalg import Vectors

df = sc.parallelize([
    ("assert", Vectors.dense([1, 2, 3])),
    ("require", Vectors.sparse(3, {1: 2}))
]).toDF(["word", "vector"])

def extract(row):
    return (row.word, ) + tuple(row.vector.toArray().tolist())

df.rdd.map(extract).toDF(["word"])  # Vector values will be named _2, _3, ...

## +-------+---+---+---+
## |   word| _2| _3| _4|
## +-------+---+---+---+
## | assert|1.0|2.0|3.0|
## |require|0.0|2.0|0.0|
## +-------+---+---+---+
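The `_2, _3, ...` names can be avoided by passing a full name list, e.g. `df.rdd.map(extract).toDF(["word", "v1", "v2", "v3"])`. Independently of Spark, the flattening that `extract` performs can be sketched in plain Python; the namedtuple `Row` below is an illustrative stand-in for a Spark Row whose vector has already been converted to a Python list:

```python
from collections import namedtuple

# Illustrative stand-in for a Spark Row whose dense vector has already been
# converted to a Python list (what row.vector.toArray().tolist() returns).
Row = namedtuple("Row", ["word", "vector"])

def extract(row):
    # Prepend the word, then splice every vector component into the tuple,
    # giving one tuple element per output column.
    return (row.word,) + tuple(row.vector)

rows = [Row("assert", [1.0, 2.0, 3.0]), Row("require", [0.0, 2.0, 0.0])]
flat = [extract(r) for r in rows]
print(flat)
# [('assert', 1.0, 2.0, 3.0), ('require', 0.0, 2.0, 0.0)]
```

Each tuple element maps to exactly one column once the flattened RDD is turned back into a DataFrame.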

An alternative solution is to create a UDF:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    return udf(to_array_, ArrayType(DoubleType()))(col)

(df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i] for i in range(3)]))

## +-------+-----+-----+-----+
## |   word|xs[0]|xs[1]|xs[2]|
## +-------+-----+-----+-----+
## | assert|  1.0|  2.0|  3.0|
## |require|  0.0|  2.0|  0.0|
## +-------+-----+-----+-----+
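Both approaches rely on `v.toArray().tolist()`, which densifies sparse vectors by filling unset positions with 0.0 — that is why the `require` row reads 0.0 | 2.0 | 0.0 above. That densification can be sketched without Spark; the `sparse_to_dense` helper is illustrative, not part of PySpark:

```python
# Illustrative densification: roughly what
# Vectors.sparse(3, {1: 2}).toArray().tolist() yields for the "require" row.
def sparse_to_dense(size, indices_to_values):
    dense = [0.0] * size           # unset positions default to 0.0
    for i, v in indices_to_values.items():
        dense[i] = float(v)        # place each stored value at its index
    return dense

print(sparse_to_dense(3, {1: 2}))  # [0.0, 2.0, 0.0]
```

Note also that `range(3)` in the `select` hard-codes the vector length, so the dimensionality must be known up front.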

For the Scala equivalent, see Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double].