How to change column metadata in pyspark?

2019-01-28 07:14发布

问题:

How can I update column metadata in PySpark? I have metadata values corresponding to nominal encoding of categorical (string) features and I would like to decode them back in automated way. Writing the metadata in pyspark API is not directly available unless you recreate the schema. Is it possible to edit metadata in PySpark on the go without converting dataset to RDD and converting it back, provided complete schema description (as described here)?

Example listing:

# Create DF
df.show()

# +---+-------------+
# | id|     features|
# +---+-------------+
# |  0|[1.0,1.0,4.0]|
# |  1|[2.0,2.0,4.0]|
# +---+-------------+
# - That one has all the necessary metadata about what is encoded in feature column

# Slice one feature out
df = VectorSlicer(inputCol='features', outputCol='categoryIndex', indices=[1]).transform(df)
df = df.drop('features')
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# |  0|        [1.0]|
# |  1|        [2.0]|
# +---+-------------+
# categoryIndex now carries metadata about singular array with encoding

# Get rid of the singular array
udf = UserDefinedFunction(lambda x: float(x[0]), returnType=DoubleType())
df2 = df.select(*[udf(column).alias(column) if column == 'categoryIndex' else column for column in df.columns])
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# |  0|          1.0|
# |  1|          2.0|
# +---+-------------+
# - Metadata is lost for that one


# Write metadata
extract = {...}
df2.schema.fields[1].metadata = extract(df.schema.fields[1].metadata)
# metadata is readable from df2.schema.fields[1].metadata but is not affective. 
# Saving and restoring df from parque destroys the change
# Decode categorical
df = IndexToString(inputCol="categoryIndex", outputCol="category").transform(df)
# ERROR. Was supposed to decode the categorical values

Question provides an insight about how to work with VectorAssembler, VectorIndexer and how to add metadata by constructing a complete schema using StructType but yet does not answer my question.

回答1:

In both cases losing metadata is expected:

  • When you call Python udf there is no relationship between input Column and its metadata, and output Column. UserDefinedFunction (both in Python and Scala) are black boxes for the Spark engine.
  • Assigning data directly to the Python schema object:

    df2.schema.fields[1].metadata = extract(df.schema.fields[1].metadata)
    

    is not a valid approach at all. Spark DataFrame is a thing wrapper around JVM object. Any changes in the Python wrappers, are completely opaque for JVM backend, and won't be propagated at all:

    import json 
    
    df = spark.createDataFrame([(1, "foo")], ("k", "v"))
    df.schema[-1].metadata = {"foo": "bar"}
    
    json.loads(df._jdf.schema().json())
    
    ## {'fields': [{'metadata': {}, 'name': 'k', 'nullable': True, 'type': 'long'},
    ##   {'metadata': {}, 'name': 'v', 'nullable': True, 'type': 'string'}],
    ## 'type': 'struct'}
    

    or even preserved in Python:

    df.select("*").schema[-1].metadata
    ## {}
    

With Spark < 2.2 you can use a small wrapper (taken from Spark Gotchas, maintained by me and @eliasah):

def withMeta(self, alias, meta):
    sc = SparkContext._active_spark_context
    jmeta = sc._gateway.jvm.org.apache.spark.sql.types.Metadata
    return Column(getattr(self._jc, "as")(alias, jmeta.fromJson(json.dumps(meta))))

df.withColumn("foo", withMeta(col("foo"), "", {...}))

With Spark >= 2.2 you can use Column.alias:

df.withColumn("foo", col("foo").alias("", metadata={...}))