Context: I have a data frame with two columns: label and features.
org.apache.spark.sql.DataFrame = [label: int, features: vector]
Where features is an mllib.linalg.VectorUDT of numeric type built using VectorAssembler.
Question: Is there a way to assign a schema to the features vector? I want to keep track of the name of each feature.
Tried so far:
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("feat1", "feat2", "feat3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
scala> attrGroup.toMetadata
res197: org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"feat1"},{"idx":1,"name":"feat2"},{"idx":2,"name":"feat3"}]},"num_attrs":3}}
But I was not sure how to apply this to an existing data frame.
There are at least two options:
1. On an existing DataFrame you can use the as method with a metadata argument:
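A minimal sketch, assuming df is the [label: int, features: vector] data frame from the question and attrGroup is the group built above (the alias string does not matter here because withColumn keeps the column name):

import org.apache.spark.sql.functions.col

// Attach the attribute metadata to the existing vector column.
val dfWithMeta = df.withColumn("features", col("features").as("features", attrGroup.toMetadata))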
2. When you create a new DataFrame, convert the AttributeGroup to a StructField and use it as the schema for the given column:
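A sketch assuming a SparkSession named spark. Note that the group name ("userFeatures" above) becomes the column name, and that recent Spark versions use org.apache.spark.ml.linalg vectors with this API (the question shows the older mllib UDT):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// The vector field carries the attribute metadata and is named after the group.
val schema = StructType(Array(
  StructField("label", IntegerType, nullable = false),
  attrGroup.toStructField()
))

val rows = spark.sparkContext.parallelize(Seq(Row(1, Vectors.dense(1.0, 2.0, 3.0))))
val dfWithSchema = spark.createDataFrame(rows, schema)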
If the vector column has been created using VectorAssembler, column metadata describing the parent columns should already be attached:
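For example (a self-contained sketch; the input column names feat1, feat2, feat3 and the spark session are illustrative assumptions):

import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// Illustrative one-row input; the feature column names are arbitrary.
val raw = Seq((1, 1.0, 2.0, 3.0)).toDF("label", "feat1", "feat2", "feat3")

val assembler = new VectorAssembler()
  .setInputCols(Array("feat1", "feat2", "feat3"))
  .setOutputCol("features")

val assembled = assembler.transform(raw).select("label", "features")

// The parent column names can be read back from the output column's metadata.
assembled.schema("features").metadata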
Vector fields are not directly accessible using dot syntax (like $features.feat1), but they can be used by specialized tools such as VectorSlicer:
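A sketch reusing the assembled data frame from the previous snippet; the selected feature names are illustrative:

import org.apache.spark.ml.feature.VectorSlicer

// VectorSlicer selects sub-vectors by feature name, relying on the attached metadata.
val slicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("selected")
  .setNames(Array("feat1", "feat3"))

slicer.transform(assembled).show()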
For PySpark see How can I declare a Column as a categorical feature in a DataFrame for use in ml.