I am trying to standardize (mean = 0, std = 1) one column ('age') in my data frame. Below is my code in Spark (Python):
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
# Assemble my 'age' column into a single-element vector column:
age_assembler = VectorAssembler(inputCols=['age'], outputCol="age_feature")
# Create a scaler that takes 'age_feature' as an input column:
scaler = StandardScaler(inputCol="age_feature", outputCol="age_scaled",
withStd=True, withMean=True)
# Creating a mini-pipeline for those 2 steps:
age_pipeline = Pipeline(stages=[age_assembler, scaler])
# Fit the pipeline and transform the data:
scaled = age_pipeline.fit(sample17)
sample17_scaled = scaled.transform(sample17)
type(sample17_scaled)
It seems to run just fine. And the very last line produces: "sample17_scaled:pyspark.sql.dataframe.DataFrame"
But when I run the line below, the schema shows that the new column age_scaled is of type 'vector':
sample17_scaled.printSchema()
|-- age_scaled: vector (nullable = true)
How can I calculate anything using this new column? For example, I can't calculate a mean. When I try, the error says the column should be 'long' and not udt (the vector type).
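For context, this is roughly the kind of call that fails (a minimal sketch; the mean aggregation is just one example, and the column names are the ones from the pipeline above):

from pyspark.sql import functions as F

# Trying a simple aggregation on the scaled column; this is where the error
# appears, since 'age_scaled' is a vector (UDT) rather than a numeric column:
sample17_scaled.select(F.mean('age_scaled')).show()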
Thank you very much!