How to encode string values into numeric values in

Posted 2019-03-05 04:45

Question:

I have a DataFrame with two columns:

df = 
  Col1   Col2
  aaa    bbb
  ccc    aaa

I want to encode the string values as numeric values. I managed to do it this way:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer1 = new StringIndexer()
                    .setInputCol("Col1")
                    .setOutputCol("Col1Index")
                    .fit(df)

val indexer2 = new StringIndexer()
                    .setInputCol("Col2")
                    .setOutputCol("Col2Index")
                    .fit(df)

val indexed1 = indexer1.transform(df)
val indexed2 = indexer2.transform(df)

val encoder1 = new OneHotEncoder()
                    .setInputCol("Col1Index")
                    .setOutputCol("Col1Vec")

val encoder2 = new OneHotEncoder()
                    .setInputCol("Col2Index")
                    .setOutputCol("Col2Vec")

val encoded1 = encoder1.transform(indexed1)
encoded1.show()

val encoded2 = encoder2.transform(indexed2)
encoded2.show()

The problem is that aaa is encoded differently in the two columns. How can I encode the DataFrame so that the same string maps to the same number in both columns, e.g.:

df_encoded = 
   Col1   Col2
   1      2
   3      1

Answer 1:

Train a single StringIndexer on the values of both columns:

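import org.apache.spark.ml.feature.StringIndexer
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark (as in spark-shell)

val df = Seq(("aaa", "bbb"), ("ccc", "aaa")).toDF("col1", "col2")

// Fit one indexer on the values of both columns stacked into a single column
val indexer = new StringIndexer().setInputCol("col").fit(
   df.select("col1").toDF("col").union(df.select("col2").toDF("col"))
)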

Then apply a copy of the fitted model to each column, overriding its input and output columns:

import org.apache.spark.ml.param.ParamMap

// Reuse the fitted model on each column by overriding only its input/output columns
val result = Seq("col1", "col2").foldLeft(df) {
  (acc, c) => indexer
    .copy(new ParamMap()
      .put(indexer.inputCol, c)
      .put(indexer.outputCol, s"${c}_idx"))
    .transform(acc)
}

result.show
// +----+----+--------+--------+
// |col1|col2|col1_idx|col2_idx|
// +----+----+--------+--------+
// | aaa| bbb|     0.0|     1.0|
// | ccc| aaa|     2.0|     0.0|
// +----+----+--------+--------+
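
To see why a shared fit matters, here is a minimal plain-Scala sketch (no Spark required) that models a frequency-ordered label index. It is not Spark's implementation: the object SharedIndexSketch and the helper fitIndex are hypothetical names, and the ordering rule (most frequent label first, ties broken alphabetically) is an assumption about the default StringIndexer behaviour in recent Spark versions.

```scala
// Plain-Scala model of a frequency-ordered label index (illustrative only).
// Assumption: the most frequent label gets index 0.0 and ties are broken
// alphabetically.
object SharedIndexSketch {

  // Hypothetical helper: build a label -> index map from the observed values.
  def fitIndex(values: Seq[String]): Map[String, Double] =
    values
      .groupBy(identity)
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) } // frequency desc, then label asc
      .map { case (label, _) => label }
      .zipWithIndex
      .map { case (label, i) => (label, i.toDouble) }
      .toMap

  val col1 = Seq("aaa", "ccc")
  val col2 = Seq("bbb", "aaa")

  // Fitted separately, each column gets its own vocabulary, so the same
  // index can stand for different strings ("ccc" in col1, "bbb" in col2).
  val index1: Map[String, Double] = fitIndex(col1)
  val index2: Map[String, Double] = fitIndex(col2)

  // Fitted once on the values of both columns, every string has exactly one index.
  val shared: Map[String, Double] = fitIndex(col1 ++ col2)

  // Encode both columns row by row with the shared mapping.
  val encoded: Seq[(Double, Double)] =
    col1.zip(col2).map { case (a, b) => (shared(a), shared(b)) }

  def main(args: Array[String]): Unit = {
    println(s"separate: col1=$index1, col2=$index2")
    println(s"shared:   $shared")
    encoded.foreach(println)
  }
}
```

Under these assumptions the shared mapping reproduces the indices in the table above: aaa occurs twice and gets 0.0, followed by bbb and ccc.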