How to map variable names to features after pipeli

I have modified the OneHotEncoder example to actually train a LogisticRegression model. How can I map the generated weights back to the individual categories of the categorical variable?

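import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SQLContext

def oneHotEncoderExample(sqlContext: SQLContext): Unit = {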

val df = sqlContext.createDataFrame(Seq(
    (0, "a", 1.0),
    (1, "b", 1.0),
    (2, "c", 0.0),
    (3, "d", 1.0),
    (4, "e", 1.0),
    (5, "f", 0.0)
)).toDF("id", "category", "label")
df.show()

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)
indexed.select("id", "categoryIndex").show()

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("features")
val encoded = encoder.transform(indexed)
encoded.select("id", "features").show()


val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, lr))

// Fit the pipeline to the training data.
val pipelineModel = pipeline.fit(df)

// The last stage of the fitted pipeline is the trained logistic regression model.
val lorModel = pipelineModel.stages.last.asInstanceOf[LogisticRegressionModel]
println(s"LogisticRegression: $lorModel")
// Print the weights and intercept for logistic regression.
println(s"Weights: ${lorModel.coefficients} Intercept: ${lorModel.intercept}")
}

Output:

Weights: [1.5098946631236487,-5.509833649232324,1.5098946631236487,1.5098946631236487,-5.509833649232324] Intercept: 2.6679020381781235

Tags: scala apache-spark apache-spark-mllib apache-spark-ml

1 answer

Viruses.

#2 · 2019-01-23 20:09

I assume what you want here is access to the features metadata. Let's start by transforming the existing DataFrame:

val transformedDF = pipelineModel.transform(df)

Next, you can extract the metadata object from the features column:

val meta: org.apache.spark.sql.types.Metadata = transformedDF
  .schema(transformedDF.schema.fieldIndex("features"))
  .metadata

Finally, let's extract the attributes:

meta.getMetadata("ml_attr").getMetadata("attrs")
//  org.apache.spark.sql.types.Metadata = {"binary":[
//    {"idx":0,"name":"e"},{"idx":1,"name":"f"},{"idx":2,"name":"a"},
//    {"idx":3,"name":"b"},{"idx":4,"name":"c"}]}

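These can be used to relate weights back to the original features: each attribute's idx is its position in the feature vector, so the weight at that position in lorModel.coefficients belongs to the category with that name. Note that OneHotEncoder drops the last category by default, which is why the six categories yield only five features and "d" has no entry here.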
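
As a minimal sketch of that last step, the helper below pairs each attribute name with the weight at the same index. It reads the metadata through AttributeGroup.fromStructField instead of walking the raw Metadata object by hand. namedWeights is a hypothetical name introduced for illustration, not part of Spark, and the fallback label feature_<idx> for unnamed attributes is likewise an assumption.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructField

// Hypothetical helper (not part of Spark): pairs each feature name with the
// weight stored at the same vector index.
def namedWeights(field: StructField, weights: Array[Double]): Seq[(String, Double)] = {
  // Parse the ML attribute metadata attached to the vector column.
  val group = AttributeGroup.fromStructField(field)
  // attributes is None when the column carries no per-feature metadata.
  val attrs = group.attributes.map(_.toSeq).getOrElse(Seq.empty)
  attrs.flatMap { attr =>
    // index is the feature's position in the vector; attributes without one are skipped.
    attr.index.map(i => (attr.name.getOrElse(s"feature_$i"), weights(i))).toSeq
  }
}

// Usage, assuming transformedDF and lorModel from the snippets above:
// namedWeights(transformedDF.schema("features"), lorModel.coefficients.toArray)
//   .foreach { case (name, w) => println(s"$name -> $w") }
```

With the metadata and weights quoted in this thread, that pairing would read e and a and b at roughly 1.51 each, and f and c at roughly -5.51 each. The index order can differ between runs, so rely on the metadata rather than on a fixed order.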
