Spark pipeline VectorAssembler: drop other columns?

Posted 2019-09-15 15:56

A Spark VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) produces the following output:

id | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]

As you can see, the last column contains all the preceding features. Is it better / more performant to remove the other columns, so that only the label/id and features are retained? Or is removing them unnecessary overhead, and simply feeding label/id and features into the estimator is enough?

What happens when the VectorAssembler is used in a pipeline? Will only the final features column be used, or will it introduce collinearity (duplicate columns) if the original columns are not removed manually?

1 answer
Root(大扎)
#2 · 2019-09-15 16:11

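Please read the documentation carefully. Every classifier is parametrized by the features column (featuresCol). Apart from the label column (labelCol), it doesn't consider any other column or the order of columns, so leaving the original columns in the DataFrame does not feed them to the model a second time.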
