Handling continuous data in Spark NaiveBayes

Posted 2019-01-27 06:17

As per the official documentation of Spark NaiveBayes:

It supports Multinomial NB (see here) which can handle finitely supported discrete data.

How can I handle continuous data (for example, the percentage of something in some document) in Spark NaiveBayes?

Tags: apache-spark apache-spark-mllib naivebayes

1 answer

甜甜的少女心

#2 · 2019-01-27 06:44

The current implementation can process only binary features, so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive and might be a better fit when you want to use some domain-specific knowledge.
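To make the difference between the two discretizers concrete, here is a minimal plain-Python sketch of the idea. It is not the Spark API: `bucketize`, the split values, and the sample percentages are all made up for illustration. The first variant uses hand-picked split points (the Bucketizer idea), the second derives them from the data (the QuantileDiscretizer idea, which Spark computes approximately).

```python
from bisect import bisect_right
from statistics import quantiles


def bucketize(value, splits):
    """Return the index i of the bucket [splits[i], splits[i+1]) holding value.

    The last bucket also includes its upper bound, hence the clamp.
    """
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


# Hypothetical sample of a continuous feature (a percentage per document).
percentages = [0.02, 0.07, 0.15, 0.3, 0.42, 0.55, 0.8, 0.95]

# Bucketizer idea: split points chosen by hand from domain knowledge.
manual_splits = [float("-inf"), 0.1, 0.5, float("inf")]
manual_buckets = [bucketize(p, manual_splits) for p in percentages]

# QuantileDiscretizer idea: split points derived from the data itself.
cut_points = quantiles(percentages, n=4)  # three cut points for four buckets
quantile_splits = [float("-inf"), *cut_points, float("inf")]
quantile_buckets = [bucketize(p, quantile_splits) for p in percentages]

print(manual_buckets)
print(quantile_buckets)
```

The hand-picked variant needs no pass over the data to find its split points, which is why it is the cheaper option.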

For encoding you can use dummy encoding with OneHotEncoder, adjusting its dropLast Param.
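As a rough illustration of what dropLast changes, here is a plain-Python sketch (again not the Spark API; `one_hot` and its arguments are hypothetical names). With the last category dropped, the vector is one element shorter and the last category is represented by all zeros; without dropping, every category gets its own binary feature.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 values.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category becomes the all-zeros vector.
    """
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]


# Three buckets (indices 0, 1, 2), encoded both ways.
with_drop = [one_hot(i, 3, drop_last=True) for i in range(3)]
without_drop = [one_hot(i, 3, drop_last=False) for i in range(3)]

print(with_drop)
print(without_drop)
```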

So overall you'll need:

QuantileDiscretizer or Bucketizer -> OneHotEncoder for each continuous feature.
StringIndexer* -> OneHotEncoder for each discrete feature.
VectorAssembler to combine all of the above.

* Or predefined column metadata.
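The three steps above can be sketched end to end in plain Python. This only mimics the shape of the pipeline and is not Spark code: the rows, the column names `pct` and `lang`, the split points, and every helper function are assumptions made for illustration. The string indexing orders labels by descending frequency, which is StringIndexer's default ordering.

```python
from bisect import bisect_right
from collections import Counter


def bucketize(value, splits):
    """Index of the bucket [splits[i], splits[i+1]) that holds value."""
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


def fit_string_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: i for i, lab in enumerate(ordered)}


def one_hot(index, size):
    """One binary slot per category (no category dropped)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical rows: one continuous and one discrete feature.
rows = [
    {"pct": 0.05, "lang": "en"},
    {"pct": 0.30, "lang": "en"},
    {"pct": 0.80, "lang": "de"},
]
splits = [float("-inf"), 0.1, 0.5, float("inf")]
lang_index = fit_string_index(r["lang"] for r in rows)


def assemble(row):
    """Concatenate the encoded parts into one binary feature vector."""
    pct_part = one_hot(bucketize(row["pct"], splits), len(splits) - 1)
    lang_part = one_hot(lang_index[row["lang"]], len(lang_index))
    return pct_part + lang_part


features = [assemble(r) for r in rows]
print(features)
```

Each resulting vector contains only zeros and ones, which is the kind of input the answer says the classifier expects.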
