As per official documentation of Spark NaiveBayes:
It supports Multinomial NB (see here) which can handle finitely supported discrete data.
How can I handle continuous data (for example, the percentage of some in some document) in Spark NaiveBayes?
The current implementation can process only discrete features (counts or binary indicators), not continuous ones, so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive and might be a better fit when you want to use some domain-specific knowledge. For encoding you can use dummy encoding with OneHotEncoder, with the dropLast Param adjusted (its default is true).

So overall you'll need:
- QuantileDiscretizer or Bucketizer -> OneHotEncoder for each continuous feature.
- StringIndexer* -> OneHotEncoder for each discrete feature.
- VectorAssembler to combine all of the above.

* Or predefined column metadata.
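
To make the effect of these stages concrete, here is a minimal plain-Python sketch of what the encoded features end up looking like. It is not Spark code: the helper names (bucketize, one_hot, encode_row), the bucket edges and the category values are all hypothetical and only mimic the idea behind Bucketizer, OneHotEncoder (with dropLast turned off) and VectorAssembler.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) holding value.

    Follows the idea behind Spark's Bucketizer: the last bucket also
    includes its upper bound. Values outside the splits raise ValueError.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the splits {splits}")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, value) - 1


def one_hot(index, size, drop_last=False):
    """Encode a category index as a list of 0/1 values.

    With drop_last=True the last category is represented by all zeros,
    which is what a one-hot encoder does when its last slot is dropped.
    """
    width = size - 1 if drop_last else size
    return [1 if i == index else 0 for i in range(width)]


def encode_row(percentage, category, splits, categories):
    """Encode one continuous and one discrete feature, then concatenate them."""
    bucket_part = one_hot(bucketize(percentage, splits), len(splits) - 1)
    category_part = one_hot(categories.index(category), len(categories))
    return bucket_part + category_part


# Hypothetical bucket edges chosen from domain knowledge, and a
# hypothetical discrete feature with two possible values.
SPLITS = [0.0, 10.0, 50.0, 100.0]
CATEGORIES = ["news", "blog"]

row = encode_row(37.5, "blog", SPLITS, CATEGORIES)
print(row)  # [0, 1, 0, 0, 1]
```

Here 37.5 falls into the second of three buckets and "blog" is the second of two categories, so the final vector contains only 0/1 values, which is the kind of input the answer recommends feeding to NaiveBayes. In an actual Spark job the same steps would be expressed with the transformers listed above rather than hand-written functions.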