I am trying to apply something like StringIndexer to a column of sentences, i.e. transform a list of words into a list of integers.
For example:
input dataset:
(1, ["I", "like", "Spark"])
(2, ["I", "hate", "Spark"])
I expected the output after StringIndexer to be like:
(1, [0, 2, 1])
(2, [0, 3, 1])
Ideally, I would like to make this transformation part of a Pipeline, so that I can chain a couple of transformers together and serialize them for online serving.
Is this something Spark supports natively?
Thank you!
Standard Transformers used for converting text to features are CountVectorizer or HashingTF. Both have a binary option which can be used to switch from a count vector to a binary vector.
There is no built-in Transformer that can give the exact result you want (it wouldn't be useful for ML algorithms), but you can explode, apply StringIndexer, and collect_list / collect_set:
With CountVectorizer and udf: