I'll try do describe what I have in mind.
There is a text content stored in MS SQL database. Content comes daily as a stream. Some people go through the content every day and, if the content fits certain criteria, mark it as validated. There is only one category. It's either "valid" or not.
What I want is to create a model based on already validated content, save it and using this model to "pre-validate" or mark new incoming content. Also once in a while to update the model based on a newly validated content. Hopefully I explained myself clearly.
I am thinking to use Spark streaming for data classification based on model created. And Naive Bayes algorithm. But how would you approach creating, updating and storing the model? There is ~200K+ validated resultants (texts) of various length. Do I need so many for the model? And how to use this model in Spark Streaming.
Thanks in advance.
Wow this question is very broad, and more related to
Machine Learning
than toApache Spark
, however I will try to give you some hints or steps to follow (I won't do the work for you).Import all the libraries you need
Load your data into an RDD
Tokenize your data (if you use ML it might be easier) and
Remove all unnecessary words (widely known as stop-words), and symbols e.g.
,.&
Create a dictionary with all
distinct
words in all your dataset, it sounds huge but they are not so many as you would expect, and I bet they will fit in your master node (however there are other ways to approach this, but for simplicity I will keep this way).Convert your
features
into a DenseVector or SparseVector, I would obviously recommend the second way because normally aSparseVector
requires less space to be represented, however it depends on the data. Note, there are better alternatives likehashing
, but I am trying to keep loyal to my verbose approach. After that transform thetuple
into a LabeledPointFit your favorite model, in this case I used LogisticRegressionWithSGD due ulterior motives.
Save your model.
Load your LogisticRegressionModel in another application.
Predict the categories.
Note that I didn't evaluate the
accuracy
of the model, however it looks very doesn't it?