I want to use lemmatization on a text file:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
cables cables finally able hear gem long rumored music .
...
and the expected output is:
surprise heard thump open door small seed man clasp package wrap.
upgrade system found review spring 2008 issue mood audio back.
omg left gotta wrap review order asap . understand hand deliver dali lama
speak hand wear earplug live . listen maintain link long .
cable cable final able hear gem long rumor music .
...
Can anybody help me? What is the simplest method for lemmatization that has been implemented in Scala and Spark?
There is a function in the book Advanced Analytics with Spark, in the chapter about lemmatization:
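A minimal sketch of such a function, assuming Stanford CoreNLP and its models jar are on the classpath and Scala 2.10 to 2.12. It is not a verbatim copy of the book's listing; the names `Lemmatizer`, `createPipeline` and `plainTextToLemmas` are only illustrative.

```scala
import java.util.Properties
import scala.collection.JavaConversions._ // lets Scala iterate over the Java lists CoreNLP returns
import scala.collection.mutable.ArrayBuffer
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

object Lemmatizer {
  // Build a pipeline with only the annotators that lemmatization needs.
  def createPipeline(): StanfordCoreNLP = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
    new StanfordCoreNLP(props)
  }

  // Return the lemma of every token in `text`, in document order.
  def plainTextToLemmas(text: String, pipeline: StanfordCoreNLP): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      lemmas += token.get(classOf[LemmaAnnotation])
    }
    lemmas.toList
  }
}
```

Passing the pipeline in as an argument lets the caller decide how often it gets built.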
Now just call this for every line in your mapper.
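Roughly, that could look like the sketch below, assuming the input is an `RDD[String]` with one line per record. The lemmatization function is passed in as a plain `String => Seq[String]`; `LemmatizeLines` and `lemmatizeLines` are hypothetical names.

```scala
import org.apache.spark.rdd.RDD

object LemmatizeLines {
  // Apply `lemmatize` to every line and join the lemmas back into one string,
  // so the output keeps the one-line-per-record layout of the input.
  def lemmatizeLines(lines: RDD[String], lemmatize: String => Seq[String]): RDD[String] =
    lines.map(line => lemmatize(line).mkString(" "))
}
```

The function you pass in runs on the executors, so whatever it captures has to be serializable.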
EDIT:
I added a line to the code.
This is needed because otherwise sentences is a Java List, not a Scala one. It should now compile without problems.
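The added line is most likely the implicit Java-to-Scala collection conversion import (that is an assumption, the line itself is not shown). A small sketch of what it does, using a plain `java.util.List` as a stand-in for the list CoreNLP returns:

```scala
import java.util.{Arrays, List => JList}
import scala.collection.JavaConversions._ // implicit Java <-> Scala collection views

object JavaListDemo {
  // Stand-in for the java.util.List of sentences returned by CoreNLP.
  val sentences: JList[String] = Arrays.asList("first sentence", "second sentence")

  // Without the import above this does not compile, because java.util.List
  // has no map/foreach methods for a Scala for-comprehension to call.
  val lengths: Seq[Int] = (for (s <- sentences) yield s.length).toSeq
}
```

`JavaConversions` fits Scala 2.10, but it is deprecated in 2.12 and removed in 2.13; on newer versions the explicit `scala.collection.JavaConverters` with `.asScala` does the same job.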
I used Scala 2.10.4 and the following stanford.nlp dependencies:
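In sbt the dependencies would look something like the sketch below. The version number 3.4.1 is only an assumption, so use whichever CoreNLP release matches your Java version.

```scala
// build.sbt (sketch; the CoreNLP version is an assumption)
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.4.1",
  // The models jar holds the POS tagger model that the lemma annotator depends on.
  "edu.stanford.nlp" % "stanford-corenlp" % "3.4.1" classifier "models"
)
```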
You can also look at the stanford.nlp page, which has a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml.
EDIT:
mapPartitions version:
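A minimal sketch of that idea, assuming an `RDD[String]` of lines, CoreNLP plus its models jar on the executors' classpath, and Scala 2.10 to 2.12. `PartitionLemmatizer` is a hypothetical name, not code from the book.

```scala
import java.util.Properties
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

object PartitionLemmatizer {
  def lemmatize(lines: RDD[String]): RDD[String] =
    lines.mapPartitions { iter =>
      // Built once per partition, on the executor, so the pipeline is
      // never captured in the task closure.
      val props = new Properties()
      props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)

      // The same pipeline instance is reused for every line of the partition.
      iter.map { line =>
        val doc = new Annotation(line)
        pipeline.annotate(doc)
        val lemmas = new ArrayBuffer[String]()
        for (sentence <- doc.get(classOf[SentencesAnnotation]);
             token <- sentence.get(classOf[TokensAnnotation])) {
          lemmas += token.get(classOf[LemmaAnnotation])
        }
        lemmas.mkString(" ")
      }
    }
}
```

Usage would be something like `PartitionLemmatizer.lemmatize(sc.textFile(path))`, where `path` is your input file.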
Although I don't know if it's going to speed up the job significantly.
I think @user52045 has the right idea. The only modification I would make would be to use mapPartitions instead of map -- this allows you to only do the potentially expensive pipeline creation once per partition. This may not be a huge hit on a lemmatization pipeline, but it will be extremely important if you want to do something that requires a model, like the NER portion of the pipeline.
I would suggest using the Stanford CoreNLP wrapper for Apache Spark, as it exposes the official API for the basic CoreNLP functions such as lemmatization, tokenization, etc.
I have used it for lemmatization on a Spark DataFrame.
Link: https://github.com/databricks/spark-corenlp
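A rough sketch of how that looks, assuming the spark-corenlp package and the CoreNLP models jar are on the classpath and the DataFrame has a string column called `text` (the column names and `WrapperLemmatizer` are hypothetical). As far as I know the wrapper's `lemma` function is meant to work on one sentence at a time.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import com.databricks.spark.corenlp.functions.lemma

object WrapperLemmatizer {
  // Adds a `lemmas` column (array of strings) computed from the `text` column.
  def withLemmas(df: DataFrame): DataFrame =
    df.withColumn("lemmas", lemma(col("text")))
}
```

For text with several sentences per row, the wrapper also has an `ssplit` function that can be combined with `explode` to get one sentence per row first.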