I am looking for a lemmatisation implementation for English in Java. I have found a few already, but I need something that does not need too much memory to run (1 GB at most). I do not need a stemmer. Thanks.
There is a JNI binding for Hunspell, the spell checker used in OpenOffice and Firefox: http://hunspell.sourceforge.net/
You can try the free Lemmatizer API here: http://twinword.com/lemmatizer.php
Scroll down to find the Lemmatizer endpoint.
This lets you reduce "dogs" to "dog" and "abilities" to "ability".
If you pass in a POST or GET parameter called "text" with a string like "walked plants":
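As a rough sketch of the GET variant, the "text" parameter could be URL-encoded and appended to the endpoint using only the standard library. The endpoint URL and the class and method names below are placeholders I made up for illustration; take the real endpoint from the API page.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class LemmatizerRequest {
    // Hypothetical placeholder, NOT the real endpoint; look it up on the API page.
    static final String ENDPOINT = "https://api.example.com/lemmatizer";

    // Builds the query string carrying the "text" parameter, URL-encoded as UTF-8.
    static String buildQuery(String text) {
        try {
            return "text=" + URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Builds the full URL for a GET request against the placeholder endpoint.
    static String buildGetUrl(String text) {
        return ENDPOINT + "?" + buildQuery(text);
    }

    public static void main(String[] args) {
        // Prints https://api.example.com/lemmatizer?text=walked+plants
        System.out.println(buildGetUrl("walked plants"));
    }
}
```

For a POST, the same encoded string from `buildQuery` would go in the request body instead of the URL. Any authentication the service may require is not covered here.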
You get a response like this:
Check out Lucene Snowball.
Chris's answer regarding the Stanford Lemmatizer is great! Absolutely beautiful. He even included a pointer to the jar files, so I didn't have to google for them.
But one of his lines of code had a syntax error (he somehow swapped the closing parenthesis and the semicolon in the line that begins with "lemmas.add..."), and he forgot to include the imports.
As for the NoSuchMethodError, it's usually caused by that method not being made public static, but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h), that is not the problem. I suspect that the problem is somewhere in the build path (I'm using Eclipse Kepler, so it was no problem configuring the 33 jar files that I use in my project).
Below is my minor correction of Chris's code, along with an example (my apologies to Evanescence for butchering their perfect lyrics):
Here are my results. I was very impressed: it caught "'s" as "is" (sometimes) and did almost everything else perfectly:
Starting Stanford Lemmatizer
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
Adding annotator lemma
[how, could, you, be, see, into, my, eye, like, open, door, ?, you, lead, I, down, into, my, core, where, I, have, become, so, numb, without, a, soul, my, spirit, 's, sleep, somewhere, cold, until, you, find, it, there, and, lead, it, back, home, you, wake, I, up, inside, call, my, name, and, save, I, from, the, dark, you, have, bid, my, blood, and, it, run, before, I, would, become, undo, you, save, I, from, the, nothing, I, have, almost, become, you, be, bring, I, to, life, now, that, I, know, what, I, be, without, you, can, have, just, leave, I, you, breathe, into, I, and, make, I, real, frozen, inside, without, you, touch, without, you, love, ,, darling, only, you, be, the, life, among, the, dead, I, have, be, live, a, lie, ,, there, be, nothing, inside, you, be, bring, I, to, life, .]
The Stanford CoreNLP Java library contains a lemmatizer that is somewhat resource-intensive, but I have run it on my laptop with less than 512 MB of RAM.
To use it: