I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.
My test words are: "cats running ran cactus cactuses cacti community communities", and each stemmer gets fewer than half right.
Take a look at LemmaGen, an open-source library written in C# 3.0.
Results for your test words (http://lemmatise.ijs.si/Services)
If I may quote my answer to the question StompChicken mentioned:
As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".
If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.
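To make the "custom dictionary of corrections" idea concrete, here is a minimal sketch in Python. The stemmer in it (`naive_stem`) is a hypothetical toy suffix stripper standing in for Porter or Snowball, not their real rules, and the `CORRECTIONS` table is an invented example of the kind of list you would maintain by hand. The only point is the order of operations: stem first, then look the result up in the corrections.

```python
# Hypothetical toy stemmer: a stand-in for Porter/Snowball, NOT their real rules.
SUFFIX_RULES = (("ies", "y"), ("ing", ""), ("es", ""), ("s", ""))


def naive_stem(word):
    """Strip one common suffix, leaving at least three characters."""
    # Leave words like "cactus" or "class" alone instead of chopping the "s".
    if word.endswith(("us", "ss")):
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


# Hand-maintained corrections (assumed example data), keyed on the stemmer's
# OUTPUT, because they run after the stemmer has done its thing.
CORRECTIONS = {"ran": "run", "runn": "run", "cacti": "cactus"}


def stem_with_corrections(word):
    """Stem the word, then apply a correction if one is listed."""
    stem = naive_stem(word.lower())
    return CORRECTIONS.get(stem, stem)


words = "cats running ran cactus cactuses cacti community communities".split()
results = [stem_with_corrections(w) for w in words]
print(results)
```

With a real stemmer you would swap `naive_stem` for the library call and key the corrections on whatever that stemmer actually returns for the irregular words you care about.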
http://wordnet.princeton.edu/man/morph.3WN
For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive Porter stemmer.
http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.
In Java, I use tartarus-snowball to stem words.
Maven:
Sample code:
I tried your list of terms on this Snowball demo site and the results look okay.
A stemmer is supposed to reduce inflected forms of a word to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.
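As a rough illustration of that point, the sketch below applies just two rewrite rules, loosely modelled on the plural and y-to-i steps of the Porter algorithm. The function name `conflate` is made up for this example and it is nowhere near a full stemmer. Both forms of "community" end up on the same root, which is all a stemmer promises, even though that root is not a dictionary word.

```python
def conflate(word):
    """Apply two Porter-like rules (an illustrative subset, not the algorithm)."""
    # Plural rule: "...ies" becomes "...i".
    if word.endswith("ies"):
        return word[:-3] + "i"
    # Final "y" becomes "i" when the rest of the word contains a vowel.
    if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
        return word[:-1] + "i"
    return word


singular = conflate("community")
plural = conflate("communities")
print(singular, plural, singular == plural)
```

Judged as an index key, that result is "right": the two forms match. Judged as a lemma, it is wrong, and that is the gap a dictionary-backed lemmatizer fills.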
I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.
This looks interesting: MIT Java WordnetStemmer: http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html