I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones.
My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.
See also:
I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones.
My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.
See also:
Do a search foR Lucene, im not sure if theres a PHP port but i do know Lucene is available for many platforms. Lucene is an OSS (from Apache) indexing and search library. Naturally it and community extras might have something interesting to look at. At the very least you can learn how its done in one language so you can translate the "idea" into PHP
Look into WordNet, a large lexical database for the English language:
http://wordnet.princeton.edu/
There are APIs for accessing it in several languages.
The most current version of the stemmer in NLTK is Snowball.
You can find examples on how to use it here:
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball2-pysrc.html#demo
If you know Python, The Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.
Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:
You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:
There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.
.Net lucene has an inbuilt porter stemmer. You can try that. But note that porter stemming does not consider word context when deriving the lemma. (Go through the algorithm and its implementation and you will see how it works)