I'm thinking of adding stop word removal to my similarity program, followed by a stemmer (Porter 1 or 2, depending on which is easier to implement).
I read my text from files as whole lines and save them as one long string. So suppose I have two strings, e.g.
String one = "I decided buy something from the shop.";
String two = "Nevertheless I decidedly bought something from a shop.";
Now that I have those strings:
Stemming: Can I just run the stemming algorithm directly on the string, save the result as a String, and then continue with the similarity computation as I did before adding the stemmer, i.e. something like one.stem()?
Stop words: How does this work? O.o Do I just use one.replaceAll("I", ""); or is there a specific way to do this? I want to keep working with strings, so that I end up with a string I can pass to the similarity algorithms. Wikipedia doesn't say a lot.
Hope you can help me out! Thanks.
Edit: This is for a school project where I'm writing a paper comparing different similarity algorithms, so I don't think I'm allowed to use Lucene or other libraries that do the work for me. I would also like to understand how it works before I start using libraries like Lucene. Hope it's not too much of a bother ^^
You don't have to deal with the whole text. Just split it, apply your stopword filter and stemming algorithm, then build the string again using a StringBuilder.
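The split, filter and rebuild steps could look roughly like the sketch below. It uses only the standard library. The class name StopWordFilter, the three-word stop list and the choice to split on non-letter characters are assumptions made for illustration, not part of any library. Note that one.replaceAll("I", "") works on characters rather than words, so it would also turn "It" into "t"; comparing whole tokens avoids that.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {
    // Tiny illustrative stop list (an assumption); a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("i", "a", "the"));

    /** Splits on runs of non-letters, drops stop words, rejoins with single spaces. */
    public static String removeStopWords(String text) {
        StringBuilder sb = new StringBuilder();
        for (String token : text.split("[^\\p{L}]+")) {
            // split() can yield an empty first token if the text starts with punctuation.
            if (token.isEmpty() || STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // A stemmer would be applied to the token here, before appending.
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("I decided buy something from the shop."));
        System.out.println(removeStopWords("Nevertheless I decidedly bought something from a shop."));
    }
}
```

The result is again a plain String, so the existing similarity code can take it unchanged. Punctuation is dropped as a side effect of the split; if your similarity measure needs it, the split pattern would have to change.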
If you're not implementing this for academic reasons, you should consider using the Lucene library. In either case it might be good for reference. It has classes for tokenization, stop word filtering, stemming and similarity. Here's a quick example using Lucene 3.0 to remove stop words and stem an input string:
Which if used on your strings like this:
Yields this output:
Yes, you can wrap any stemmer so that you can write something like
Internally, your stemAndRemoveStopwords would
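As a rough sketch of what such a wrapper might look like internally: the class name Preprocessor, the constructor signature and the toyStem helper are all hypothetical. In particular, toyStem is a crude suffix stripper used only as a stand-in, not Porter's algorithm; your own Porter implementation would be passed in its place.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

public class Preprocessor {
    private final Set<String> stopWords;
    private final UnaryOperator<String> stemmer;

    /** The stemmer is any word-to-stem function, e.g. your own Porter implementation. */
    public Preprocessor(Set<String> stopWords, UnaryOperator<String> stemmer) {
        this.stopWords = stopWords;
        this.stemmer = stemmer;
    }

    /** Lowercases, tokenizes, drops stop words, stems each token, rejoins with spaces. */
    public String stemAndRemoveStopwords(String text) {
        StringBuilder sb = new StringBuilder();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(stemmer.apply(token));
        }
        return sb.toString();
    }

    /** Toy stand-in for a real stemmer (NOT Porter): strips one common suffix. */
    static String toyStem(String word) {
        for (String suffix : new String[] {"edly", "ing", "ed", "ly"}) {
            // Keep at least three characters so short words are left alone.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "a", "the"));
        Preprocessor p = new Preprocessor(stopWords, Preprocessor::toyStem);
        System.out.println(p.stemAndRemoveStopwords("I decided buy something from the shop."));
        System.out.println(p.stemAndRemoveStopwords("Nevertheless I decidedly bought something from a shop."));
    }
}
```

Passing the stemmer in as a function keeps the tokenizing and stop word logic independent of which stemmer you end up implementing, so swapping Porter 1 for Porter 2 would not touch the rest of the program.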