I'm using the PorterStemmer Python port:
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
I'm using it for the following task:
The other thing you need to do is reduce each word to its stem. For example, the words sing, sings, singing all have the same stem, which is sing. There is a reasonably accepted way to do this, which is called Porter's algorithm. You can download something that performs it from http://tartarus.org/martin/PorterStemmer/.
I've modified this part of the code:
if __name__ == '__main__':
    p = PorterStemmer()
    if len(sys.argv) > 1:
        for f in sys.argv[1:]:
            infile = open(f, 'r')
            while 1:
                output = ''
                word = ''
                line = infile.readline()
                if line == '':
                    break
                for c in line:
                    if c.isalpha():
                        word += c.lower()
                    else:
                        if word:
                            output += p.stem(word, 0,len(word)-1)
                            word = ''
                        output += c.lower()
                print output,
            infile.close()
so that it reads from a preprocessed input string instead of a file and returns the output:
def algorithm(input):
    p = PorterStemmer()
    while 1:
        output = ''
        word = ''
        if input == '':
            break
        for c in input:
            if c.isalpha():
                word += c.lower()
            else:
                if word:
                    output += p.stem(word, 0,len(word)-1)
                    word = ''
                output += c.lower()
        return output
Note: if I put return output at the same indentation level as while 1:, it turns into an infinite loop.
Usage (Example)
import PorterStemmer as ps
ps.algorithm("Michael is Singing");
Output
Michael is
Expected Output
Michael is Sing
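To narrow it down, here is a reduced, self-contained sketch of the same character loop. It is only an approximation of my setup: `toy_stem` is a made-up stand-in for `PorterStemmer.stem` (it just strips a trailing "ing"), and `stem_text` is a hypothetical name for the loop body of `algorithm()` without the `while 1:` wrapper. It seems to show the same symptom, and the last word only comes out when a non-letter follows it:

```python
def toy_stem(word):
    # Hypothetical stand-in for PorterStemmer.stem, NOT the real algorithm:
    # it only strips a trailing "ing" from words longer than four letters.
    if len(word) > 4 and word.endswith("ing"):
        return word[:-3]
    return word


def stem_text(text):
    # Same character loop as in algorithm(), without the while loop.
    output = ''
    word = ''
    for c in text:
        if c.isalpha():
            # Letters are collected (lowercased) into the current word.
            word += c.lower()
        else:
            # A non-letter ends the current word: stem it, emit it, reset.
            if word:
                output += toy_stem(word)
                word = ''
            output += c.lower()
    return output


print(repr(stem_text("Michael is Singing")))   # 'michael is '
print(repr(stem_text("Michael is Singing.")))  # 'michael is sing.'
```

With a trailing period the final word is stemmed and emitted; without it, the word is still sitting in `word` when the function returns.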
What am I doing wrong?