Porter Stemmer Algorithm Not returning the expecte

Posted 2019-03-01 06:14

Question:

I'm using the PorterStemmer Python port:

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.

I need it for the following:

The other thing you need to do is reduce each word to its stem. For example, the words sing, sings, singing all have the same stem, which is sing. There is a reasonably accepted way to do this, which is called Porter's algorithm. You can download something that performs it from http://tartarus.org/martin/PorterStemmer/.

And I've modified this code:

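import sys

if __name__ == '__main__':
    p = PorterStemmer()
    if len(sys.argv) > 1:
        for f in sys.argv[1:]:
            infile = open(f, 'r')
            while 1:
                output = ''
                word = ''
                line = infile.readline()
                if line == '':  # readline() returns '' only at end of file
                    break
                for c in line:
                    if c.isalpha():
                        word += c.lower()
                    else:
                        # a non-letter ends the current word: stem and emit it
                        if word:
                            output += p.stem(word, 0,len(word)-1)
                            word = ''
                        output += c.lower()
                # write the line as-is; it already carries its own newline
                sys.stdout.write(output)
            infile.close()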

My version reads from a preprocessed input string instead of a file and returns the output:

def algorithm(input):
    p = PorterStemmer()
    while 1:
        output = ''
        word = ''
        if input == '':
            break
        for c in input:
            if c.isalpha():
                word += c.lower()
            else:
                if word:
                    output += p.stem(word, 0,len(word)-1)
                    word = ''
                output += c.lower()
        return output

Note: if I move return output to the same indentation level as while 1:, it turns into an infinite loop.

Usage (Example)

import PorterStemmer as ps
ps.algorithm("Michael is Singing")

Output

Michael is

Expected Output

Michael is Sing

What am I doing wrong?

Answer 1:

The culprit is that the final word of the input is never written to output: a word is only stemmed and appended when a non-letter character follows it, and nothing follows "Singing". (Try "Michael is Singing stuff", for example: everything is written correctly except 'stuff', which is omitted.) There is likely a more elegant way to handle this, but one thing you could try is adding an else clause to the for loop. The else block runs once the loop completes, so we can use it to make sure the final word gets added:

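def algorithm(input):
    p = PorterStemmer()
    # No while loop is needed: input never changes inside it, so the loop
    # either returns on the first pass or never terminates.
    output = ''
    word = ''
    for c in input:
        if c.isalpha():
            word += c.lower()
        else:
            if word:
                # a non-letter ends the current word: stem and emit it
                output += p.stem(word, 0, len(word)-1)
                word = ''
            output += c.lower()
    else:
        # runs when the for loop finishes: flush the final word, if any
        if word:
            output += p.stem(word, 0, len(word)-1)
    return output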
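
If you want to see the effect without the PorterStemmer module at hand, here is a minimal self-contained sketch. The names stem_text, toy_stem and flush_last are made up for this illustration, and toy_stem is only a stand-in that strips a trailing "ing"; it is not the Porter algorithm. With flush_last=False the function behaves like the loop in the question, with the default it also writes the last word.

```python
def stem_text(text, stem_word, flush_last=True):
    """Lowercase text and replace each run of letters with stem_word(run)."""
    output = ''
    word = ''
    for c in text:
        if c.isalpha():
            word += c.lower()
        else:
            if word:
                # a non-letter ends the current word: stem and emit it
                output += stem_word(word)
                word = ''
            output += c.lower()
    if flush_last and word:
        # the step missing from the question's version:
        # the input may end while a word is still being collected
        output += stem_word(word)
    return output


def toy_stem(word):
    # Hypothetical stand-in, NOT the Porter algorithm: strips a trailing "ing".
    if word.endswith('ing') and len(word) > 5:
        return word[:-3]
    return word


print(repr(stem_text("Michael is Singing", toy_stem, flush_last=False)))  # 'michael is '
print(repr(stem_text("Michael is Singing", toy_stem)))                    # 'michael is sing'
```

To use the real stemmer instead, pass something like lambda w: p.stem(w, 0, len(w)-1) as stem_word.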

This has been extensively tested with two test cases, so clearly it is bulletproof :) There are probably some edge cases crawling around there, but hopefully it will get you started.
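
As for a more elegant way, one option (a sketch, not the only approach) is to let re.sub find the words, so there is no "last word" to forget. The names stem_text_re and toy_stem are invented for this example, toy_stem is again just a stand-in rather than the Porter algorithm, and the pattern [a-z]+ assumes ASCII text, whereas isalpha() also accepts other letters.

```python
import re


def toy_stem(word):
    # Hypothetical stand-in, NOT the Porter algorithm: strips a trailing "ing".
    if word.endswith('ing') and len(word) > 5:
        return word[:-3]
    return word


def stem_text_re(text, stem_word):
    # Lowercase everything, then replace every run of ASCII letters
    # with its stem; all other characters pass through unchanged.
    return re.sub(r'[a-z]+', lambda m: stem_word(m.group(0)), text.lower())


print(stem_text_re("Michael is Singing", toy_stem))  # michael is sing
```

With the real stemmer you would pass lambda w: p.stem(w, 0, len(w)-1) as stem_word.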