Converting plural to singular in a text file with

2020-03-01 19:52发布

问题:

I have txt files that look like this:

word, 23
Words, 2
test, 1
tests, 4

And I want them to look like this:

word, 23
word, 2
test, 1
test, 4

I want to be able to take a txt file in Python and convert plural words to singular. Here's my code:

import nltk

f = raw_input("Please enter a filename: ")

def openfile(f):
    with open(f,'r') as a:
       a = a.read()
       a = a.lower()
       return a

def stem(a):
    p = nltk.PorterStemmer()
    [p.stem(word) for word in a]
    return a

def returnfile(f, a):
    with open(f,'w') as d:
        d = d.write(a)
    #d.close()

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))

I have also tried these 2 definitions instead of the stem definition:

def singular(a):
    for line in a:
        line = line[0]
        line = str(line)
        stemmer = nltk.PorterStemmer()
        line = stemmer.stem(line)
        return line

def stem(a):
    for word in a:
        for suffix in ['s']:
            if word.endswith(suffix):
                return word[:-len(suffix)]
            return word

Afterwards I'd like to take duplicate words (e.g. test and test) and merge them by adding up the numbers next to them. For example:

word, 25
test, 5

I'm not sure how to do that. A solution would be nice but not necessary.

回答1:

It seems like you're pretty familiar with Python, but I'll still try to explain some of the steps. Let's start with the first question of depluralizing words. When you read in a multiline file (the word, number csv in your case) with a.read(), you're going to be reading the entire body of the file into one big string.

def openfile(f):
    with open(f,'r') as a:
        a = a.read() # a will equal 'soc, 32\nsoc, 1\n...' in your example
        a = a.lower()
        return a

This is fine and all, but when you want to pass the result into stem(), it will be as one big string, and not as a list of words. This means that when you iterate through the input with for word in a, you will be iterating through each individual character of the input string and applying the stemmer to those individual characters.

def stem(a):
    p = nltk.PorterStemmer()
    a = [p.stem(word) for word in a] # ['s', 'o', 'c', ',', ' ', '3', '2', '\n', ...]
    return a

This definitely doesn't work for your purposes, and there are a few different things we can do.

  1. We can change it so that we read the input file as one list of lines
  2. We can use the big string and break it down into a list ourselves.
  3. We can go through and stem each line in the list of lines one at a time.

Just for expedience's sake, let's roll with #1. This will require changing openfile(f) to the following:

def openfile(f):
    with open(f,'r') as a:
        a = a.readlines() # a will equal 'soc, 32\nsoc, 1\n...' in your example
        b = [x.lower() for x in a]
        return b

This should give us b as a list of lines, i.e. ['soc, 32', 'soc, 1', ...]. So the next problem becomes what do we do with the list of strings when we pass it to stem(). One way is the following:

def stem(a):
    p = nltk.PorterStemmer()
    b = []
    for line in a:
        split_line = line.split(',') #break it up so we can get access to the word
        new_line = str(p.stem(split_line[0])) + ',' + split_line[1] #put it back together 
        b.append(new_line) #add it to the new list of lines
    return b

This is definitely a pretty rough solution, but should adequately iterate through all of the lines in your input, and depluralize them. It's rough because splitting strings and reassembling them isn't particularly fast when you scale it up. However, if you're satisfied with that, then all that's left is to iterate through the list of new lines, and write them to your file. In my experience it's usually safer to write to a new file, but this should work fine.

def returnfile(f, a):
    with open(f,'w') as d:
        for line in a:
            d.write(line)


print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))

When I have the following input.txt

soc, 32
socs, 1
dogs, 8

I get the following stdout:

Please enter a filename: input.txt
['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']
['soc, 32\n', 'soc, 1\n', 'dog, 8\n']
None

And input.txt looks like this:

soc, 32
soc, 1
dog, 8

The second question regarding merging numbers with the same words changes our solution from above. As per the suggestion in the comments, you should take a look at using dictionaries to solve this. Instead of doing this all as one big list, the better (and probably more pythonic) way to do this is to iterate through each line of your input, and stemming them as you process them. I'll write up code about this in a bit, if you're still working to figure it out.



回答2:

If you have complex words to singularize, I don't advise you to use stemming but a proper python package link pattern :

from pattern.text.en import singularize

plurals = ['caresses', 'flies', 'dies', 'mules', 'geese', 'mice', 'bars', 'foos',
           'families', 'dogs', 'child', 'wolves']

singles = [singularize(plural) for plural in plurals]
print singles

returns:

>>> ['caress', 'fly', 'dy', 'mule', 'goose', 'mouse', 'bar', 'foo', 'foo', 'family', 'family', 'dog', 'dog', 'child', 'wolf']

It's not perfect but it's the best I found. 96% based on the docs : http://www.clips.ua.ac.be/pages/pattern-en#pluralization



回答3:

The Nodebox English Linguistics library contains scripts for converting plural form to single form and vice versa. Checkout tutorial: https://www.nodebox.net/code/index.php/Linguistics#pluralization

To convert plural to single just import singular module and use singular() function. It handles proper conversions for words with different endings, irregular forms, etc.

from en import singular
print(singular('analyses'))   
print(singular('planetoids'))
print(singular('children'))
>>> analysis
>>> planetoid
>>> child