How can I match words regardless of tense or form?

Posted 2019-02-25 03:53

Question:

I am currently working on a script that runs through a document, pulls out all keywords, and then attempts to match these keywords with those found in other documents. There are some specifics that complicate this, but they are not very pertinent to my question. Basically, I would like to be able to match words regardless of the tense in which they appear.

For example: given the strings "swim", "swam", and "swimming", I would like a program that can recognize that these are all the same word. Whether it stores the word as swim, swam, or swimming doesn't matter much to me.

I'm aware that this problem could be mostly solved with a dictionary containing all of these word forms, but I am unaware of any dictionary that is mapped in a way that would be useful for this. I would prefer a solution or library that is compatible with Python, since that is what I am currently using for this script, but I would be fine with a solution in just about any language (save Haskell or Eiffel or something similarly obscure or difficult to work with).

Answer 1:

Check out pywordnet.

>>> from wordnet import *
>>> N['dog']
dog(n.)
>>> N['dog'].getSenses()
('dog' in {noun: dog, domestic dog, Canis familiaris},
 'dog' in {noun: frump, dog}, 'dog' in {noun: dog},
 'dog' in {noun: cad, bounder, blackguard, dog, hound, heel},
 'dog' in {noun: pawl, detent, click, dog},
 'dog' in {noun: andiron, firedog, dog, dogiron})


Answer 2:

From your question, it sounds like you're looking for a stemming or lemmatization algorithm. Both map the different forms of a word to a common base: a stemmer strips affixes to produce a stem (which need not be a real word), while a lemmatizer returns the dictionary form. One well-known stemmer is the Porter Stemming algorithm, which has been around for three decades and has implementations in a variety of languages, including Python. You can find a list of these implementations at http://tartarus.org/martin/PorterStemmer/ .

While the Porter stemmer has been around a long time and can be useful as a baseline for comparison, Spaceghost correctly points out that it isn't necessarily the best system available. Snowball is supposed to be better than the Porter stemming algorithm.
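
To make the idea concrete, here is a deliberately tiny sketch of what a suffix-stripping stemmer does, combined with a small lookup table for irregular forms. This is not the Porter or Snowball algorithm; the names `toy_stem`, `IRREGULAR`, and `SUFFIXES` and the rules themselves are made up for illustration, and a real stemmer handles far more cases.

```python
# Toy normalizer: NOT the Porter or Snowball algorithm, just an illustration.

# Hypothetical hand-written table for irregular forms that no suffix rule can catch.
IRREGULAR = {"swam": "swim", "swum": "swim"}

# Suffixes to strip, longest first so "ing" is tried before "s".
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Map a word to a crude base form."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "swimm" -> "swim".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


forms = ["swim", "swam", "swimming", "swims"]
print({w: toy_stem(w) for w in forms})
```

With this table, all four forms collapse to "swim". Note that "swam" only matches because of the explicit table entry; suffix rules alone cannot relate it to "swim".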



Answer 3:

The problem you describe appears to be a stemming problem. There are some useful stemmers out there, such as the Porter stemmer. More specifically, try implementing it with NLTK, the Natural Language Toolkit for Python, which, if I'm not mistaken, comes with a Porter stemmer.
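
A minimal sketch of that suggestion, assuming NLTK is installed (for example via `pip install nltk`). It uses `nltk.stem.PorterStemmer` and `nltk.stem.SnowballStemmer`; the helper `same_stem` is a hypothetical name introduced only for this example.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")


def same_stem(a: str, b: str) -> bool:
    """Hypothetical helper: treat two keywords as equal if their Porter stems match."""
    return porter.stem(a) == porter.stem(b)


for word in ["swim", "swims", "swimming", "swam"]:
    # Print each word alongside its Porter and Snowball stems.
    print(word, porter.stem(word), snowball.stem(word))

print(same_stem("swims", "swimming"))
```

Regular inflections such as "swims" and "swimming" reduce to "swim", but a suffix-stripping stemmer leaves the irregular past tense "swam" untouched. Matching irregular forms generally needs a lemmatizer backed by a dictionary (NLTK ships a WordNet-based one, which requires downloading the WordNet data separately).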