How can I match words regardless of tense or form?

Posted 2019-02-25 03:28

I am currently working on a script that runs through a document, pulls out all keywords, and then attempts to match these keywords with those found in other documents. There are some specifics that complicate this, but they are not very pertinent to my question. Basically, I would like to be able to match words regardless of the tense in which they appear.

For example: If given the strings "swim", "swam", and "swimming", I would like a program that can recognize that these are all the same word, though whether it would store the word as swim, swam or swimming doesn't matter all that much to me.

I'm aware that this problem could be mostly solved with a dictionary containing all of these word forms, but I am unaware of any dictionary that is mapped in a way that would be useful for this. I would prefer a solution or library that is compatible with Python, since that is what I am currently using for this scripting, but I would be fine with a solution in just about any language (save Haskell or Eiffel or something similarly obscure/difficult to work with).

3 answers
我命由我不由天
Reply #2 · 2019-02-25 03:38

From your question, it sounds like you're looking for a stemming or lemmatization algorithm. A stemmer strips affixes to reduce each word to a common stem, while a lemmatizer maps each word to its dictionary form. One well-known stemmer is the Porter stemming algorithm, which has been around for three decades and has implementations in a variety of languages, including Python. You can find a list of these implementations at http://tartarus.org/martin/PorterStemmer/ .

While the Porter stemmer has been around a long time and is useful as a baseline for comparison, Spaceghost correctly points out that it isn't necessarily the best system available. Snowball (whose English stemmer is sometimes called Porter2), by the same author, is generally regarded as an improvement over the original Porter stemming algorithm.
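One caveat worth illustrating: rule-based stemmers such as Porter work by stripping suffixes, so they generally leave irregular forms like "swam" untouched, and those need a lookup table of some kind. Here is a toy sketch of that combination. It is neither Porter nor Snowball, and the `IRREGULAR` table, the `SUFFIXES` list, and the `normalize` function are made up purely for illustration.

```python
# Toy normalizer: an irregular-form lookup table plus naive suffix stripping.
# IRREGULAR and SUFFIXES are tiny illustrative assumptions, not a real lexicon.
IRREGULAR = {"swam": "swim", "swum": "swim", "ran": "run", "went": "go"}
SUFFIXES = ("ing", "ed", "s")


def normalize(word):
    """Map a word to a rough base form (hypothetical helper)."""
    word = word.lower()
    # Irregular forms cannot be handled by suffix rules, so look them up first.
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling, e.g. "swimm" -> "swim".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


words = ["swim", "swam", "swimming"]
print({w: normalize(w) for w in words})
# -> {'swim': 'swim', 'swam': 'swim', 'swimming': 'swim'}
```

A real stemmer has far more rules than this, and a real lemmatizer ships with a much larger exception list, but the division of labor is the same.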

劳资没心,怎么记你
Reply #3 · 2019-02-25 03:50

Check out pywordnet.

>>> from wordnet import *
>>> N['dog']
dog(n.)
>>> N['dog'].getSenses()
('dog' in {noun: dog, domestic dog, Canis familiaris},
 'dog' in {noun: frump, dog}, 'dog' in {noun: dog},
 'dog' in {noun: cad, bounder, blackguard, dog, hound, heel},
 'dog' in {noun: pawl, detent, click, dog},
 'dog' in {noun: andiron, firedog, dog, dogiron})
聊天终结者
Reply #4 · 2019-02-25 03:50

The problem you describe appears to be a stemming problem. There are some useful stemmers out there, such as the Porter stemmer. More specifically, try implementing it with the NLTK toolkit for Python, which, if I'm not mistaken, comes with a Porter stemmer.
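For what it's worth, a minimal sketch of what that could look like, assuming NLTK is installed (`pip install nltk`). `PorterStemmer` and `SnowballStemmer` are NLTK classes and need no corpus download; `stem_key` is a hypothetical helper name used here for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")


def stem_key(word, stemmer=snowball):
    """Return the stem to use as a matching key (hypothetical helper)."""
    return stemmer.stem(word.lower())


# Compare the two stemmers on the forms from the question.
for word in ["swim", "swims", "swimming", "swam"]:
    print(word, porter.stem(word), snowball.stem(word))
```

Both stemmers reduce "swimming" and "swims" to "swim", but "swam" comes back unchanged, so irregular forms would still need a lemmatizer or a lookup table on top of this.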
