Programmatic Way to Determine Related Words?

Posted 2019-07-15 20:31

Question:

Using a web service or software library, I'd like to be able to discern words related by a root word (e.g., "seated" and "seatbelt" share the root word "seat" but "Seattle" wouldn't be considered a match). Simple string comparison seems unfeasible for this sort of thing.

Short of defining my own dictionary, are there any libraries or web services that can not merely return word definitions, but return a word's "root words" so I can perform this type of check?

Answer 1:

Here is the Snowball stemmer for English.

You can use it like this:

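// Stem a single word with the Snowball English stemmer.
SF.Snowball.Ext.EnglishStemmer eng = new SF.Snowball.Ext.EnglishStemmer();
eng.SetCurrent("Seated");             // the word to stem
eng.Stem();                           // stems the current word in place
Console.WriteLine(eng.GetCurrent());  // prints "Seat"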
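
To use stems for the check the question asks about, you would stem both words and compare the results. Below is a minimal Python sketch of that comparison. Note that naive_stem is a deliberately crude stand-in written for this illustration (it is not the Snowball algorithm), and share_stem is a hypothetical helper name; in practice you would pass a real stemmer as the stem argument.

```python
def naive_stem(word):
    """Toy suffix stripper for illustration only; NOT the Snowball algorithm."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Strip the suffix only if at least three letters remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w


def share_stem(a, b, stem=naive_stem):
    """Return True when both words reduce to the same stem."""
    return stem(a) == stem(b)


if __name__ == "__main__":
    for pair in [("Seated", "seating"), ("Seated", "Seattle"), ("Seated", "seatbelt")]:
        print(pair, share_stem(*pair))
```

With this toy stemmer, "Seated" and "seating" match and "Seattle" does not, but "seatbelt" does not match either: suffix stripping by itself does not split compounds, so that case would need an additional check.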


Answer 2:

This is a complicated thing to ask a computer to do, but there are ways and means.

This related question covers a few techniques:

A "regex for words" (semantic replacement) - any example syntax and libraries?

But if you want to experiment yourself, you could look at phonemes and at phonetic algorithms such as Soundex or Double Metaphone. Search Wikipedia for 'Phonetic algorithms'.

The idea is that you work out what a word sounds like and encode that as a short code, which you can then compare against a dictionary whose entries have been precomputed with the same encoding.

What this will do is reduce the dictionary to a (hopefully) workable set of data that you will have to analyse somehow.

For your specific example, though, you'll have to compare the algorithmic values of Seated, Seatbelt and Seattle.
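
As a rough illustration of the precomputed-dictionary idea, here is a minimal Python sketch of the classic American Soundex encoding, plus an index that groups words by code. The names soundex and build_index are made up for this example, and details such as padding to four characters follow the common textbook description rather than any particular library.

```python
from collections import defaultdict

# Consonant groups used by classic American Soundex.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return (result + "000")[:4]


def build_index(words):
    """Precompute a code -> words table for a dictionary."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index


if __name__ == "__main__":
    for w in ("Seated", "Seatbelt", "Seattle"):
        print(w, soundex(w))  # S330, S314, S340
```

The three words from the question get three different codes here, so Soundex by itself does not separate "seat" words from "Seattle"; as described above, it only narrows the dictionary down to a candidate set that still needs further analysis.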

Anyway, I know this isn't a full answer, but I hope it's enough to get you started.

Good luck!



Answer 3:

I remember reading a somewhat related question on this site where the best answer was to download a copy of Wikipedia and discard everything you don't need.

I checked some popular etymology and root-word search websites (wordinfo, prefixsuffix, and etymonline), and they all failed with seat as the query.

If seat was just an example, and the three most popular services for finding related words already fail on it, they probably will not be your best bet. For this reason I would recommend Wiktionary.

Almost every page on Wiktionary is very detailed; even for seat, it lists the related word forms under the Verb section.

seat (third-person singular simple present seats, present participle seating, simple past and past participle seated)

They are even bolded and hyperlinked, so it would be trivial to parse them into a local dictionary.

Personally, I much prefer having a local table to using a web service, because a web service can go down, can be slow, and requires your users to be connected to the internet in order to use your application.
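
As a sketch of what such a local table could look like, the Python below parses the plain-text headword line quoted above into a form-to-root mapping. It assumes the line has already been extracted as text in exactly that shape (a real Wiktionary page is HTML or wikitext and would need its own extraction step), and the names parse_headword, add_to_table and share_root are hypothetical.

```python
import re

HEADWORD = ("seat (third-person singular simple present seats, "
            "present participle seating, simple past and past participle seated)")


def parse_headword(line):
    """Split 'root (label form, label form, ...)' into (root, [forms])."""
    m = re.match(r"\s*(\w+)\s*\((.*)\)\s*$", line)
    if not m:
        raise ValueError("unexpected headword format: %r" % line)
    root, body = m.group(1), m.group(2)
    # Assumption: the inflected form is the last word of each comma-separated part.
    forms = [part.split()[-1] for part in body.split(",") if part.strip()]
    return root, forms


def add_to_table(table, line):
    """Record the root and each listed form as form -> root in the local table."""
    root, forms = parse_headword(line)
    table[root] = root
    for form in forms:
        table[form] = root


def share_root(table, a, b):
    """True only if both words are in the table and map to the same root."""
    ra, rb = table.get(a.lower()), table.get(b.lower())
    return ra is not None and ra == rb


if __name__ == "__main__":
    table = {}
    add_to_table(table, HEADWORD)
    print(table)
    print(share_root(table, "Seated", "seating"))  # True
    print(share_root(table, "Seated", "Seattle"))  # False
```

The table only knows the forms that appear on the lines fed into it, so words that the quoted line does not list are simply reported as unrelated.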