
How do I list out all English terms for animals in a sentence

Posted 2019-07-04 06:25


For example, in the sentence "The two horses had just lain down when a brood of ducklings, which had lost their mother, filed into the barn, cheeping feebly and wandering from side to side to find some place where they would not be trodden on.", there are two animals: horse and duck.

I was looking for vocabulary lists of animal names but was unable to find anything complete enough. The WordNet database looks promising, but it may be overkill, and it may not be broad enough either.


WordNet is an excellent tool, and I think you are on the right track. The relation you are looking for is the hyponym/hypernym relation: the noun horse is a hyponym of animal and, conversely, animal is a hypernym of horse. WordNet provides the data needed to check whether two nouns stand in this relationship.
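To make the hyponym/hypernym check concrete, here is a minimal sketch that walks hypernym links upward until it reaches the target term or runs out of parents. The `HYPERNYM` table and the `is_hyponym_of` helper are hypothetical, hand-written for illustration; a real solution would read the relation from WordNet, where the hierarchy is much larger and a sense can have more than one hypernym.

```python
# Toy hypernym table: each term maps to its direct hypernym (parent).
# Hypothetical data for illustration only, not taken from WordNet.
HYPERNYM = {
    "horse": "equine",
    "equine": "mammal",
    "mammal": "animal",
    "duck": "bird",
    "bird": "animal",
    "barn": "building",
    "building": "structure",
}


def is_hyponym_of(term, ancestor):
    """Return True if `ancestor` is reachable from `term` via hypernym links."""
    seen = set()  # guards against accidental cycles in the table
    current = term
    while current in HYPERNYM and current not in seen:
        seen.add(current)
        current = HYPERNYM[current]
        if current == ancestor:
            return True
    return False


print(is_hyponym_of("horse", "animal"))  # True
print(is_hyponym_of("barn", "animal"))   # False
```

The check is transitive on purpose: horse is not a direct child of animal in this table, but it is still found by following the chain.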

Speaking of WordNet, you will probably find all animals in the noun.animal file. This may make your particular problem simpler.

To go from duckling to duck, you would navigate WordNet's sister-term relation, which gives a collection of related words. I am not sure how many false positives that would produce, but there will probably be some. Duck and duckling are also listed as derivationally related, but lion and cub are not. This might be a moot point, since both duckling and cub are, in some word senses, animals themselves.

You must, however, tag parts of speech and take only nouns into account; otherwise you would get false positives when the sentence uses the verbs "to horse around" and "to duck" (jerk down). Part-of-speech (POS) tagging is a whole problem in itself, and you probably want to look at some of the existing libraries that do it. The most successful ones use a statistical approach; their results are fairly robust, though not 100% correct.
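The noun-only filter can be sketched as follows, assuming the tokens have already been tagged by some external POS tagger. The tags below are Penn Treebank style and were assigned by hand for this example; `ANIMAL_NOUNS`, `naive_singular`, and `animals_in` are hypothetical names, and the plural stripping is deliberately crude (a real lemmatizer would replace it).

```python
# Hypothetical animal vocabulary; in practice it would be derived from WordNet.
ANIMAL_NOUNS = {"horse", "duck", "duckling"}


def naive_singular(word):
    """Strip a trailing 's'. Crude, and only adequate for this sketch."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def animals_in(tagged_tokens):
    """Return animal nouns found among (word, tag) pairs, in order, no repeats."""
    found = []
    for word, tag in tagged_tokens:
        if not tag.startswith("NN"):  # keep NN, NNS, NNP, NNPS; skip verbs etc.
            continue
        lemma = naive_singular(word.lower())
        if lemma in ANIMAL_NOUNS and lemma not in found:
            found.append(lemma)
    return found


# Hand-tagged example: "horsing" and "duck" are verbs here, "horses" is a noun.
tagged = [
    ("Stop", "VB"), ("horsing", "VBG"), ("around", "RP"), ("and", "CC"),
    ("duck", "VB"), ("when", "WRB"), ("the", "DT"), ("horses", "NNS"),
    ("run", "VBP"),
]
print(animals_in(tagged))  # ['horse']
```

Without the tag check, the verb "duck" would have been reported as an animal, which is exactly the false positive described above.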

Also, you will inevitably get another type of false positive, from noun homonymy. For example, a horse may refer to a piece of gymnastics equipment, which is obviously not an animal. Duck can also refer to a type of fabric. Without deeper context you are unlikely to be able to resolve such homonymy, and short of a general intelligence that fully understands the text, the problem cannot be solved exactly.
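Since homonymy cannot be resolved reliably, one pragmatic option is to report it rather than hide it. The sketch below labels each noun as an animal, not an animal, or ambiguous, depending on the categories of its senses. The `SENSES` table and its category names are hypothetical and hand-written for illustration; they are not WordNet's actual sense inventory.

```python
# Hypothetical sense inventory: noun -> category of each of its senses.
SENSES = {
    "horse": ["animal", "artifact"],  # the animal; the gymnastics equipment
    "duck": ["animal", "fabric"],     # the bird; the cloth
    "hedgehog": ["animal"],
    "barn": ["artifact"],
}


def classify(noun):
    """Return 'animal', 'ambiguous', or 'not_animal' for a noun."""
    categories = set(SENSES.get(noun, []))
    if "animal" not in categories:
        return "not_animal"
    if categories == {"animal"}:
        return "animal"
    return "ambiguous"  # at least one animal sense and one other sense


for noun in ["horse", "hedgehog", "barn"]:
    print(noun, classify(noun))
# horse ambiguous
# hedgehog animal
# barn not_animal
```

Whether ambiguous nouns should count as animals is a policy choice: accepting them favors recall, rejecting them favors precision.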