
nlp - How to detect if a word in a sentence is poi

Posted 2019-06-02 10:35

Question:

As the title suggests, I would like to know whether a given word in a sentence refers to:

1] A color

The grass is green.

Hence "green" is a color

2] A body part

Her hands are soft

Hence "hands" is a body part

3] A vehicle

I am driving my car on the causeway

Hence "car" is a vehicle

For similar problems, parsers are one possible effective solution. The Stanford parser, for example, was suggested in response to a similar question:

How to find if a word in a sentence is pointing to a city

The problem is that the Stanford parser can only be used to detect:

LOCATION
ORGANIZATION
DATE
MONEY
PERSON
PERCENT
TIME

However, if you would like to detect something else, WordNet might be an option, as mentioned in a similar question:

How do I list out all English terms in a sentence that indicate an animal?

One of the answers suggested using WordNet and leveraging the hyponym/hypernym relation. The answer also mentioned WordNet's noun.animal file.

The link below lists all the other files in WordNet: https://wordnet.princeton.edu/man/lexnames.5WN.html

My approach is to make use of:

1] noun.body for body parts

2] noun.artifact for vehicles

3] The hyponym/hypernym relationship to detect whether a word refers to a color or not.

So would that be a valid approach?

And how can I make use of the hyponym/hypernym relation in WordNet?

NOTE: I am planning to use JWI (the MIT Java Wordnet Interface).

Answer 1:

The hyponymy/hypernymy approach involves exploring the WordNet hierarchy and the relations between its words.

The hyponyms of a word (of a Synset, to be more accurate) represent concepts that are more specific, while hypernyms represent concepts that are more general. Given the tree-like structure of WordNet, you can view the hyponyms as children of the word (node) you are looking at, and the hypernyms as its parents.

As an example, taking the hyponyms and the hypernyms of the word dog:

from nltk.corpus import wordnet as wn

# First synset (sense) listed for 'dog'
dog = wn.synsets('dog')[0]
print(dog.hypernyms())
print(dog.hyponyms())

yields the following results:

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

[Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), 
Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), 
Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), 
Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), 
Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), 
Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), 
Synset('toy_dog.n.01'), Synset('working_dog.n.01')]

In a similar manner, if we wanted for example to know which words represent colours, we could explore the hypernyms of different words representing colours, hoping that they would have a common ancestor (hypernym). In this sense, I have done the following experiments:

print(wn.synsets('green')[0].hypernyms())
print(wn.synsets('blue')[0].hypernyms())
print(wn.synsets('red')[0].hypernyms())
print(wn.synsets('yellow')[0].hypernyms())

all of which share the same hypernym list:

[Synset('chromatic_color.n.01')]

Also

print(wn.synsets('black')[0].hypernyms())
print(wn.synsets('gray')[0].hypernyms())

yield the result

[Synset('achromatic_color.n.01')]

The next thing we can do is print all the hyponyms of these resulting synsets:

print(wn.synset('chromatic_color.n.01').hyponyms())
print(wn.synset('achromatic_color.n.01').hyponyms())

which give the results

[Synset('blond.n.02'), Synset('blue.n.01'), Synset('brown.n.01'), 
Synset('complementary_color.n.01'), Synset('green.n.01'), 
Synset('olive.n.05'), Synset('orange.n.02'), Synset('pastel.n.01'), 
Synset('pink.n.01'), Synset('purple.n.01'), Synset('red.n.01'), 
Synset('salmon.n.04'), Synset('yellow.n.01')]

[Synset('black.n.01'), Synset('gray.n.01'), Synset('white.n.02')]

The same technique could be applied to explore options relating to body parts or vehicles.
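To turn this exploration into a yes/no check, one option is to walk upward through the hypernyms of a synset and see whether a chosen ancestor is ever reached. The sketch below is only an illustration: `has_ancestor`, `toy_parents`, `is_color` and the `TOY_HYPERNYMS` dictionary are hypothetical names, and the dictionary is a tiny hand-copied fragment built from the results above. The two edges leading to `color.n.01` do not appear in the output above and are assumed here, so they should be checked against WordNet itself.

```python
from collections import deque

def has_ancestor(start, target, parents_of):
    """Return True if `target` is `start` or is reachable by following parents_of."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for parent in parents_of(node):
            if parent not in seen:  # a node can be reached via several paths
                seen.add(parent)
                queue.append(parent)
    return False

# Toy fragment of the hypernym graph, copied by hand from the results above.
TOY_HYPERNYMS = {
    "green.n.01": ["chromatic_color.n.01"],
    "black.n.01": ["achromatic_color.n.01"],
    "dog.n.01": ["canine.n.02", "domestic_animal.n.01"],
    # Assumption: both color groups share the parent color.n.01.
    "chromatic_color.n.01": ["color.n.01"],
    "achromatic_color.n.01": ["color.n.01"],
}

def toy_parents(name):
    """Look up the direct hypernyms of a synset name in the toy graph."""
    return TOY_HYPERNYMS.get(name, [])

def is_color(name):
    """Check whether a synset name has color.n.01 among its ancestors."""
    return has_ancestor(name, "color.n.01", toy_parents)

print(is_color("green.n.01"))
print(is_color("black.n.01"))
print(is_color("dog.n.01"))
```

With NLTK, the same function should work on real synsets by passing `lambda s: s.hypernyms()` as `parents_of` and a synset such as `wn.synset('color.n.01')` as the target. Since a word usually has several synsets, each sense returned by `wn.synsets(word)` would need to be checked.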

Also, for derivative words such as reddish, I know of two ways to work around their absence:

  • Stemming the tokenized text with the Porter Stemmer (see this link)
  • Using Morphy to get the base forms, which you can then look up in WordNet (see this link for details on Morphy). I would recommend this method, since stemming can yield words that do not exist in WordNet