Gender identification in natural language processing

Posted 2019-04-02 08:53

I wrote the code below using the Stanford NLP packages.

GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);

But for the sentence "Annie goes to school", it is not able to identify the gender of Annie.

The output of the application is:

     [Text=Annie CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Annie NamedEntityTag=PERSON] 
     [Text=goes CharacterOffsetBegin=6 CharacterOffsetEnd=10 PartOfSpeech=VBZ Lemma=go NamedEntityTag=O] 
     [Text=to CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=TO Lemma=to NamedEntityTag=O] 
     [Text=school CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NN Lemma=school NamedEntityTag=O] 
     [Text=. CharacterOffsetBegin=20 CharacterOffsetEnd=21 PartOfSpeech=. Lemma=. NamedEntityTag=O]

What is the correct approach to get the gender?

3 answers
仙女界的扛把子
#2 · 2019-04-02 09:35

The gender annotator doesn't add the information to the text output but you can still access it through code as shown in the following snippet:

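import java.util.Properties;

import edu.stanford.nlp.ie.machinereading.structure.MachineReadingAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Run the gender annotator as the last stage of a pipeline.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation("Annie goes to school");

pipeline.annotate(document);

// The gender is stored on each token and has to be read explicitly.
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
  for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.print(token.value());
    System.out.print(", Gender: ");
    System.out.println(token.get(MachineReadingAnnotations.GenderAnnotation.class));
  }
}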

Output:

Annie, Gender: FEMALE
goes, Gender: null
to, Gender: null
school, Gender: null
何必那么认真
#3 · 2019-04-02 09:36

If your named entity recognizer outputs PERSON for a token, you might use (or build if you don't have one) a gender classifier based on first names. As an example, see the Gender Identification section from the NLTK library tutorial pages. They use the following features:

  • Last letter of name.
  • First letter of name.
  • Length of name (number of characters).
  • Character unigram presence (boolean whether a character is in the name).
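The four feature types listed above could be computed roughly as follows. This is only a sketch: the function name `name_features` and the feature key names are made up for illustration, and it assumes a non-empty name written with ASCII letters.

```python
import string


def name_features(name):
    """Build the four feature types listed above for one first name."""
    lowered = name.lower()
    features = {
        "last_letter": lowered[-1],
        "first_letter": lowered[0],
        "length": len(lowered),
    }
    # One boolean per letter: does the letter occur anywhere in the name?
    for letter in string.ascii_lowercase:
        features["has({})".format(letter)] = letter in lowered
    return features


annie = name_features("Annie")
print(annie["first_letter"], annie["last_letter"], annie["length"])
```

The result is a plain dictionary, which is the shape of feature set that NLTK classifiers take as input.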

That said, I have a hunch that using character n-gram frequencies (possibly up to character trigrams) will give you pretty good results.
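A minimal sketch of what character n-gram counts up to trigrams might look like. The helper name `char_ngram_features` and the `^` / `$` boundary markers are assumptions made for this example, not something taken from the NLTK tutorial.

```python
from collections import Counter


def char_ngram_features(name, max_n=3):
    """Count character n-grams of length 1..max_n in a name."""
    # "^" and "$" mark the start and end of the name so that prefixes
    # and suffixes become n-grams of their own (an illustrative choice).
    padded = "^" + name.lower() + "$"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return dict(counts)


features = char_ngram_features("Annie")
print(features["n"], features["ie$"])
```

Whether these counts actually beat the simpler letter features is something you would have to measure on a held-out set of names.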

疯言疯语
#4 · 2019-04-02 09:43

There are many approaches; one of them is outlined in the NLTK cookbook.

Basically, you build a classifier that extracts some features from a name (first letter, last letter, first two letters, last two letters, and so on) and makes a prediction based on those features.

import random

import nltk

# The names corpus must be available locally; run nltk.download('names') once.


def extract_features(name):
    """Return prefix and suffix features for a first name."""
    name = name.lower()
    return {
        'last_char': name[-1],
        'last_two': name[-2:],
        'last_three': name[-3:],
        'first': name[0],
        'first2': name[:2],  # first two letters
    }


f_names = nltk.corpus.names.words('female.txt')
m_names = nltk.corpus.names.words('male.txt')

# Label every name, then shuffle so both genders are mixed in each split.
all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]
random.shuffle(all_names)

# Train on the first 500 shuffled names and test on the rest.
test_set = all_names[500:]
train_set = all_names[:500]

test_set_feat = [(extract_features(n), g) for n, g in test_set]
train_set_feat = [(extract_features(n), g) for n, g in train_set]

classifier = nltk.NaiveBayesClassifier.train(train_set_feat)

print(nltk.classify.accuracy(classifier, test_set_feat))

This basic test gives approximately 77% accuracy.
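The snippet above shuffles the names with an unseeded `random.shuffle`, so the accuracy it prints changes from run to run. If you want a repeatable number, one option is a seeded split like the sketch below. The helper `split_names` and the toy data are hypothetical stand-ins for the labeled `all_names` list.

```python
import random


def split_names(labeled_names, n_train=500, seed=0):
    """Shuffle a copy of the labeled names reproducibly, then split it."""
    shuffled = list(labeled_names)
    # A private Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Toy stand-in for the (name, label) pairs built from the names corpus.
toy = [("name{}".format(i), "f" if i % 2 else "m") for i in range(1000)]
train, test = split_names(toy)
print(len(train), len(test))
```

Calling `split_names` twice with the same seed returns the same two lists, so the accuracy measured on them stays fixed.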
