I have written the code below using the Stanford NLP packages.
GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);
However, for the sentence "Annie goes to school", it does not identify the gender of Annie.
The output of the application is:
[Text=Annie CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Annie NamedEntityTag=PERSON]
[Text=goes CharacterOffsetBegin=6 CharacterOffsetEnd=10 PartOfSpeech=VBZ Lemma=go NamedEntityTag=O]
[Text=to CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=TO Lemma=to NamedEntityTag=O]
[Text=school CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NN Lemma=school NamedEntityTag=O]
[Text=. CharacterOffsetBegin=20 CharacterOffsetEnd=21 PartOfSpeech=. Lemma=. NamedEntityTag=O]
What is the correct approach to get the gender?
The gender annotator doesn't add the information to the text output, but you can still access it through code, as shown in the following snippet:
Output:
If your named entity recognizer outputs PERSON for a token, you might use (or build, if you don't have one) a gender classifier based on first names. As an example, see the Gender Identification section of the NLTK library tutorial pages. They use the following features:
That said, I have a hunch that using character n-gram frequencies, possibly up to character trigrams, will give you pretty good results.
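To make the n-gram idea concrete, here is a minimal sketch in Java of extracting character n-grams (up to trigrams) from a first name. The class name `NameNgrams`, the method `charNgrams`, and the `^`/`$` boundary markers are my own assumptions for illustration; they are not part of NLTK or Stanford CoreNLP. You would still need to count these n-grams over a labeled name list and feed them to a classifier of your choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not from NLTK or CoreNLP): character n-gram features for a name.
final class NameNgrams {

    // Returns every character n-gram of length 1..maxN from the lower-cased name.
    // The name is padded with '^' and '$' (an assumed convention) so that
    // n-grams at the start and end of the name stay distinguishable.
    static List<String> charNgrams(String name, int maxN) {
        String padded = "^" + name.trim().toLowerCase(Locale.ROOT) + "$";
        List<String> grams = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                grams.add(padded.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Print the unigrams, bigrams and trigrams for one example name.
        System.out.println(charNgrams("Annie", 3));
    }
}
```

The boundary markers let a classifier pick up on endings such as "ie" at the end of a name separately from the same letters in the middle of a name.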
There are many approaches; one of them is outlined in the NLTK cookbook.
Basically, you build a classifier that extracts some features (first letter, last letter, first two letters, last two letters, and so on) from a name and makes a prediction based on these features.
This basic test gives you approximately 77% accuracy.
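The feature-extraction step described above could look roughly like this in Java. This is only a sketch of the features, not the NLTK cookbook's code and not a trained classifier: the class name `NameFeatures` and the feature keys are hypothetical, and the accuracy figure quoted above does not apply to it.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: prefix/suffix features of a first name for a gender classifier.
final class NameFeatures {

    // Maps feature names (assumed labels) to values taken from the lower-cased name.
    // Returns an empty map for a blank name.
    static Map<String, String> extract(String name) {
        String n = name.trim().toLowerCase(Locale.ROOT);
        Map<String, String> features = new LinkedHashMap<>();
        if (n.isEmpty()) {
            return features;
        }
        features.put("first_letter", n.substring(0, 1));
        features.put("last_letter", n.substring(n.length() - 1));
        // Math.min / Math.max keep one-letter names from going out of bounds.
        features.put("first_two", n.substring(0, Math.min(2, n.length())));
        features.put("last_two", n.substring(Math.max(0, n.length() - 2)));
        return features;
    }

    public static void main(String[] args) {
        // Show the features for a token that NER tagged as PERSON.
        System.out.println(extract("Annie"));
    }
}
```

You would call `extract` only on tokens tagged PERSON, then pass the resulting map to whatever classifier you train on a labeled list of names.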