Why is stanford corenlp gender identification nond

Posted 2020-07-23 04:13

Question:

I get the following results. As you can see, the name edward gets two different results (null and MALE). This has happened with several names.

edward, Gender: null
james, Gender: MALE
karla, Gender: null
edward, Gender: MALE

Additionally, how can I customize the gender dictionaries? I want to add Spanish and Chinese names.

Answer 1:

You have raised a lot of issues!

1.) Karla is not in the default gender mappings file, which is why it gets null.

2.) If you want to make your own custom file, it should be in this format:

JOHN\tMALE

There should be one NAME\tGENDER entry per line, where \t stands for a tab character.

The GenderAnnotator can only take one file for the mappings, so you need to make a new file that contains the default entries plus the names you want to add.
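Since only one mappings file can be used, one way to build it is to merge the default entries with your own. Below is a minimal sketch of that idea using only the Java standard library. GenderMapMerger and its methods are hypothetical names, not part of CoreNLP. Uppercasing the names simply follows the JOHN\tMALE example above and is an assumption, as is the choice to let custom entries override default ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of CoreNLP): merges NAME\tGENDER lines
// from a default list and a custom list into a single mapping.
class GenderMapMerger {

    // Later entries override earlier ones that share the same name.
    static Map<String, String> merge(List<String> defaultLines, List<String> customLines) {
        Map<String, String> merged = new LinkedHashMap<>();
        addAll(merged, defaultLines);
        addAll(merged, customLines);
        return merged;
    }

    private static void addAll(Map<String, String> target, List<String> lines) {
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            // Assumption: names and genders are stored uppercased, as in JOHN\tMALE.
            String name = parts[0].trim().toUpperCase(Locale.ROOT);
            String gender = parts[1].trim().toUpperCase(Locale.ROOT);
            target.put(name, gender);
        }
    }

    // Renders the merged map back into one NAME\tGENDER entry per line.
    static List<String> toLines(Map<String, String> merged) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> entry : merged.entrySet()) {
            out.add(entry.getKey() + "\t" + entry.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative data only; real input would come from the two files.
        List<String> defaults = Arrays.asList("JOHN\tMALE", "JAMES\tMALE");
        List<String> custom = Arrays.asList("karla\tFEMALE", "JOSÉ\tMALE");
        for (String line : toLines(merge(defaults, custom))) {
            System.out.println(line);
        }
    }
}
```

In practice the two input lists could be read with java.nio.file.Files.readAllLines and the result written with Files.write, after which "gender.firstnames" would point at the written file.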

The default gender mappings file is in the stanford-corenlp-3.5.2-models.jar file.

You can extract the default gender mappings file from that jar in this manner:

  • mkdir tmp-stanford-models-expanded

  • cp /path/of/stanford-corenlp-3.5.2-models.jar tmp-stanford-models-expanded

  • cd tmp-stanford-models-expanded

  • jar xf stanford-corenlp-3.5.2-models.jar

  • there should now be tmp-stanford-models-expanded/edu

  • the file you want is tmp-stanford-models-expanded/edu/stanford/nlp/models/gender/first_name_map_small
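If you would rather not unpack the whole jar, the same file can probably be read straight out of it, since a jar is a zip archive. The following is a rough sketch using java.util.zip.ZipFile. JarEntryReader is a hypothetical name, and treating the file as UTF-8 text is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of CoreNLP): reads one text entry from a jar.
class JarEntryReader {

    // Returns the lines of the named entry, or throws if the entry is missing.
    static List<String> readLines(String jarPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(jarPath)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No entry " + entryName + " in " + jarPath);
            }
            List<String> lines = new ArrayList<>();
            // Assumption: the mappings file is plain UTF-8 text.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(zip.getInputStream(entry), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JarEntryReader /path/of/stanford-corenlp-3.5.2-models.jar");
            return;
        }
        // Entry path taken from the extraction steps above.
        String entryName = "edu/stanford/nlp/models/gender/first_name_map_small";
        for (String line : readLines(args[0], entryName)) {
            System.out.println(line);
        }
    }
}
```

Redirecting the printed lines to a file gives you a copy of the default mappings to extend.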

3.) Build your pipeline in this manner to use your custom gender dictionary:

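import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
Properties props = new Properties();
// gender is listed before ner on purpose (see point 4)
props.setProperty("annotators",
    "tokenize, ssplit, pos, lemma, gender, ner");
// path to your custom NAME\tGENDER mappings file
props.setProperty("gender.firstnames", "/path/to/your/gender_dictionary.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);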

4.) Try running gender BEFORE ner in your pipeline (see my ordering of the annotators above). The RegexNERSequenceClassifier (the class that adds the gender tags) can get blocked if tokens already have NER tags. It looks to me like running the gender annotator first will fix the problem, so when you build the pipeline, make sure gender comes before ner.

The sequence "edward james karla edward" is tagged "O O PERSON PERSON" by the NER tagger. I am not entirely sure why the first two tokens get "O" for their NER tags. Note that "Edward James Karla Edward" yields "PERSON PERSON PERSON PERSON". Keep in mind that the NER tagger factors in position in the sentence, so perhaps being lowercased at the beginning of the sentence causes the first token "edward" to be marked as "O"?

If you have any issues with this, please let me know and I will be happy to help more!

TL;DR

1.) Karla gets null because that name is not in the gender dictionary.

2.) You can make your own gender mappings file with NAME\tGENDER entries. Make sure the property "gender.firstnames" is set to the path of your new gender mappings file.

3.) Make sure the gender annotator goes before the ner annotator. This should fix the problem!