Forcing POS tags in Stanford CoreNLP

Posted 2019-09-10 08:39

Question:

Is there a way to process an already POS-tagged text using Stanford CoreNLP?

For example, I have a sentence in this format:

They_PRP are_VBP hunting_VBG dogs_NNS ._.

and I'd like to annotate it with lemma, ner, parse, etc. while forcing the given POS tags.

Update: I tried the code below, but it doesn't work.

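import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");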

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String sentText = "They_PRP are_VBP hunting_VBG dogs_NNS ._.";
List<CoreLabel> sentence = new ArrayList<>();

String[] parts = sentText.split("\\s");
for (String p : parts) {
    String[] split = p.split("_");
    CoreLabel clToken = new CoreLabel();
    clToken.setValue(split[0]);
    clToken.setWord(split[0]);
    clToken.setOriginalText(split[0]);
    clToken.set(CoreAnnotations.PartOfSpeechAnnotation.class, split[1]);
    sentence.add(clToken);
}
Annotation s = new Annotation(sentText);
s.set(CoreAnnotations.TokensAnnotation.class, sentence);

Annotation document = new Annotation(s);
pipeline.annotate(document);

Answer 1:

The POS annotations will certainly be replaced if you include the pos annotator in the pipeline.

Instead, remove the pos annotator and set the enforceRequirements option to false (-enforceRequirements false on the command line). This lets the pipeline run even though an annotator that lemma and the other annotators depend on (the pos annotator) is absent. Add the following line before instantiating the pipeline:

props.setProperty("enforceRequirements", "false");
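
As a minimal sketch of that configuration, the helper below drops pos from an annotator list and switches off requirement checking. It uses only java.util.Properties, so it does not touch CoreNLP itself; the class name PipelineProps and the method withoutPos are hypothetical, while the property keys annotators and enforceRequirements are the ones shown above.

```java
import java.util.Arrays;
import java.util.Properties;
import java.util.stream.Collectors;

final class PipelineProps {
    /**
     * Builds pipeline properties from a comma-separated annotator list,
     * leaving out "pos" and disabling requirement checks.
     * (Hypothetical helper, not part of CoreNLP.)
     */
    static Properties withoutPos(String annotators) {
        String kept = Arrays.stream(annotators.split(","))
                .map(String::trim)
                .filter(a -> !a.isEmpty() && !a.equals("pos"))
                .collect(Collectors.joining(", "));

        Properties props = new Properties();
        props.setProperty("annotators", kept);
        // Without this, the pipeline refuses to start because lemma requires pos.
        props.setProperty("enforceRequirements", "false");
        return props;
    }
}
```

The returned Properties object would then be passed to new StanfordCoreNLP(props), as in the question's code.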

Of course, behavior is undefined if you skip an annotator without supplying the annotations it would have produced, so make sure every token carries what the omitted annotator normally sets (for POSTaggerAnnotator, that is CoreAnnotations.PartOfSpeechAnnotation).
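
One detail matters when the tags come from word_TAG text: splitting on every underscore misreads tokens that contain an underscore themselves. The sketch below is plain Java with no CoreNLP dependency and assumes the tag always follows the last underscore; TaggedText, splitToken, and parse are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TaggedText {
    /** Splits one "word_TAG" token on its last underscore into {word, tag}. */
    static String[] splitToken(String token) {
        int i = token.lastIndexOf('_');
        // Reject tokens with no underscore, an empty word, or an empty tag.
        if (i <= 0 || i == token.length() - 1) {
            throw new IllegalArgumentException("Expected word_TAG, got: " + token);
        }
        return new String[] { token.substring(0, i), token.substring(i + 1) };
    }

    /** Parses a whitespace-separated tagged sentence into {word, tag} pairs. */
    static List<String[]> parse(String sentence) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            pairs.add(splitToken(token));
        }
        return pairs;
    }
}
```

Each pair can then fill a CoreLabel the way the question's loop does, with the first element as the word and the second as the PartOfSpeechAnnotation value.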