I'm using the Stanford Named Entity Recognizer http://nlp.stanford.edu/software/CRF-NER.shtml and it's working fine. This is
List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
for (CoreLabel word : sentence) {
if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
namedEntities.add(word.word().trim());
}
}
}
However the problem I'm finding is identifying names and surnames. If the recognizer encounters "Joe Smith", it is returning "Joe" and "Smith" separately. I'd really like it to return "Joe Smith" as one term.
Could this be achieved through the recognizer maybe through a configuration? I didn't find anything in the javadoc till now.
Thanks!
Make use of the classifiers already provided to you. I believe this is what you are looking for:
I just wrote a small logic and it's working fine. what I did is group words with same label if they are adjacent.
Code for the above:
The counterpart of the classifyToCharacterOffsets method is that (AFAIK) you can't access the label of the entities.
As proposed by Christopher, here is an example of a loop which assembles "adjacent non-O things". This example also counts the number of occurrences.
Another approach to deal with multi words entities. This code combines multiple tokens together if they have the same annotation and go in a row.
Restriction:
If the same token has two different annotations, the last one will be saved.
Imports:
Here is my full code, I use Stanford core NLP and write algorithm to concatenate Multi Term names.