Multi-term named entities in Stanford Named Entity Recognizer

Posted 2019-03-09 07:53

I'm using the Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) and it's working fine. This is my code:

    List<List<CoreLabel>> out = classifier.classify(text);
    for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
            if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
                namedEntities.add(word.word().trim());           
            }
        }
    }

However, the problem I'm running into is identifying first names and surnames together. If the recognizer encounters "Joe Smith", it returns "Joe" and "Smith" separately. I'd really like it to return "Joe Smith" as one term.

Can this be achieved through the recognizer itself, maybe through a configuration option? I haven't found anything in the javadoc so far.

Thanks!

8 answers
戒情不戒烟
Reply #2 · 2019-03-09 07:59

Make use of the classifiers already provided to you. I believe this is what you are looking for:

    private static String combineNERSequence(String text) {

        String serializedClassifier = "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz";
        AbstractSequenceClassifier<CoreLabel> classifier;
        try {
            classifier = CRFClassifier.getClassifier(serializedClassifier);
        } catch (ClassCastException | ClassNotFoundException | IOException e) {
            // Without a loaded model there is nothing to classify with,
            // so fail here instead of hitting a NullPointerException later.
            throw new IllegalStateException("Could not load " + serializedClassifier, e);
        }

        // Inline XML wraps each multi-token entity in a single tag,
        // e.g. <PERSON>Joe Smith</PERSON>. Classify once and reuse the result.
        String tagged = classifier.classifyWithInlineXML(text);
        System.out.println(tagged);

        //  FOR TSV FORMAT  //
        // System.out.print(classifier.classifyToString(text, "tsv", false));

        return tagged;
    }
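
The inline-XML string returned above already groups multi-token entities under one tag, so one way to get them back as a list is a small regular expression. This is only a minimal sketch, assuming the output has the shape `<PERSON>Joe Smith</PERSON>` and that the input text contains no angle-bracket markup of its own. The class name `InlineXmlEntities` and the sample string are made up for illustration; the string was not produced by running the model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineXmlEntities {

    // Matches <LABEL>text</LABEL>; the back-reference \1 requires the
    // closing tag to carry the same label as the opening tag.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    /** Returns each tagged span as "text/LABEL", in order of appearance. */
    static List<String> extract(String inlineXml) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(m.group(2) + "/" + m.group(1));
        }
        return entities;
    }

    public static void main(String[] args) {
        // Hypothetical tagger output, written by hand for this example
        String tagged = "<PERSON>Joe Smith</PERSON> lives in <LOCATION>California</LOCATION>.";
        System.out.println(extract(tagged)); // [Joe Smith/PERSON, California/LOCATION]
    }
}
```

In real use, the argument to `extract` would be the string returned by `combineNERSequence(text)`.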
混吃等死
Reply #3 · 2019-03-09 08:00
List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
    String s = "";
    String prevLabel = null;
    for (CoreLabel word : sentence) {
        String label = word.get(CoreAnnotations.AnswerAnnotation.class);
        if (prevLabel == null || prevLabel.equals(label)) {
            // Same label as the previous token: extend the current group
            s = s + " " + word.word();
        } else {
            // Label changed: print the finished group unless it is non-entity text
            if (!prevLabel.equals("O"))
                System.out.println(s.trim() + '/' + prevLabel);
            s = word.word();
        }
        prevLabel = label;
    }
    // Print a group that runs to the end of the sentence
    // (prevLabel is still null if the sentence had no tokens)
    if (prevLabel != null && !prevLabel.equals("O"))
        System.out.println(s.trim() + '/' + prevLabel);
}

I just wrote a small piece of logic and it's working fine. What I did is group adjacent words that have the same label.
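
To make the grouping idea easy to try without loading a model, here is a minimal sketch of the same logic on plain arrays. It assumes the tokens and their labels have already been pulled out of the `CoreLabel` objects. The class name `AdjacentLabelGrouper` and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class AdjacentLabelGrouper {

    /** Joins runs of adjacent tokens that share a non-"O" label into "text/LABEL" strings. */
    static List<String> group(String[] words, String[] labels) {
        List<String> entities = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        String runLabel = "O";

        for (int i = 0; i < words.length; i++) {
            if (!labels[i].equals(runLabel)) {
                // The label changed: store the finished run and start a new one
                flush(entities, run, runLabel);
                runLabel = labels[i];
            }
            if (!"O".equals(labels[i])) {
                if (run.length() > 0) run.append(' ');
                run.append(words[i]);
            }
        }
        flush(entities, run, runLabel); // a run may reach the end of the sentence
        return entities;
    }

    // Adds the current run to the output (if there is one) and clears it.
    private static void flush(List<String> out, StringBuilder run, String label) {
        if (run.length() > 0) {
            out.add(run + "/" + label);
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Hand-written labels for illustration, not real classifier output
        String[] words  = {"Joe", "Smith", "visited", "New", "York"};
        String[] labels = {"PERSON", "PERSON", "O", "LOCATION", "LOCATION"};
        System.out.println(group(words, labels)); // [Joe Smith/PERSON, New York/LOCATION]
    }
}
```

Note that this heuristic cannot separate two distinct entities of the same type that happen to be directly adjacent; they are merged into one.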

再贱就再见
Reply #4 · 2019-03-09 08:02

Code for the above:

List<Triple<String, Integer, Integer>> result = classifier.classifyToCharacterOffsets(text);

for (Triple<String, Integer, Integer> triple : result)
{
    System.out.println(triple.first + " : " + text.substring(triple.second, triple.third));
}
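
For readers who want to see what the offsets buy you, here is a small sketch that slices entity text out of the original string by (label, begin, end). The `substring` call in the snippet above implies the end offset is exclusive, and this sketch assumes that. `Span` is a hypothetical stand-in for `Triple<String, Integer, Integer>`, and the offsets are written by hand rather than taken from a classifier run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OffsetSlicer {

    /** Hypothetical stand-in for Triple<String, Integer, Integer>: label, begin, end (exclusive). */
    static final class Span {
        final String label;
        final int begin;
        final int end;

        Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Formats each span as "LABEL : text", taking the text straight from the input string. */
    static List<String> slice(String text, List<Span> spans) {
        List<String> out = new ArrayList<>();
        for (Span s : spans) {
            out.add(s.label + " : " + text.substring(s.begin, s.end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Joe Smith lives in California.";
        // Hand-computed offsets for illustration
        List<Span> spans = Arrays.asList(new Span("PERSON", 0, 9), new Span("LOCATION", 19, 29));
        System.out.println(slice(text, spans)); // [PERSON : Joe Smith, LOCATION : California]
    }
}
```

Because the entity text is cut from the original string, whatever spacing and punctuation the input had inside the entity is preserved, which is not the case when tokens are re-joined with a single space.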
仙女界的扛把子
Reply #5 · 2019-03-09 08:05

The drawback of the classifyToCharacterOffsets method is that (AFAIK) you can't access the label of the entities.

As proposed by Christopher, here is an example of a loop which assembles "adjacent non-O things". This example also counts the number of occurrences.

public HashMap<String, HashMap<String, Integer>> extractEntities(String text) {

    // label -> (entity text -> number of occurrences)
    HashMap<String, HashMap<String, Integer>> entities =
            new HashMap<String, HashMap<String, Integer>>();

    for (List<CoreLabel> lcl : classifier.classify(text)) {

        String answer = "O";   // label of the run currently being assembled
        String value = "";     // tokens of that run, joined by spaces

        for (CoreLabel cl : lcl) {
            String label = cl.getString(CoreAnnotations.AnswerAnnotation.class);
            String token = cl.getString(CoreAnnotations.ValueAnnotation.class);

            if (label.equals(answer)) {
                // Same label as the previous token: extend the run
                if (!answer.equals("O"))
                    value = value + " " + token;
                continue;
            }

            // The label changed: count the finished run, then start a new one
            // (computeIfAbsent and merge require Java 8 or later)
            if (!answer.equals("O"))
                entities.computeIfAbsent(answer, k -> new HashMap<String, Integer>())
                        .merge(value, 1, Integer::sum);

            answer = label;
            value = token;
        }

        // Count a run that reaches the end of the sentence
        if (!answer.equals("O"))
            entities.computeIfAbsent(answer, k -> new HashMap<String, Integer>())
                    .merge(value, 1, Integer::sum);
    }

    return entities;
}
爱情/是我丢掉的垃圾
Reply #6 · 2019-03-09 08:07

Another approach for dealing with multi-word entities. This code combines consecutive tokens into a single entity if they have the same annotation.

Restriction:
If the same token text is given two different annotations, only the last one is saved.

private Document getEntities(String fullText) {

    Document entitiesList = new Document();
    NERClassifierCombiner nerCombClassifier = loadNERClassifiers();

    if (nerCombClassifier != null) {

        List<List<CoreLabel>> results = nerCombClassifier.classify(fullText);

        for (List<CoreLabel> coreLabels : results) {

            String prevLabel = null;
            String prevToken = null;

            for (CoreLabel coreLabel : coreLabels) {

                String word = coreLabel.word();
                String annotation = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);

                if (!"O".equals(annotation)) {

                    if (prevLabel == null) {
                        prevLabel = annotation;
                        prevToken = word;
                    } else {

                        if (prevLabel.equals(annotation)) {
                            prevToken += " " + word;
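                        } else {
                            // A different entity type starts immediately:
                            // save the finished entity before starting the new one.
                            entitiesList.put(prevToken, prevLabel);
                            prevLabel = annotation;
                            prevToken = word;
                        }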
                    }
                } else {

                    if (prevLabel != null) {
                        entitiesList.put(prevToken, prevLabel);
                        prevLabel = null;
                    }
                }
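            }

            // Save an entity that runs to the end of the sentence.
            if (prevLabel != null) {
                entitiesList.put(prevToken, prevLabel);
            }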
        }
    }

    return entitiesList;
}

Imports:

Document: org.bson.Document;
NERClassifierCombiner: edu.stanford.nlp.ie.NERClassifierCombiner;
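
The restriction noted in this answer follows from storing entities in a map keyed by the entity text: a second `put` with the same key replaces the first label. The sketch below shows that behaviour with a plain `LinkedHashMap` standing in for `org.bson.Document`, along with one possible alternative that keeps every label seen. The class `EntityLabelStore` and its `add` helper are hypothetical and not part of the answer's code.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class EntityLabelStore {

    /** Records a label for an entity, keeping any labels already stored for the same text. */
    static void add(Map<String, Set<String>> store, String entity, String label) {
        store.computeIfAbsent(entity, k -> new LinkedHashSet<>()).add(label);
    }

    public static void main(String[] args) {
        // Single-valued map: the second put overwrites the first label
        Map<String, String> single = new LinkedHashMap<>();
        single.put("Washington", "PERSON");
        single.put("Washington", "LOCATION");
        System.out.println(single); // {Washington=LOCATION}

        // Multi-valued map: both labels survive, in insertion order
        Map<String, Set<String>> multi = new LinkedHashMap<>();
        add(multi, "Washington", "PERSON");
        add(multi, "Washington", "LOCATION");
        System.out.println(multi); // {Washington=[PERSON, LOCATION]}
    }
}
```

Whether the extra labels are worth keeping depends on what the entity list is used for; for a simple lookup, the last-one-wins behaviour may be perfectly acceptable.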
够拽才男人
Reply #7 · 2019-03-09 08:14

Here is my full code. I use Stanford CoreNLP and wrote an algorithm to concatenate multi-term names.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.log4j.Logger;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/**
 * Created by Chanuka on 8/28/14 AD.
 */
public class FindNameEntityTypeExecutor {

private static Logger logger = Logger.getLogger(FindNameEntityTypeExecutor.class);

private StanfordCoreNLP pipeline;

public FindNameEntityTypeExecutor() {
    logger.info("Initializing Annotator pipeline ...");

    Properties props = new Properties();

    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");

    pipeline = new StanfordCoreNLP(props);

    logger.info("Annotator pipeline initialized");
}

List<String> findNameEntityType(String text, String entity) {
    logger.info("Finding entity type matches in the " + text + " for entity type, " + entity);

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
    List<String> matches = new ArrayList<String>();

    for (CoreMap sentence : sentences) {

        int previousCount = 0;
        int count = 0;
        // traversing the words in the current sentence
        // a CoreLabel is a CoreMap with additional token-specific methods

        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            String word = token.get(CoreAnnotations.TextAnnotation.class);

            int previousWordIndex;
            if (entity.equals(token.get(CoreAnnotations.NamedEntityTagAnnotation.class))) {
                count++;
                if (previousCount != 0 && (previousCount + 1) == count) {
                    previousWordIndex = matches.size() - 1;
                    String previousWord = matches.get(previousWordIndex);
                    matches.remove(previousWordIndex);
                    previousWord = previousWord.concat(" " + word);
                    matches.add(previousWordIndex, previousWord);

                } else {
                    matches.add(word);
                }
                previousCount = count;
            } else {
                count = 0;
                previousCount = 0;
            }
        }

    }
    return matches;
}
}