Unable to load OpenNLP sentence model in Hadoop ma

Posted 2019-07-13 06:40

I'm trying to get OpenNLP integrated into a map-reduce job on Hadoop, starting with some basic sentence splitting. Within the map function, the following code is run:

public AnalysisFile analyze(String content) {
    InputStream modelIn = null;
    String[] sentences = null;

    // references an absolute path to en-sent.bin
    logger.info("sentenceModelPath: " + sentenceModelPath);

    try {
        modelIn = getClass().getResourceAsStream(sentenceModelPath);
        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceBreaker = new SentenceDetectorME(model);
        sentences = sentenceBreaker.sentDetect(content);
    } catch (FileNotFoundException e) {
        logger.error("Unable to locate sentence model.");
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (modelIn != null) {
            try {
                modelIn.close();
            } catch (IOException e) {
            }
        }
    }

    logger.info("number of sentences: " + sentences.length);

    <snip>
}

When I run my job, I'm getting an error in the log saying "in must not be null!" (source of class throwing error), which means that somehow I can't open an InputStream to the model. Other tidbits:

  • I've verified that the model file exists in the location sentenceModelPath refers to.
  • I've added Maven dependencies for opennlp-maxent:3.0.2-incubating, opennlp-tools:1.5.2-incubating, and opennlp-uima:1.5.2-incubating.
  • Hadoop is just running on my local machine.

Most of this is boilerplate from the OpenNLP documentation. Is there something I'm missing, either on the Hadoop side or the OpenNLP side, that would cause me to be unable to read from the model?

1 answer
倾城 Initia
#2 · 2019-07-13 07:05

Your problem is the getClass().getResourceAsStream(sentenceModelPath) line. It tries to load the file from the classpath. Neither a file in HDFS nor one on the client's local file system is on the classpath at mapper / reducer runtime, so getResourceAsStream() returns null (as it does whenever a resource cannot be found). Passing that null stream to the SentenceModel constructor is what produces the "in must not be null!" error.
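To see the difference outside Hadoop, here is a minimal sketch using only the JDK. The class and method names (ResourceLookupDemo, createScratchFile, foundOnClasspath, foundOnFileSystem) are made up for illustration, and the scratch file merely stands in for en-sent.bin. It assumes the temp directory is not itself reachable from a classpath root, which is the normal case.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical demo class; not part of Hadoop or OpenNLP.
class ResourceLookupDemo {

    // Creates an empty scratch file standing in for en-sent.bin
    // and returns its absolute path.
    static String createScratchFile() {
        try {
            return Files.createTempFile("en-sent", ".bin").toAbsolutePath().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the name resolves as a classpath resource,
    // which is the only place getResourceAsStream searches.
    static boolean foundOnClasspath(String name) {
        try (InputStream in = ResourceLookupDemo.class.getResourceAsStream(name)) {
            return in != null; // a missing resource yields null, not an exception
        } catch (IOException e) {
            return false;
        }
    }

    // True if the path opens as an ordinary file on the local file system.
    static boolean foundOnFileSystem(String path) {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String path = createScratchFile();
        System.out.println("file system: " + foundOnFileSystem(path));
        System.out.println("classpath:   " + foundOnClasspath(path));
    }
}
```

The same absolute path that opens fine as a file is simply not visible to the class loader, which is why verifying that the file exists on disk does not help here.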

To get around this you have a number of options:

  • Amend your code to load the file from HDFS:

    modelIn = FileSystem.get(context.getConfiguration()).open(
                     new Path("/sandbox/corpus-analysis/nlp/en-sent.bin"));
    
  • Amend your code to load the file from the local dir, and use the -files GenericOptionsParser option (which copies the file from the local file system to HDFS, and back down to the local working directory of the running mapper / reducer):

    modelIn = new FileInputStream("en-sent.bin");
    
  • Hard-bake the file into the job jar (in the root dir of the jar), and amend your code to include a leading slash:

    modelIn = getClass().getResourceAsStream("/en-sent.bin");
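Whichever of the last two options you pick, it is worth failing with a clear message instead of handing a null stream to SentenceModel. The following is only a sketch under stated assumptions, using just the JDK: ModelStreams and open are hypothetical names, and the workDir argument stands in for the task's working directory where -files would place the file. It checks the working directory first and then falls back to the classpath root (the jar-baked case).

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Hadoop or OpenNLP.
class ModelStreams {

    // Opens fileName from workDir if it is readable there (the -files case),
    // otherwise from the classpath root (the jar-baked case).
    // Throws FileNotFoundException instead of returning null.
    static InputStream open(Path workDir, String fileName) throws IOException {
        Path local = workDir.resolve(fileName);
        if (Files.isReadable(local)) {
            return Files.newInputStream(local);
        }
        // The leading slash makes the lookup relative to the classpath root
        // rather than to this class's package.
        InputStream in = ModelStreams.class.getResourceAsStream("/" + fileName);
        if (in == null) {
            throw new FileNotFoundException(
                    fileName + " not found in " + workDir + " or at the classpath root");
        }
        return in;
    }
}
```

In a mapper you would probably call something like ModelStreams.open(Paths.get("."), "en-sent.bin") once in setup() and keep the resulting detector in a field, since loading the model for every record is wasteful.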
    
