How to Preserve Original Line Numbering in the Output of Stanford CoreNLP

Posted 2019-08-06 05:54

Text corpora are often distributed as large files with one document per line. For instance, I have a file with 10 million product reviews, one per line, and each review contains multiple sentences.

When processing such files with Stanford CoreNLP, using the command line, for instance

java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma -file test.txt

the output, whether in text or XML format, numbers all sentences from 1 to n, ignoring the original line numbering that separates the documents.

I would like to keep track of the original file's line numbering (e.g. in XML format, to have an output tree like <original_line id=1>, then <sentence id=1>, then <token id=1>). Alternatively, I would like to reset the sentence numbering at the start of each new line of the original file.

I have tried the answer to a similar question about Stanford's POS tagger, without success. Those options do not keep track of the original line numbers.

A quick solution could be to split the original file into multiple files, then process each of them with CoreNLP and the -filelist input option. However, for large files with millions of documents, creating millions of individual files just to preserve the original line/document numbering seems inefficient.

I suppose it would be possible to modify the source code of Stanford CoreNLP, but I am unfamiliar with Java.

Any solution that preserves the original line numbering in the output would be very helpful, whether through the command line or as example Java code.

2 Answers
够拽才男人
#2 · 2019-08-06 06:22

The question is already answered, but I had the same problem and came up with a command line solution that worked for me. The trick was to specify the tokenizerFactory and give it the option tokenizeNLs=true.

It looks like this:

java -mx1g -cp stanford-corenlp-3.6.0.jar:slf4j-api.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier english.conll.4class.distsim.normal.tagger -outputFormat slashTags -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerOptions "tokenizeNLs=true" -textFile untagged_lines.txt > tagged_lines.txt
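
As far as I understand it, the point of tokenizeNLs is that line breaks are kept as tokens instead of being thrown away as whitespace, so the line boundaries can still be recovered from the token stream. Here is a small JDK-only illustration of that idea. It is not CoreNLP's implementation; the class name and the marker string are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class NewlineAwareTokenizer {
    // Hypothetical marker for a line break; not necessarily the token CoreNLP emits.
    static final String NEWLINE_MARKER = "<NL>";

    /** Splits text on spaces and tabs, emitting NEWLINE_MARKER at each line break. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String[] lines = text.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            for (String tok : lines[i].trim().split("[ \t]+")) {
                if (!tok.isEmpty()) {
                    tokens.add(tok);
                }
            }
            // keep the line boundary as its own token (none after the last line)
            if (i < lines.length - 1) {
                tokens.add(NEWLINE_MARKER);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [good, phone, <NL>, bad, battery]
        System.out.println(tokenize("good phone\nbad battery"));
    }
}
```

Counting the markers seen so far gives the line a token came from, which is the same information the tagged output keeps by preserving the line breaks.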
我命由我不由天
#3 · 2019-08-06 06:45

I've dug through the code base, and I can't find a command line flag that will help you.

I wrote some sample Java code that should do the trick.

I saved it as DocPerLineProcessor.java inside the stanford-corenlp-full-2015-04-20 directory. In the same directory I put a test file, sample-doc-per-line.txt, with 4 sentences per line.

First make sure to compile:

cd stanford-corenlp-full-2015-04-20

javac -cp "*:." DocPerLineProcessor.java

Here is the command to run:

java -cp "*:." DocPerLineProcessor sample-doc-per-line.txt

The output file, sample-doc-per-line.txt.xml, should be in the desired XML format, except that each sentence now also records the line number it came from.

Here is the code:

import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*; 
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class DocPerLineProcessor {
    public static void main (String[] args) throws IOException {
        // set up properties
        Properties props = new Properties();
        props.setProperty("annotators",
            "tokenize, ssplit, pos, lemma, ner, parse");
        // set up pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // read in a product review per line
        Iterable<String> lines = IOUtils.readLines(args[0]);
        Annotation mainAnnotation = new Annotation("");
        // add a blank list to put sentences into
        List<CoreMap> blankSentencesList = new ArrayList<CoreMap>();
        mainAnnotation.set(CoreAnnotations.SentencesAnnotation.class,blankSentencesList);
        // process each product review
        int lineNumber = 1;
        for (String line : lines) {
            Annotation annotation = new Annotation(line);
            pipeline.annotate(annotation);
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                sentence.set(CoreAnnotations.LineNumberAnnotation.class,lineNumber);
                mainAnnotation.get(CoreAnnotations.SentencesAnnotation.class).add(sentence);
            }
            lineNumber += 1;
        }
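        // write the combined annotation as XML, then close the writer so the output is flushed
        PrintWriter xmlOut = new PrintWriter(args[0]+".xml");
        pipeline.xmlPrint(mainAnnotation, xmlOut);
        xmlOut.close();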
    }
}

Now when I run this, the sentence tags also carry the appropriate line number. The sentences still have a global id, but you can tell which line each one came from. Because each line is annotated separately, a newline always ends a sentence.
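
If you would rather have sentence ids that restart on each line, you can derive them afterwards from the line numbers. Here is a minimal sketch using only the JDK. The class and method names are made up for illustration, and it assumes you have already pulled out the line number of each sentence, in global sentence order, from the annotations or the XML.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SentenceRenumbering {
    /**
     * Given the line number of each sentence, in global sentence order,
     * returns a per-line sentence id that restarts at 1 on every new line.
     */
    public static List<Integer> localIds(List<Integer> lineNumbers) {
        List<Integer> ids = new ArrayList<>();
        int previousLine = -1;  // sentinel: no line seen yet
        int counter = 0;
        for (int line : lineNumbers) {
            if (line != previousLine) {
                counter = 0;    // new source line: restart the numbering
                previousLine = line;
            }
            counter += 1;
            ids.add(counter);
        }
        return ids;
    }

    public static void main(String[] args) {
        // hypothetical input: two sentences from line 1, one from line 2, three from line 4
        List<Integer> lines = Arrays.asList(1, 1, 2, 4, 4, 4);
        // prints [1, 2, 1, 1, 2, 3]
        System.out.println(localIds(lines));
    }
}
```

This relies on the sentences of one line being contiguous, which holds for the code above because lines are processed in order.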

Please let me know if you need any clarification or if I made any errors transcribing my code.
