Text corpora are often distributed as large files containing specific documents on each new line. For instance, I have a file with 10 million product reviews, one per line, and each review contains multiple sentences.
When processing such files with Stanford CoreNLP, using the command line, for instance
java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma -file test.txt
the output, whether in text or xml format, will number all sentences from 1 to n, ignoring the original line numbering that separates the documents.
I would like to keep track of the original file's line numbering (e.g. in xml format, to have an output tree like <original_line id=1>, then <sentence id=1>, then <token id=1>). Or else, to be able to reset the numbering of sentences at the start of each new line in the original file.
I have tried the answer to a similar question about Stanford's POS tagger, without success. Those options do not keep track of the original line numbers.
A quick solution could be to split the original file into multiple files, then process each of them with CoreNLP and the -filelist input option. However, for large files with millions of documents, creating millions of individual files just to preserve the original line/document numbering seems inefficient.
I suppose it would be possible to modify the source code of Stanford CoreNLP, but I am unfamiliar with Java.
Any solution that preserves the original line numbering in the output would be very helpful, whether through the command line or through example Java code.
The question is already answered, but I had the same problem and came up with a command line solution that worked for me. The trick was to specify the tokenizerFactory and give it the option tokenizeNLs=true.
It looks like this:
I've dug through the code base, and I can't find a command line flag that will help you.
I wrote some sample Java code that should do the trick.
I put this in DocPerLineProcessor.java, which I put into stanford-corenlp-full-2015-04-20. I also put a file there called sample-doc-per-line.txt, which had 4 sentences per line.
First make sure to compile:
cd stanford-corenlp-full-2015-04-20
javac -cp "*:." DocPerLineProcessor.java
Here is the command to run:
java -cp "*:." DocPerLineProcessor sample-doc-per-line.txt
The output sample-doc-per-line.txt.xml should be in the desired xml format, but each sentence now records which line number it is on.
Here is the code:
Now when I run this, the sentence tags also have the appropriate line number. So the sentences still have a global id, but you can mark which line they came from. This will also set it up so newline always ends a sentence.
Please let me know if you need any clarification or if I made any errors transcribing my code.
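For readers who only want to see the bookkeeping, here is a minimal sketch of the one-document-per-line idea using the Java standard library only. It is not the DocPerLineProcessor code referred to above and it does not call CoreNLP: the class name LineNumberedSplitter, the method names, and the naive punctuation-based splitSentences are all hypothetical stand-ins, and the place where a real pipeline would annotate each line is marked in a comment. It reads the input one line at a time, emits an <original_line> element per line, and restarts sentence numbering at 1 for each line.

```java
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class LineNumberedSplitter {

    /**
     * Hypothetical stand-in for a real sentence splitter (NOT CoreNLP).
     * Splits after '.', '!' or '?' when followed by whitespace.
     */
    static List<String> splitSentences(String line) {
        List<String> sentences = new ArrayList<>();
        for (String part : line.trim().split("(?<=[.!?])\\s+")) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Streams the input one line (one document) at a time, so the whole
     * file never has to be held in memory or split into separate files.
     */
    static void writeXml(BufferedReader reader, PrintWriter out) {
        out.println("<document>");
        int lineId = 0;
        for (Iterator<String> it = reader.lines().iterator(); it.hasNext();) {
            String line = it.next();
            lineId++;
            out.println("  <original_line id=\"" + lineId + "\">");
            // A real pipeline would annotate `line` here instead.
            int sentenceId = 0; // numbering restarts for every input line
            for (String sentence : splitSentences(line)) {
                sentenceId++;
                out.println("    <sentence id=\"" + sentenceId + "\">"
                        + escape(sentence) + "</sentence>");
            }
            out.println("  </original_line>");
        }
        out.println("</document>");
        out.flush();
    }

    public static void main(String[] args) {
        // Two made-up reviews, one per line, two sentences each.
        String sample = "Great phone. Battery lasts long.\n"
                + "Arrived late! Would not buy again.";
        writeXml(new BufferedReader(new StringReader(sample)),
                new PrintWriter(System.out));
    }
}
```

Because each line is handled independently, the same loop works for a file with millions of lines; swapping the stand-in splitter for a call into an annotation pipeline would keep the line and sentence counters unchanged.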