My input.txt file contains the following sample text:
you have to let's
come and see me.
Now if I invoke the Stanford POS tagger with the default command:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile input.txt > output.txt
I get the following in my output.txt file:
you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._.
The problem with the above output is that I have lost my original newline delimiter used in the input file.
To preserve the newline sentence delimiter in the output file, I have to set the -tokenize option to false, as in the following command:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -tokenize false -textFile input.txt > output.txt
The problem with this command is that it messes up the tagging:
you_PRP have_VBP to_TO let's_NNS
come_VB and_CC see_VB me._NN
Here let's and me. are no longer split into separate tokens, so each one is tagged as a single word.
My question is: how can I preserve the newline delimiters in the output file without messing up the tokenization?
The answer should have been to use the command:
But there was a bug and it didn't work (ignored the newlines) in version 3.1.3 (and perhaps all earlier versions). It will work in version 3.1.4+.
In the meantime, if the amount of text is small, you might try using the Stanford Parser (where the corresponding flag is named differently: -sentences newline).
One thing you can do is use XML input instead of plain text. Your input in that case will be:
Here each line is enclosed in a line tag. You can now issue the following command:
Note that the value passed to '-xmlInput' specifies the tag whose content is POS tagged. In our case, this tag is line. When you run the above command the output will be:
Thus you can separate out your lines by reading content enclosed in the line tags.
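The two manual steps of the XML approach (wrapping each input line in a line tag, then reading the tagged content back out of those tags) could be scripted roughly as below. This is only a sketch under assumptions: the helper names wrap_lines and extract_lines and the root element name doc are made up for illustration, and it assumes the tagger writes the line tags back into its output unchanged.

```python
import re
from xml.sax.saxutils import escape


def wrap_lines(text, tag="line", root="doc"):
    """Wrap every non-empty line of `text` in <tag>...</tag>.

    The root element name is an assumption made for this sketch;
    only the per-line tag matters for the approach described above.
    """
    body = [
        f"<{tag}>{escape(line)}</{tag}>"  # escape &, < and > in the raw text
        for line in text.splitlines()
        if line.strip()
    ]
    return "\n".join([f"<{root}>", *body, f"</{root}>"]) + "\n"


def extract_lines(tagged, tag="line"):
    """Return the content of each <tag> element, one list entry per input line.

    Assumes the tagger output still contains the <tag> elements verbatim.
    """
    pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
    # Collapse internal whitespace so each original line comes back as one string.
    return [" ".join(content.split()) for content in pattern.findall(tagged)]


if __name__ == "__main__":
    sample = "you have to let's\ncome and see me.\n"
    print(wrap_lines(sample), end="")
```

You would write the result of wrap_lines to the file you pass to the tagger, and later call extract_lines on the contents of the tagger's output file to get one tagged string per original line.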