I'm hoping someone has experience with this, as I can't find anything online besides a 2015 bug report about the NERtagger, which is probably the same issue.
Anyway, I'm trying to batch-process text to get around the poorly performing base tagger. From what I understand, tag_sents should help.
from nltk.tag.stanford import StanfordPOSTagger
from nltk import word_tokenize
import nltk
stanford_model = 'stanford-postagger/models/english-bidirectional-distsim.tagger'
stanford_jar = 'stanford-postagger/stanford-postagger.jar'
tagger = StanfordPOSTagger(stanford_model, stanford_jar)
tagger.java_options = '-mx4096m'
text = "The quick brown fox jumps over the lazy dog."
print tagger.tag_sents(text)
Except no matter what I pass to the tag_sents method, the text gets split into characters instead of words. Does anyone know why it doesn't work properly? This works as expected:
tag(text)
I also tried splitting the sentence into tokens to see if that helped, but it got the same treatment.
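The tag_sents function takes a list of lists of strings, that is, one inner list of tokens per sentence. Passing a bare string makes it iterate over the characters instead. A useful idiom is to build that nested list from text, where text is a string, before calling tag_sents.
Another variation of what alvas said, which worked for me: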
tagger.tag_sents([[text]])
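To make the list-of-lists input concrete, here is a minimal sketch of the shape tag_sents expects, next to what it sees when handed a bare string. The helper to_sentences is hypothetical, and its regex splitting is only a crude stand-in for nltk.sent_tokenize and nltk.word_tokenize so the snippet runs without NLTK data or Java. The final call is left as a comment because it assumes the tagger object from the question.

```python
import re


def to_sentences(text):
    """Split text into sentences, then split each sentence into tokens.

    Hypothetical helper: a rough regex substitute for
    nltk.sent_tokenize / nltk.word_tokenize, for illustration only.
    """
    # Break after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Words become tokens; each punctuation mark becomes its own token.
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]


text = "The quick brown fox jumps over the lazy dog."

# Iterating over a bare string yields single characters, which is what
# tag_sents iterates over when it is passed `text` directly.
as_chars = list(text)[:3]

# One inner list of tokens per sentence: the shape tag_sents expects.
sents = to_sentences(text)

# With the tagger built in the question, the call would then be:
# tagger.tag_sents(sents)
```

In real use, swapping the regex helper for NLTK's own tokenizers should give better token boundaries, since the tagger models are trained on text tokenized in a particular way.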