The nltk package's built-in part-of-speech tagger does not seem to be optimized for my use-case (here, for instance). The source code here shows that it's using a saved, pre-trained classifier called maxent_treebank_pos_tagger.
What created maxent_treebank_pos_tagger/english.pickle? I'm guessing that there is a tagged corpus out there somewhere that was used to train this tagger, so I think I'm looking for (a) that tagged corpus and (b) the exact code that trains the tagger based on the tagged corpus.
In addition to lots of googling, I have so far tried looking at the .pickle object directly to find any clues inside it, starting like this:
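from nltk.data import load

# Load the pre-trained tagger that ships with the NLTK data
x = load("nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle")
# List the object's attributes and methods to look for clues
print(dir(x))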