Having this:
import nltk
text = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
And running:
nltk.pos_tag(text)
I get:
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]
This is incorrect. The tags for quick, brown and lazy in the sentence should be:
('quick', 'JJ'), ('brown', 'JJ'), ('lazy', 'JJ')
Testing this through their online tool gives the same result; quick, brown and lazy should be adjectives, not nouns.
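To make the mismatch explicit, the reported output can be compared against the expected tags. This is only a small sketch: the helper name mistagged is made up for illustration, and the expected dictionary lists just the three adjectives named above.

```python
def mistagged(tagged, expected):
    """Return (word, got, wanted) for every token whose tag differs from `expected`."""
    return [
        (word, tag, expected[word])
        for word, tag in tagged
        if word in expected and tag != expected[word]
    ]

# Output of nltk.pos_tag(text) as reported above.
tagged = [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'),
          ('dog', 'NN')]

# Tags expected for the three adjectives.
expected = {'quick': 'JJ', 'brown': 'JJ', 'lazy': 'JJ'}

print(mistagged(tagged, expected))
# [('quick', 'NN', 'JJ'), ('brown', 'NN', 'JJ'), ('lazy', 'NN', 'JJ')]
```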
In short:
Note:
As of NLTK version 3.1, the default pos_tag function no longer uses the old MaxEnt English pickle. It is now the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag.
It's better, but still not perfect:
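A minimal sketch of the call, assuming NLTK 3.1 or later with the perceptron tagger's model data already installed (the data package name varies between NLTK releases, so it is not hard-coded here). The wrapper tag_tokens and its tagger parameter are illustrative and not part of NLTK; the parameter is only there so a different tagger can be swapped in for comparison.

```python
def tag_tokens(tokens, tagger=None):
    """Tag a list of tokens with NLTK's default tagger unless another callable is given."""
    if tagger is None:
        # Imported lazily: needs NLTK >= 3.1 plus its perceptron tagger data.
        from nltk import pos_tag
        tagger = pos_tag
    return tagger(tokens)
```

Calling tag_tokens("The quick brown fox jumps over the lazy dog".split()) with no tagger argument goes through nltk.pos_tag; the exact tags you get back depend on the NLTK version and model installed.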
And if you just want a TL;DR solution, see https://github.com/alvations/nltk_cli

In long:
Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g.:
Using default MaxEnt POS tagger from NLTK, i.e. nltk.pos_tag:
Using Stanford POS tagger:
Using HunPOS (NOTE: the default encoding is ISO-8859-1 not UTF8):
Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):
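The external taggers listed above (Stanford, HunPOS, Senna) each have a wrapper class under nltk.tag. Here is a rough sketch of constructing them, assuming a recent NLTK 3.x. The function make_tagger and its keyword names are made up for this example, every path is a placeholder you have to supply yourself, and each backend must be downloaded and installed separately (Stanford also needs Java).

```python
def make_tagger(name, model=None, jar=None, binary=None, home=None,
                hunpos_encoding="ISO-8859-1"):
    """Build one of NLTK's wrappers around an external POS tagger.

    All paths are placeholders supplied by the caller; nothing is downloaded here.
    """
    # Imports are lazy so this function can be defined without NLTK installed.
    if name == "stanford":
        from nltk.tag.stanford import StanfordPOSTagger
        # Needs Java plus the Stanford model file and the tagger jar.
        return StanfordPOSTagger(model, path_to_jar=jar)
    if name == "hunpos":
        from nltk.tag.hunpos import HunposTagger
        # HunPOS defaults to ISO-8859-1, so the encoding is passed explicitly.
        return HunposTagger(model, path_to_bin=binary, encoding=hunpos_encoding)
    if name == "senna":
        from nltk.tag.senna import SennaTagger
        # `home` is the directory that contains the Senna executable.
        return SennaTagger(home)
    raise ValueError(f"unknown tagger: {name!r}")
```

Each returned object has a tag(tokens) method that takes a list of tokens. The HunPOS wrapper keeps a subprocess open, so call its close() method when you are done with it.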
Or try building a better POS tagger:
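Before building anything fancier, it helps to have a baseline to measure against. The following is a toy most-frequent-tag tagger, not a better tagger in itself: the function names and the three training sentences are invented for this example, and a real tagger would be trained on a proper tagged corpus and use context features.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn the most frequent tag for each lower-cased word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(model, tokens, default="NN"):
    """Tag tokens by lookup, falling back to `default` for unseen words."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

# Tiny hand-made training set, invented for this example only.
train = [
    [("the", "DT"), ("quick", "JJ"), ("cat", "NN"), ("jumps", "VBZ")],
    [("a", "DT"), ("lazy", "JJ"), ("brown", "JJ"), ("dog", "NN")],
    [("over", "IN"), ("the", "DT"), ("fence", "NN")],
]
model = train_baseline(train)
print(tag_baseline(model, "The quick brown fox jumps over the lazy dog".split()))
```

Here "fox" is unseen and gets the default NN; any real improvement over this has to come from handling unseen and ambiguous words using their context.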
Complaints about pos_tag accuracy on Stack Overflow include:
Issues about NLTK HunPos include:
Issues with NLTK and Stanford POS tagger include: