I would like to use spaCy's POS tagging, NER, and dependency parsing without its word tokenization. My input is a list of tokens representing a sentence, and I would like to respect the user's tokenization. Is this possible at all, either with spaCy or with any other NLP package?
For now, I am using this spaCy-based function to put a sentence (a unicode string) into CoNLL format:
import spacy

nlp = spacy.load('en')

def toConll(string_doc, nlp):
    doc = nlp(string_doc)
    block = []
    for i, word in enumerate(doc):
        # The root token is its own head; it gets head index 0.
        if word.head == word:
            head_idx = 0
        else:
            # Convert the 0-based token index to a 1-based one.
            head_idx = word.head.i - doc[0].i + 1
        head_idx = str(head_idx)
        line = [str(i+1), str(word), word.lemma_, word.tag_,
                word.ent_type_, head_idx, word.dep_]
        block.append(line)
    return block

conll_format = toConll(u"Donald Trump is the new president of the United States of America", nlp)
[['1', 'Donald', u'donald', u'NNP', u'PERSON', '2', u'compound'],
['2', 'Trump', u'trump', u'NNP', u'PERSON', '3', u'nsubj'],
['3', 'is', u'be', u'VBZ', u'', '0', u'ROOT'],
['4', 'the', u'the', u'DT', u'', '6', u'det'],
['5', 'new', u'new', u'JJ', u'', '6', u'amod'],
['6', 'president', u'president', u'NN', u'', '3', u'attr'],
['7', 'of', u'of', u'IN', u'', '6', u'prep'],
['8', 'the', u'the', u'DT', u'GPE', '10', u'det'],
['9', 'United', u'united', u'NNP', u'GPE', '10', u'compound'],
['10', 'States', u'states', u'NNP', u'GPE', '7', u'pobj'],
['11', 'of', u'of', u'IN', u'GPE', '10', u'prep'],
['12', 'America', u'america', u'NNP', u'GPE', '11', u'pobj']]
I would like to do the same, but with a list of tokens as input instead of a string.
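To make the goal concrete, here is a rough sketch of the kind of thing I am after. It assumes spaCy v2 or later, where `Doc` accepts a `words` argument and `nlp.pipeline` is a list of `(name, component)` pairs; I am not sure this is the intended way to do it. The helper names `parse_pretokenized` and `rows_from_tokens` are made up for illustration.

```python
def parse_pretokenized(tokens, nlp):
    """Build a Doc from an existing token list and run the pipeline on it."""
    # Imported here so the row-building helper below works without spaCy.
    from spacy.tokens import Doc

    # Passing `words` keeps the caller's tokenization; the tokenizer is not run.
    doc = Doc(nlp.vocab, words=tokens)
    # Assumption: spaCy v2+, where nlp.pipeline holds (name, component) pairs.
    for _name, component in nlp.pipeline:
        doc = component(doc)
    return doc


def rows_from_tokens(doc):
    """Turn parsed tokens into rows with the same columns as toConll."""
    rows = []
    for i, word in enumerate(doc):
        # The root token is its own head and gets head index 0.
        head_idx = 0 if word.head.i == word.i else word.head.i + 1
        rows.append([str(i + 1), word.text, word.lemma_, word.tag_,
                     word.ent_type_, str(head_idx), word.dep_])
    return rows
```

With a loaded model, the intended usage would be something like `rows_from_tokens(parse_pretokenized(["Donald", "Trump", "is", "the", "new", "president"], nlp))`, so that the rows line up one-to-one with the tokens I pass in.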