Does spacy take as input a list of tokens?

I would like to use spacy's POS tagging, NER, and dependency parsing without using word tokenization. Indeed, my input is a list of tokens representing a sentence, and I would like to respect the user's tokenization. Is this possible at all, either with spacy or any other NLP package ?

For now, I am using this spacy-based function to put a sentence (a unicode string) in the Conll format:

import spacy
nlp = spacy.load('en')
def toConll(string_doc, nlp):
   doc = nlp(string_doc)
   block = []
   for i, word in enumerate(doc):
          if word.head == word:
                  head_idx = 0
          else:
                  head_idx = word.head.i - doc[0].i + 1
          head_idx = str(head_idx)
          line = [str(i+1), str(word), word.lemma_, word.tag_,
                      word.ent_type_, head_idx, word.dep_]
          block.append(line)
   return block
conll_format = toConll(u"Donald Trump is the new president of the United States of America")

Output:
[['1', 'Donald', u'donald', u'NNP', u'PERSON', '2', u'compound'],
 ['2', 'Trump', u'trump', u'NNP', u'PERSON', '3', u'nsubj'],
 ['3', 'is', u'be', u'VBZ', u'', '0', u'ROOT'],
 ['4', 'the', u'the', u'DT', u'', '6', u'det'],
 ['5', 'new', u'new', u'JJ', u'', '6', u'amod'],
 ['6', 'president', u'president', u'NN', u'', '3', u'attr'],
 ['7', 'of', u'of', u'IN', u'', '6', u'prep'],
 ['8', 'the', u'the', u'DT', u'GPE', '10', u'det'],
 ['9', 'United', u'united', u'NNP', u'GPE', '10', u'compound'],
 ['10', 'States', u'states', u'NNP', u'GPE', '7', u'pobj'],
 ['11', 'of', u'of', u'IN', u'GPE', '10', u'prep'],
 ['12', 'America', u'america', u'NNP', u'GPE', '11', u'pobj']]

I would like to do the same while having as input a list of tokens...

标签： python-2.7 tokenize spacy dependency-parsing

1条回答

聊天终结者

2楼-- · 2019-06-17 02:43

You can run Spacy's processing pipeline against already tokenised text. You need to understand, though, that the underlying statistical models have been trained on a reference corpus that has been tokenised using some strategy and if your tokenisation strategy is significantly different, you may expect some performance degradation.

Here's how to go about it using Spacy 2.0.5 and Python 3. If using Python 2, you may need to use unicode literals.

import spacy; nlp = spacy.load('en_core_web_sm')
# spaces is a list of boolean values indicating if subsequent tokens
# are followed by any whitespace
# so, create a Spacy document with your tokenisation
doc = spacy.tokens.doc.Doc(
    nlp.vocab, words=['nuts', 'itch'], spaces=[True, False])
# run the standard pipeline against it
for name, proc in nlp.pipeline:
    doc = proc(doc)

0人赞添加讨论(0) 举报

Does spacy take as input a list of tokens?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间