
Is there a bigram or trigram feature in spaCy?

Posted 2020-07-11 07:58

Question:

The code below breaks the sentence into individual tokens, and the output is as below:

 "Cloud" "computing" "is" "benefiting" "major" "manufacturing" "companies"


import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

What I would ideally want is to read "cloud computing" as a single token, since it is effectively one term.

Basically, I am looking for bigrams. Is there any feature in spaCy that supports bigrams or trigrams?

Answer 1:

spaCy supports detection of noun chunks. So, to parse your noun phrases as single entities, do this:

1. Detect the noun chunks: https://spacy.io/usage/linguistic-features#noun-chunks

2. Merge the noun chunks with Doc.retokenize.

3. Work with the doc again; "cloud computing" is now parsed as a single entity.

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> with doc.retokenize() as retokenizer:
...     for noun_phrase in doc.noun_chunks:
...         retokenizer.merge(noun_phrase)
... 
>>> [(token.text, token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
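If you just want the raw bigrams or trigrams themselves rather than merged tokens, note that spaCy has no built-in n-gram extractor; a minimal sketch using plain Span slicing over the Doc could look like this (the ngrams helper below is an illustration, not a spaCy API):

import spacy

def ngrams(doc, n):
    # every contiguous run of n tokens, returned as spaCy Span objects
    return [doc[i : i + n] for i in range(len(doc) - n + 1)]

nlp = spacy.load("en_core_web_sm")
doc = nlp("Cloud computing is benefiting major manufacturing companies")
print([span.text for span in ngrams(doc, 2)])
# ['Cloud computing', 'computing is', 'is benefiting', 'benefiting major',
#  'major manufacturing', 'manufacturing companies']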


Answer 2:

If you have a spaCy Doc, you can pass it to textacy:

import textacy.extract

# bigrams that occur at least twice in the doc
ngrams = list(textacy.extract.ngrams(doc, 2, min_freq=2))
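For context, a fuller runnable sketch (assuming textacy is installed; by default textacy filters out n-grams containing stop words or punctuation, and min_freq=2 keeps only n-grams whose text occurs at least twice, so drop it to get every bigram):

import spacy
import textacy.extract

nlp = spacy.load("en_core_web_sm")
doc = nlp("Cloud computing is cheap because cloud computing scales")

# bigrams whose (lowercased) text occurs at least twice in the doc
bigrams = list(textacy.extract.ngrams(doc, 2, min_freq=2))
print(bigrams)  # expect both "cloud computing" spans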


Answer 3:

I had a similar problem (bigrams, trigrams, like your "cloud computing"). I made a simple list of the n-grams (word_2gram, word_3gram, etc.), with the gram as the basic unit (cloud_computing).

Assume I have the sentence "I like cloud computing because it's cheap". Its bigrams are: "I_like", "like_cloud", "cloud_computing", "computing_because", ... Comparing these against your bigram list, only "cloud_computing" is recognized as a valid bigram; all other bigrams in the sentence are artificial. To recover all the other words, you just take the first part of each:

"I_like".split("_")[0] -> I; 
"like_cloud".split("_")[0] -> like
"cloud_computing" -> in bigram list, keep it. 
  skip next bi-gram "computing_because" ("computing" is already used)
"because_it's".split("_")[0]" -> "because" etc.

To also capture the last word in the sentence ("cheap"), I added an "EOL" token. I implemented this in Python, and the speed was OK (500k words in 3 minutes on an i5 with 8 GB RAM). Anyway, you only have to do it once. I find this more intuitive than the official (spaCy-style) chunk approach, and it also works with non-spaCy frameworks.
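In code, the greedy pass might look something like this (a minimal sketch of the idea above, not the original code; valid_bigrams stands in for the precomputed bigram list):

def recover_tokens(words, valid_bigrams):
    # append the EOL sentinel so the last real word is still visited
    words = words + ["EOL"]
    out = []
    i = 0
    while i < len(words) - 1:  # stop before the sentinel
        bigram = words[i] + "_" + words[i + 1]
        if bigram in valid_bigrams:
            out.append(bigram)    # keep the known bigram as one unit
            i += 2                # its second word is already used, skip it
        else:
            out.append(words[i])  # take the first part of the artificial bigram
            i += 1
    return out

words = "I like cloud computing because it's cheap".split()
print(recover_tokens(words, {"cloud_computing"}))
# ['I', 'like', 'cloud_computing', 'because', "it's", 'cheap']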

I do this before the official tokenization/lemmatization, since otherwise you would get "cloud compute" as a possible bigram. But I'm not certain this is the best/right approach. Suggestions?

Andreas

PS: Drop a line if you want the full code; I'll sanitize it and put it up here (and maybe on GitHub).