A k-skipgram is an ngram which is a superset of all ngrams and of each (k-i)-skipgram down to (k-i)==0 (which includes 0-skip-grams). So how do I efficiently compute these skipgrams in Python?
Following is the code I tried, but it is not doing what I expected:
<pre>
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_skipgrams(input_list, N, K):
    bigram_list = []
    nlist = []
    K = 1
    for k in range(K + 1):
        for i in range(len(input_list) - 1):
            if i + k + 1 < len(input_list):
                nlist = []
                for j in range(N + 1):
                    if i + k + j + 1 < len(input_list):
                        nlist.append(input_list[i + k + j + 1])
            bigram_list.append(nlist)
    return bigram_list
</pre>
The above code is not rendering correctly, but print find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'], 2, 1) gives the following output:
[['this', 'happened', 'more'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['less']]
The code listed here also does not give correct output: https://github.com/heaven00/skipgram/blob/master/skipgram.py
print skipgram_ndarray("What is your name") gives: ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']
name is a unigram!
From the paper that OP links, the following string:
Yields:
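Assuming the paper in question is Guthrie et al.'s "A Closer Look at Skip-gram Modelling" (an assumption on my part, since the quoted string did not survive formatting), its example sentence is "Insurgents killed in ongoing fighting", whose nine 2-skip-bi-grams can be reproduced with a brute-force sketch over index combinations:

```python
from itertools import combinations

def kskipngrams(tokens, n, k):
    # Brute force: every choice of n token positions whose total span
    # fits inside a window of n + k positions counts as a k-skip-n-gram.
    return [tuple(tokens[i] for i in idxs)
            for idxs in combinations(range(len(tokens)), n)
            if idxs[-1] - idxs[0] < n + k]

sent = "Insurgents killed in ongoing fighting".split()
for gram in kskipngrams(sent, n=2, k=2):
    print(' '.join(gram))
# prints the paper's nine 2-skip-bi-grams,
# from "Insurgents killed" to "ongoing fighting"
```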
With slight modification to NLTK's ngrams code (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):

Let's do some doctests to match the example in the paper:
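The modified snippet itself was lost in formatting; here is a self-contained sketch along the same lines (the minimal ngrams stand-in below replaces nltk.util.ngrams and is my own paraphrase, not the answer's original code):

```python
from itertools import combinations

def ngrams(sequence, n, pad_right=False, right_pad_symbol=None):
    # Minimal stand-in for nltk.util.ngrams: slide a window of n items,
    # optionally right-padding so every token starts one full window.
    seq = list(sequence)
    if pad_right:
        seq = seq + [right_pad_symbol] * (n - 1)
    return zip(*[seq[i:] for i in range(n)])

def skipgrams(sequence, n, k):
    # Keep each window's head token plus every combination of n - 1
    # tokens from the tail, discarding combinations that end in padding.
    for gram in ngrams(sequence, n + k, pad_right=True):
        head, tail = gram[:1], gram[1:]
        for skip_tail in combinations(tail, n - 1):
            if skip_tail[-1] is None:
                continue
            yield head + skip_tail

# Doctest-style check against the paper's example sentence:
sent = "Insurgents killed in ongoing fighting".split()
assert list(skipgrams(sent, 2, 2)) == [
    ('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'),
    ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'),
    ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]

# Fail-safe: when n + k > len(sequence), the result collapses to the
# same output as a smaller k, because the padded windows run out of
# real tokens:
assert list(skipgrams(['a', 'b', 'c'], 2, 5)) == list(skipgrams(['a', 'b', 'c'], 2, 1))
```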
But do note that if n+k > len(sequence), it will yield the same effects as skipgrams(sequence, n, k-1) (this is not a bug, it's a fail-safe feature). This allows n == k but disallows n > k, as shown in these lines:

For understanding's sake, let's try to understand the "mystical" line:
Given a list of unique items, combinations produces this:
And since the indices of a list of tokens are always unique, e.g.:
It's possible to compute the possible combinations (without replacement) of the range:
If we map the indices back to the list of tokens:
Then we concatenate with the current_token, and we get the skipgrams for the current token and its context+skip window. So after that we move on to the next word.
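The steps just described can be sketched concretely (the tokens and the current_token value here are illustrative, not the answer's original data):

```python
from itertools import combinations

# Given a list of unique items, combinations produces every unordered
# selection without replacement:
print(list(combinations(['happened', 'more', 'or'], 2)))
# [('happened', 'more'), ('happened', 'or'), ('more', 'or')]

# The indices of a token list are always unique, so the same trick
# works on the range of indices:
tokens = ['happened', 'more', 'or']
idx_pairs = list(combinations(range(len(tokens)), 2))  # [(0, 1), (0, 2), (1, 2)]

# Map the indices back to tokens and concatenate with the current token:
current_token = 'this'
grams = [(current_token,) + tuple(tokens[i] for i in pair) for pair in idx_pairs]
print(grams)
# [('this', 'happened', 'more'), ('this', 'happened', 'or'), ('this', 'more', 'or')]
```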
How about using someone else's implementation, https://github.com/heaven00/skipgram/blob/master/skipgram.py, where k = skip_size and n = ngram_order:

EDITED
The latest NLTK version 3.2.5 has skipgrams implemented.

Here's a cleaner implementation from @jnothman from the NLTK repo: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L538
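That snippet did not survive the formatting here. Roughly, it slides a right-padded window and discards combinations that end in padding; a self-contained sketch under that reading (the ngrams helper and the sentinel detail are my paraphrase; consult the linked source for the authoritative version):

```python
from itertools import combinations

def ngrams(sequence, n, pad_right=False, right_pad_symbol=None):
    # Minimal stand-in for nltk.util.ngrams.
    seq = list(sequence)
    if pad_right:
        seq = seq + [right_pad_symbol] * (n - 1)
    return zip(*[seq[i:] for i in range(n)])

def skipgrams(sequence, n, k):
    # A unique sentinel marks the padding, so None stays usable as a
    # real token in the input sequence.
    SENTINEL = object()
    for gram in ngrams(sequence, n + k, pad_right=True,
                       right_pad_symbol=SENTINEL):
        head, tail = gram[:1], gram[1:]
        for skip_tail in combinations(tail, n - 1):
            if skip_tail[-1] is SENTINEL:
                continue
            yield head + skip_tail

print(list(skipgrams("Insurgents killed in ongoing fighting".split(), 2, 2)))
```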
Although this would depart entirely from your code and defer it to an external library, you can use Colibri Core (https://proycon.github.io/colibri-core) for skipgram extraction. It's a library written specifically for efficient n-gram and skipgram extraction from big text corpora. The code base is in C++ (for speed/efficiency), but a Python binding is available.
You rightfully mentioned efficiency, as skipgram extraction quickly shows exponential complexity. This may not be a big issue if you only pass a sentence as you did in your input_list, but it becomes problematic if you release it on large corpus data. To mitigate this you can set parameters such as an occurrence threshold, or demand that each skip of a skipgram be fillable by at least x distinct n-grams.

There's a more extensive Python tutorial about all this on the website.
Disclaimer: I'm the author of Colibri Core
Refer to this for complete info.

The example below, covering its usage, has already been mentioned there and works like a charm!