I have a list of sentences:
text = ['cant railway station', 'citadel hotel', ' police stn']
I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I did:
import nltk

text2 = [[word for word in line.split()] for line in text]
bigrams = list(nltk.bigrams(text2))  # materialize the pairs so they can be printed
print(bigrams)
which yields
[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]
Here 'cant railway station' and 'citadel hotel' form one bigram. What I want is
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), and so on...]
The last word of the first sentence should not merge with the first word of the second sentence. What should I do to make it work?
Just fixing Dan's code:
Using list comprehensions and zip:
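A minimal sketch of that idea, using the list from the question. Zipping each sentence's word list with itself shifted by one gives the bigrams, and doing it per sentence keeps pairs from crossing sentence boundaries:

```python
text = ['cant railway station', 'citadel hotel', ' police stn']

# Pair each word with its successor, one sentence at a time, so that
# no bigram spans a sentence boundary.
bigrams = [
    pair
    for line in text
    for words in [line.split()]
    for pair in zip(words, words[1:])
]
print(bigrams)
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
```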
Rather than turning your text into lists of strings, start with each sentence separately as a string. I've also removed punctuation and stopwords; drop those parts if they are irrelevant to you:
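Something along these lines should do, as a rough sketch. The function name sentence_bigrams and the tiny STOPWORDS set are placeholders I made up for illustration; a real stopword list (NLTK ships one, for instance) would be longer:

```python
import string

# Hypothetical stopword list for illustration only; swap in a fuller one.
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'in', 'to', 'is'}

def sentence_bigrams(sentence, stopwords=STOPWORDS):
    """Return the word bigrams of a single sentence after basic cleaning."""
    # Strip punctuation and lowercase before splitting into words.
    cleaned = sentence.translate(str.maketrans('', '', string.punctuation)).lower()
    # Drop stopwords, then pair each remaining word with its successor.
    words = [w for w in cleaned.split() if w not in stopwords]
    return list(zip(words, words[1:]))

text = ['cant railway station', 'citadel hotel', ' police stn']
# Each sentence is processed on its own, so bigrams never cross sentences.
bigrams = [pair for sentence in text for pair in sentence_bigrams(sentence)]
print(bigrams)
```

Keep in mind that removing stopwords before pairing makes words adjacent that were not adjacent in the original sentence. If that is not what you want, build the bigrams first and filter afterwards.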
Use it like so:
Note that this goes a little further and actually statistically scores the bigrams (which will come in handy in training the model).
Read the dataset
Collect all available months
Create tokens of all tweets per month
Create bigrams per month
Count bigrams per month
Wrap up the result in neat dataframes
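The steps above can be sketched with the standard library alone. The tweets list here is made-up sample data standing in for the real dataset, and the names tokens_by_month, counts_by_month and rows are my own:

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (ISO date, tweet text) pairs standing in for the dataset.
tweets = [
    ('2020-01-03', 'new railway station opens'),
    ('2020-01-15', 'railway station crowded again'),
    ('2020-02-01', 'citadel hotel fully booked'),
]

# Collect the tokens of every tweet, grouped by month (YYYY-MM).
tokens_by_month = defaultdict(list)
for date, tweet in tweets:
    tokens_by_month[date[:7]].append(tweet.lower().split())

# Build and count bigrams per month, never pairing words across tweets.
counts_by_month = {
    month: Counter(pair for words in tweet_tokens for pair in zip(words, words[1:]))
    for month, tweet_tokens in tokens_by_month.items()
}

# Flatten to (month, bigram, count) rows, most frequent bigram first per month.
rows = [
    (month, ' '.join(bigram), count)
    for month, counts in sorted(counts_by_month.items())
    for bigram, count in counts.most_common()
]
```

For the last step, rows can be handed to pandas, e.g. pandas.DataFrame(rows, columns=['month', 'bigram', 'count']), assuming pandas is installed.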
Without nltk:
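A plain-loop sketch, with no imports at all. The function name bigrams_per_sentence is just something I picked for the example:

```python
def bigrams_per_sentence(sentences):
    """Return word bigrams for each sentence, without crossing sentence boundaries."""
    result = []
    for sentence in sentences:
        words = sentence.split()
        # Stop one short of the end so words[i + 1] always exists.
        for i in range(len(words) - 1):
            result.append((words[i], words[i + 1]))
    return result

text = ['cant railway station', 'citadel hotel', ' police stn']
print(bigrams_per_sentence(text))
```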
I think the best and most general way to do it is the following:
or in other words:
This should work for any n and any sequence l. If there are no ngrams of length n, it returns the empty list.
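One way to write such a function, as a sketch of the general idea rather than a definitive version. The name ngrams is my choice, and it assumes l supports slicing (a list, tuple or string):

```python
def ngrams(l, n):
    """Return all n-grams of the sequence l as a list of tuples."""
    # l[i:] is the sequence shifted left by i positions; zipping the n
    # shifted copies lines each element up with its n - 1 successors.
    # zip stops at the shortest copy, so no partial n-grams are produced.
    return list(zip(*[l[i:] for i in range(n)]))

text = ['cant railway station', 'citadel hotel', ' police stn']
# Apply it per sentence so that n-grams never cross sentence boundaries.
bigrams = [gram for line in text for gram in ngrams(line.split(), 2)]
print(bigrams)
```

If n is larger than len(l), the last shifted copy is empty, so zip yields nothing and the result is the empty list.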