Forming bigrams of words in a list of sentences with Python

Posted 2020-02-17 09:12

I have a list of sentences:

text = ['cant railway station','citadel hotel',' police stn']

I need to form bigram pairs and store them in a variable. The problem is that when I do this, I get pairs of sentences instead of pairs of words. Here is what I did:

import nltk

text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(list(bigrams))

which yields

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

That is, 'cant railway station' and 'citadel hotel' form one bigram, which is not what I want. What I want is

[(['cant'], ['railway']), (['railway'], ['station']), (['citadel'], ['hotel']), and so on...

The last word of the first sentence should not merge with the first word of the second sentence. What should I do to make this work?

9 Answers
何必那么认真
Answer 2 · 2020-02-17 09:50

Just fixing Dan's code (the get_bigrams answer below):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    # Append each bigram to the token list as a single space-joined string
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    # Stem and lowercase, dropping stopwords and short strings
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()])
              for x in tokens
              if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
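
A quick usage sketch on the question's data (the exact strings depend on the PorterStemmer, but the lone unigrams are dropped by the len(x) > 8 filter):

print(get_bigrams('cant railway station'))
# expect the stemmed bigrams to survive, e.g. ['cant railway', 'railway station']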
Root(大扎)
Answer 3 · 2020-02-17 09:51

Using list comprehensions and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
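
For reference, nltk.bigrams gives the same per-sentence pairs if you apply it inside the comprehension instead of to the whole list; a small sketch on the question's data:

>>> import nltk
>>> text = ['cant railway station', 'citadel hotel', ' police stn']
>>> bigrams = [b for line in text for b in nltk.bigrams(line.split())]
>>> print(bigrams)
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]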
叼着烟拽天下
Answer 4 · 2020-02-17 09:52

Rather than turning your text into lists of strings, start with each sentence separately as a string. I've also removed punctuation and stopwords; just remove those parts if they aren't relevant to you:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    # Append each bigram to the token list as a single space-joined string
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    # Stem and lowercase, dropping stopwords and short strings
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()])
              for x in tokens
              if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

To use it, call it like this:

for line in sentences:  # sentences: your list of sentence strings
    features = get_bigrams(line)
    # build your training set here

Note that this goes a little further and actually scores the bigrams statistically (which will come in handy when training the model).
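
If you want the scores themselves rather than just the top-ranked pairs, the finder can return them directly; a minimal sketch, assuming an illustrative tokens list:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

tokens = 'cant railway station citadel hotel police stn'.split()
finder = BigramCollocationFinder.from_words(tokens)

# score_ngrams returns ((w1, w2), score) pairs, highest-scoring first
for bigram, score in finder.score_ngrams(BigramAssocMeasures.chi_sq):
    print(bigram, score)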

我只想做你的唯一
Answer 5 · 2020-02-17 09:52

Read the dataset

import pandas as pd
import nltk

df = pd.read_csv('dataset.csv', skiprows=6, index_col="No")

Collect all available months

df["Month"] = df["Date(ET)"].apply(lambda x : x.split('/')[0])

Create tokens of all tweets per month

tokens = df.groupby("Month")["Contents"].sum().apply(lambda x : x.split(' '))

Create bigrams per month

bigrams = tokens.apply(lambda x : list(nltk.ngrams(x, 2)))

Count bigrams per month

count_bigrams = bigrams.apply(lambda x : list(x.count(item) for item in x))

Wrap up the result in neat dataframes

month1 = pd.DataFrame(data = count_bigrams[0], index= bigrams[0], columns= ["Count"])
month2 = pd.DataFrame(data = count_bigrams[1], index= bigrams[1], columns= ["Count"])
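
As a side note, pandas can count the bigrams directly with value_counts, which avoids the quadratic list.count pass above; a sketch under the same assumed dataframe layout:

# One row per distinct bigram, with its frequency in that month
month1 = pd.Series(bigrams[0]).value_counts().to_frame("Count")
month2 = pd.Series(bigrams[1]).value_counts().to_frame("Count")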
你好瞎i
Answer 6 · 2020-02-17 09:56

Without nltk:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])

print(ans)  # prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]
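
If you want plain word pairs rather than nested single-element lists (the shape the other answers produce), appending a tuple is a one-line change:

ans.append((arr[i], arr[i+1]))  # yields [('cant', 'railway'), ('railway', 'station'), ...]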
Bombasti
Answer 7 · 2020-02-17 10:01

I think the best and most general way to do it is the following:

n      = 2
ngrams = []

for l in L:
    for i in range(n,len(l)+1):
        ngrams.append(l[i-n:i])

or in other words:

ngrams = [ l[i-n:i] for l in L for i in range(n,len(l)+1) ]

This works for any n and any sequence l; if there are no ngrams of length n, it returns the empty list.
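
Applied to the question's data, each l would be a sentence's word list; a quick sketch:

text = ['cant railway station', 'citadel hotel', ' police stn']
L    = [line.split() for line in text]
n    = 2

ngrams = [l[i-n:i] for l in L for i in range(n, len(l)+1)]
print(ngrams)
# [['cant', 'railway'], ['railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]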
