Using ngram in Python my aim is to find out verbs and their corresponding adverbs from an input text.
What I have done:
Input text:""He is talking weirdly. A horse can run fast. A big tree is there. The sun is beautiful. The place is well decorated.They are talking weirdly. She runs fast. She is talking greatly.Jack runs slow.""
Code:-
`finder2 = BigramCollocationFinder.from_words(wrd for (wrd,tags) in posTagged if tags in('VBG','RB','VBN',))
scored = finder2.score_ngrams(bigram_measures.raw_freq)
print sorted(finder2.nbest(bigram_measures.raw_freq, 5))`
From my code, I got the output:
[('talking', 'greatly'), ('talking', 'weirdly'), ('weirdly', 'talking'),('runs','fast'),('runs','slow')]
which is the list of verbs and their corresponding adverbs.
What I am looking for:
I want to figure out verb and all corresponding adverbs from this. For example ('talking'- 'greatly','weirdly),('runs'-'fast','slow')etc.
You already have a list of all verb-adverb bigrams, so you're just asking how to consolidate them into a dictionary that gives all adverbs for each verb. But first let's re-create your bigrams in a more direct way:
pairs = list()
for (w1, tag1), (w2, tag2) in nltk.bigrams(posTagged):
if t1.startswith("VB") and t2 == "RB":
pairs.append((w1, w2))
Now for your question: We'll build a dictionary with the adverbs that follow each verb. I'll store the adverbs in a set, not a list, to get rid of duplications.
from collections import defaultdict
consolidated = defaultdict(set)
for verb, adverb in pairs:
consolidated[verb].add(adverb)
The defaultdict
provides an empty set for verbs that haven't been seen before, so we don't need to check by hand.
Depending on the details of your assignment, you might also want to case-fold and lemmatize your verbs so that the adverbs from "Driving recklessly" and "I drove carefully" are recorded together:
wnl = nltk.stem.WordNetLemmatizer()
...
for verb, adverb in pairs:
verb = wnl.lemmatize(verb.lower(), "v")
consolidated[verb].add(adverb)
I think you are losing information you will need for this. You need to retain the part-of-speech data somehow, so that bigrams like ('weirdly', 'talking')
can be processed in the correct manner.
It may be that the bigram finder can accept the tagged word tuples (I'm not familiar with nltk). Or, you may have to resort to creating an external index. If so, something like this might work:
part_of_speech = {word:tag for word,tag in posTagged}
best_bigrams = finger2.nbest(... as you like it ...)
verb_first_bigrams = [b if part_of_speech[b[1]] == 'RB' else (b[1],b[0]) for b in best_bigrams]
Then, with the verbs in front, you can transform it into a dictionary or list-of-lists or whatever:
adverbs_for = {}
for verb,adverb in verb_first_bigrams:
if verb not in adverbs_for:
adverbs_for[verb] = [adverb]
else:
adverbs_for[verb].append(adverb)