Creating n-grams word cloud using python

I am trying to generate word cloud using bi-grams. I am able to generate the top 30 discriminative words but unable to display words together while plotting. My word cloud image still looks like a uni-gram cloud. I have used the following script and sci-kit learn packages.

def create_wordcloud(pipeline): 
"""
Create word cloud with top 30 discriminative words for each category
"""

class_labels = numpy.array(['Arts','Music','News','Politics','Science','Sports','Technology'])

feature_names =pipeline.named_steps['vectorizer'].get_feature_names() 
word_text=[]

for i, class_label in enumerate(class_labels):
    top30 = numpy.argsort(pipeline.named_steps['clf'].coef_[i])[-30:]

    print("%s: %s" % (class_label," ".join(feature_names[j]+"," for j in top30)))

    for j in top30:
        word_text.append(feature_names[j])
    #print(word_text)
    wordcloud1 = WordCloud(width = 800, height = 500, margin=10,random_state=3, collocations=True).generate(' '.join(word_text))

    # Save word cloud as .png file
    # Image files are saved to the folder "classification_model" 
    wordcloud1.to_file(class_label+"_wordcloud.png")

    # Plot wordcloud on console
    plt.figure(figsize=(15,8))
    plt.imshow(wordcloud1, interpolation="bilinear")
    plt.axis("off")
    plt.show()
    word_text=[]

This is my pipeline code

pipeline = Pipeline([

# SVM using TfidfVectorizer
('vectorizer', TfidfVectorizer(max_features = 25000, ngram_range=(2, 2),sublinear_tf=True, max_df=0.95, min_df=2,stop_words=stop_words1)),
('clf',       LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
])

These are some of the features I got for the category "Arts"

Arts: cosmetics businesspeople, television personality, reality television, television presenters, actors london, film producers, actresses television, indian film, set index, actresses actresses, television actors, century actors, births actors, television series, century actresses, actors television, stand comedian, television personalities, television actresses, comedian actor, stand comedians, film actresses, film actors, film directors

I think you need somehow to join your n-gramms in feature_names with any other symbol than space. I propose underscore, for example. For now, this part makes your n-gramms separate words again, I think:

' '.join(word_text)

I think you have to substitute space with underscore here:

word_text.append(feature_names[j])

changing to this:

word_text.append(feature_names[j].replace(' ', '_'))

Creating n-grams word cloud using python

问题:

回答1:

收藏的人(0)

Creating n-grams word cloud using python

问题:

回答1:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮