2017-07-19 70 views

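Creating an n-gram word cloud with Python

I am trying to generate a word cloud from bigrams. I can extract the top 30 discriminative bigrams for each category, but when I plot them the two words of each bigram are not kept together, so the image still looks like a unigram cloud. I am using the script below together with the scikit-learn package.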

import numpy
import matplotlib.pyplot as plt
from wordcloud import WordCloud


def create_wordcloud(pipeline):
    """
    Create a word cloud with the top 30 discriminative words for each category.
    """
    class_labels = numpy.array(['Arts', 'Music', 'News', 'Politics', 'Science', 'Sports', 'Technology'])

    # On recent scikit-learn versions this method is named get_feature_names_out()
    feature_names = pipeline.named_steps['vectorizer'].get_feature_names()
    word_text = []

    for i, class_label in enumerate(class_labels):
        # Indices of the 30 largest coefficients for this class
        top30 = numpy.argsort(pipeline.named_steps['clf'].coef_[i])[-30:]

        print("%s: %s" % (class_label, " ".join(feature_names[j] + "," for j in top30)))

        for j in top30:
            word_text.append(feature_names[j])
        # print(word_text)
        wordcloud1 = WordCloud(width=800, height=500, margin=10, random_state=3, collocations=True).generate(' '.join(word_text))

        # Save word cloud as .png file
        # Image files are saved to the folder "classification_model"
        wordcloud1.to_file(class_label + "_wordcloud.png")

        # Plot wordcloud on console
        plt.figure(figsize=(15, 8))
        plt.imshow(wordcloud1, interpolation="bilinear")
        plt.axis("off")
        plt.show()

        # Reset the list before processing the next category
        word_text = []

This is my pipeline code:

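from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# stop_words1 is not shown in the question; it is defined elsewhere in the script
pipeline = Pipeline([
    # SVM using TfidfVectorizer
    ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(2, 2), sublinear_tf=True, max_df=0.95, min_df=2, stop_words=stop_words1)),
    ('clf', LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
])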
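
For readers wondering why every feature name is a two-word phrase, here is a simplified sketch of what `ngram_range=(2, 2)` roughly does. It is not scikit-learn's implementation: the helper name `bigrams` and the sample sentence are made up for illustration, and the token pattern is assumed to match the vectorizer's documented default.

```python
import re


def bigrams(text, stop_words=frozenset()):
    """Return adjacent word pairs, roughly as a (2, 2) n-gram range would."""
    # Lowercase the text and keep tokens of two or more word characters
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    # Stop words are dropped before the pairs are formed
    tokens = [t for t in tokens if t not in stop_words]
    # Pair each token with its right-hand neighbour
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Hypothetical sample sentence, not taken from the asker's data
sample = "Television actors and film directors"
pairs = bigrams(sample, stop_words={"and"})
print(pairs)
```

Each resulting feature name contains a space, which matters once the names are joined into a single string for the word cloud.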

These are the features I get for the "Arts" category:

Arts: cosmetics businesspeople, television personality, reality television, television presenters, actors london, film producers, actresses television, indian film, set index, actresses actresses, television actors, century actors, births actors, television series, century actresses, actors television, stand comedian, television personalities, television actresses, comedian actor, stand comedians, film actresses, film actors, film directors 

Answer


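I think you need to join the words inside each n-gram in feature_names with some symbol other than a space. An underscore, for example. As it stands, I believe this part turns your n-grams back into separate words: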

' '.join(word_text) 
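
A small sketch of why this happens, using only the standard library. The regular expression is an assumption about how a word-cloud tokenizer might split text on word characters, and the two bigrams are taken from the "Arts" output in the question.

```python
import re

# Two bigram features, each containing a space
top_bigrams = ["television personality", "film directors"]

# Joining with a space erases the boundary between one bigram and the next
text = " ".join(top_bigrams)

# Assumed word-level pattern; a tokenizer like this sees four separate words
tokens = re.findall(r"\w[\w']+", text)
print(tokens)
```

After the join there is no way to tell which spaces separated bigrams and which sat inside one, so the cloud ends up built from single words.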

I think you need to replace the spaces with underscores here:

word_text.append(feature_names[j]) 

Change it to this:

word_text.append(feature_names[j].replace(' ', '_')) 
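
The effect of this change can be checked without generating an image. The sketch below is only illustrative: the helper name `to_cloud_text` is hypothetical, and the regular expression is again an assumption about word-level tokenization.

```python
import re


def to_cloud_text(ngrams):
    """Join n-grams with spaces after gluing each one together with underscores."""
    return " ".join(ngram.replace(" ", "_") for ngram in ngrams)


# Two bigram features from the "Arts" output in the question
top_bigrams = ["television personality", "film directors"]
cloud_text = to_cloud_text(top_bigrams)

# An underscore counts as a word character, so each bigram stays one token
cloud_tokens = re.findall(r"\w[\w']+", cloud_text)
print(cloud_tokens)
```

The string returned by `to_cloud_text` is what would be passed to `WordCloud(...).generate(...)`. It may also be worth trying `collocations=False` so the library does not pair up the tokens on its own, though that is a suggestion rather than something confirmed in this thread.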

It did not work. It joins all the words with (_) without any breaks between them. – VKB


I have edited my answer. Have you tried something like this? – CrazyElf


Thank you, it works. – VKB