Setting the vocabulary of Sklearn's CountVectorizer to a list of phrases

Hi, I've been experimenting with text analysis in scikit-learn, and I had the idea of using CountVectorizer to detect whether a document contains a given set of keywords and phrases.

I know we can do this:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=words)
dtm = vect.fit_transform(example)
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

...

cat  dog  walking
  1    1        1

I'd like to know whether this can be adjusted so that I can use phrases rather than only individual words.

Building on the example above:

phrases = ['cat in the park', 'walking my dog'] 
example = ['I was walking my dog and cat in the park'] 
vect = CountVectorizer(vocabulary=phrases) 
dtm = vect.fit_transform(example) 
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names()) 
... 

cat in the park  walking my dog
              1               1

Right now, the code using phrases just outputs

cat in the park  walking my dog
              0               0

Thanks in advance!

Asked 2017-04-18 by cgclip

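Try this. By default CountVectorizer only extracts single words (ngram_range=(1, 1)), so multi-word vocabulary entries never match. Set ngram_range to cover the word counts of your phrases: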

In [104]: lens = [len(x.split()) for x in phrases] 

In [105]: mn, mx = min(lens), max(lens) 

In [106]: vect = CountVectorizer(vocabulary=phrases, ngram_range=(mn, mx)) 

In [107]: dtm = vect.fit_transform(example) 

In [108]: pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names()) 
Out[108]: 
   cat in the park  walking my dog
0                1               1

In [109]: print(mn, mx) 
3 4

Answered 2017-04-18 19:03:12 by MaxU

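This works on the example above, but when I use the method on what I've built, where the vocabulary is set by a function, it doesn't detect the phrases in the documents. I'll try to troubleshoot it myself and see what's happening. – cgclip

@cgclip, please try to provide a __reproducible__ sample data set. Please consider [accepting](http://meta.stackexchange.com/a/5235) an answer if you think it has answered your question – MaxU

Got it, I've accepted the answer. Thanks again for your help! – cgclip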
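
The thread does not say what went wrong in the asker's own data. One way to investigate a case like that is to compare the vocabulary against the n-grams the analyzer actually produces. The sketch below is only an assumption about where to look: unmatched_phrases is a hypothetical helper, not part of scikit-learn, and the sample documents and phrases are made up. It relies on two default behaviours of CountVectorizer: documents are lowercased while vocabulary entries are used as given, and the default token pattern drops single-character tokens such as "a" or "I".

```python
from sklearn.feature_extraction.text import CountVectorizer


def unmatched_phrases(phrases, docs, **vectorizer_kwargs):
    """Return the phrases that no document yields as an n-gram (hypothetical helper)."""
    lens = [len(p.split()) for p in phrases]
    vect = CountVectorizer(ngram_range=(min(lens), max(lens)), **vectorizer_kwargs)
    # build_analyzer() returns the preprocessing + tokenizing + n-gram function.
    analyze = vect.build_analyzer()
    seen = set()
    for doc in docs:
        seen.update(analyze(doc))
    return [p for p in phrases if p not in seen]


docs = ['I was walking my dog and cat in the park', 'He was walking a dog']
phrases = ['walking my dog', 'Cat in the park', 'walking a dog']

missing = unmatched_phrases(phrases, docs)
print(missing)
```

Here 'Cat in the park' is reported because the document n-grams are lowercased, and 'walking a dog' because the single-character token "a" is dropped before n-grams are built. Lowercasing the phrases, or passing a different token_pattern, are possible remedies depending on the data.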
