Pandas and NLTK: getting the most common phrases

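I'm fairly new to Python and am working with a pandas DataFrame that has a column full of text. I'm trying to take that column and use NLTK to find common phrases (three or four words).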

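from nltk.tokenize import word_tokenize
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

bigram_measures = BigramAssocMeasures()

# strip punctuation, then lowercase
dat["text_clean"] = dat["Description"].str.replace(r'[^\w\s]', '', regex=True).str.lower()

# each row becomes a list of tokens
dat["text_clean2"] = dat["text_clean"].apply(word_tokenize)

finder = BigramCollocationFinder.from_words(dat["text_clean2"])
finder
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 10 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))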

The initial steps seem to work fine. However, when I try to use BigramCollocationFinder, it throws the following error.

In [437]: finder = BigramCollocationFinder.from_words(dat["text_clean2"])
finder 

Traceback (most recent call last): 

    File "<ipython-input-437-635c3b3afaf4>", line 1, in <module> 
    finder = BigramCollocationFinder.from_words(dat["text_clean2"]) 

    File "/Users/abrahammathew/anaconda/lib/python2.7/site-packages/nltk/collocations.py", line 168, in from_words 
    wfd[w1] += 1 

TypeError: unhashable type: 'list'

Any idea what this is, or how to work around it?

I get the same error with the following commands.

gg = dat["text_clean2"].tolist()  
finder = BigramCollocationFinder.from_words(gg) 
finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1,))

The following runs, but returns no common phrases.

gg = dat["Description"].str.replace(r'[^\w\s]', '', regex=True).str.lower()
finder = BigramCollocationFinder.from_words(gg)
finder
# only bigrams that appear 2+ times
finder.apply_freq_filter(2)
# return the 10 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))

2017-07-25 ATMA

It seems that the BigramCollocationFinder class expects a list of words, not a list of lists. Try this:

finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1,))
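
As the question notes, reshape(-1,) still leaves every element a list, so the same TypeError appears. Here is a minimal sketch of one alternative, assuming a reasonably recent NLTK: the from_documents constructor takes a list of token lists directly. The token_lists variable is a made-up stand-in for dat["text_clean2"].tolist(), not data from the question.

```python
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Hypothetical stand-in for dat["text_clean2"].tolist():
# one list of tokens per row of the DataFrame.
token_lists = [
    ["new", "york", "city"],
    ["new", "york", "state"],
    ["i", "love", "new", "york"],
]

# from_documents accepts an iterable of token lists,
# whereas from_words expects one flat sequence of tokens.
finder = BigramCollocationFinder.from_documents(token_lists)

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# the (up to) 10 surviving bigrams with the highest PMI
top = finder.nbest(BigramAssocMeasures().pmi, 10)
print(top)
```

Flattening the rows with itertools.chain.from_iterable and passing the result to from_words should also avoid the error, although it pairs the last word of one row with the first word of the next.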

2017-07-25 14:59:52

You may have to convert the list of lists into a list of tuples. Hope this works.

dat['text_clean2'] = [tuple(x) for x in dat['text_clean2']] 
finder = BigramCollocationFinder.from_words(dat["text_clean2"])
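
One caveat with tuples: they are hashable, so the TypeError goes away, but from_words then treats each whole row as a single "word" and pairs up adjacent rows rather than adjacent words. Since the question asks for three- or four-word phrases, the sketch below counts n-grams inside each row using only the standard library. The ngram_counts helper and the rows data are hypothetical names invented for illustration; NLTK's TrigramCollocationFinder and QuadgramCollocationFinder are the library counterparts if PMI scoring is needed.

```python
from collections import Counter


def ngram_counts(token_lists, n):
    """Count n-word phrases without letting a phrase span two rows."""
    counts = Counter()
    for tokens in token_lists:
        # zipping n staggered slices of a row yields its consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical stand-in for dat["text_clean2"].tolist()
rows = [
    ["the", "quick", "brown", "fox"],
    ["a", "quick", "brown", "fox", "jumps"],
    ["quick", "brown", "dog"],
]

trigrams = ngram_counts(rows, 3)
quadgrams = ngram_counts(rows, 4)

# keep only three-word phrases seen at least twice, most frequent first
common = [(phrase, c) for phrase, c in trigrams.most_common() if c >= 2]
print(common)
```

Raw counts favor frequent words; the PMI ranking used in the question instead favors words that occur together more often than chance would predict.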

2017-07-25 15:15:04 Dark
