How do I split a list of phrases into words so that I can use Counter?

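My data consists of conversation threads from a web forum. I wrote a function to clean the data of stop words, punctuation, and so on. Then I wrote a loop that cleans every post in my CSV file and puts the results into a list. Then I did the word count. My problem is that the list contains unicode phrases rather than individual words. How can I split these phrases so that they become individual words I can count? Here is my code: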

def post_to_words(raw_post): 
     HTML_text = BeautifulSoup(raw_post).get_text() 
     letters_only = re.sub("[^a-zA-Z]", " ", HTML_text) 
     words = letters_only.lower().split() 
     stops = set(stopwords.words("english")) 
     meaningful_words = [w for w in words if not w in stops] 
     return(" ".join(meaningful_words)) 

clean_Post_Text = post_to_words(fiance_forum["Post_Text"][0]) 
clean_Post_Text_split = clean_Post_Text.lower().split() 
num_Post_Text = fiance_forum["Post_Text"].size 
clean_posts_list = [] 

for i in range(0, num_Post_Text): 
    clean_posts_list.append(post_to_words(fiance_forum["Post_Text"][i])) 

from collections import Counter
counts = Counter(clean_posts_list)
print(counts)

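My output looks like this: u'please follow instructions notice move receiver': 1. I would like it to look like this:

please: 1

follow: 1

instructions: 1

and so on... Thank you very much!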

Source

2016-05-24 glongo

What is the ': 1'? – Coder256

@Coder256, Gina's code is showing one instance of the whole unicode string, rather than counting one instance of each word in the string. –

If you want the individual words, why are you using str.join? –
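To illustrate the comment above, here is a minimal sketch of why joining the words back together changes what Counter sees. The sample words and the variable names are made up for illustration; they are not from the original post.

```python
from collections import Counter

# Hypothetical cleaned words from a single post.
words = ["please", "follow", "instructions"]
joined = " ".join(words)

# A list holding the joined string: Counter treats the whole phrase as one item.
phrase_counts = Counter([joined])
print(phrase_counts)

# The word list itself: each word becomes its own key.
word_counts = Counter(words)
print(word_counts)
```

In the first case the only key is the full phrase, which matches the output described in the question.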

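You already have a list of words, so you don't need any splitting. Forget calling str.join, i.e. " ".join(meaningful_words); just create a Counter dict and update it each time you call post_to_words. You are also doing far more work than necessary: all you need to do is iterate over fiance_forum["Post_Text"], passing each element to the function. You also only need to create the set of stop words once, not on every iteration: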

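# Assumed imports (not shown in the original post).
import re
from collections import Counter

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def post_to_words(raw_post, st):
    # Strip HTML, keep letters only, lowercase, and split into words.
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    # Return the words lazily, skipping stop words.
    return (w for w in words if w not in st)

cn = Counter()
# Build the stop-word set once, outside the loop.
st = set(stopwords.words("english"))
for post in fiance_forum["Post_Text"]:
    cn.update(post_to_words(post, st))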

This also avoids having to build a huge list of words just to do the count.

Source

2016-05-24 21:34:02

OK, I tried this, but now I want to see the output. What would I type at this point? Sorry, I'm new to this. – glongo

@GinaBoBina, the counts are stored in cn –
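As a rough sketch of how the counts could be read back out of cn: the Counter below is a small hypothetical stand-in for the one built in the answer's loop, and top_three is a name introduced here for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the cn built by the loop in the answer.
cn = Counter(["please", "follow", "please", "notice", "follow", "please"])

# The three most frequent words with their counts, highest first.
top_three = cn.most_common(3)
for word, count in top_three:
    print("{}: {}".format(word, count))

# Looking up one word; a word that never occurred gives 0 instead of a KeyError.
print(cn["please"])
print(cn["receiver"])
```

Calling most_common() with no argument returns every word, which may be a lot of output for a large forum dump.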

You're almost there; all you need to do is split the string into words:

>>> from collections import Counter 
>>> Counter('please follow instructions notice move receiver'.split()) 
Counter({'follow': 1, 
     'instructions': 1, 
     'move': 1, 
     'notice': 1, 
     'please': 1, 
     'receiver': 1})

Source

2016-05-24 20:57:32 aldanor

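Thank you. I probably should have mentioned that there are over 400,000 posts, so I can't type in the full text of each one. Is there another way to get it to list the words like this? Sorry if the answer is obvious... I'm new. Thanks again! – glongo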

@GinaBoBina I'm not sure I follow – aldanor

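@aldanor ... Oh, sorry. I was just saying that Counter('please follow instructions notice move receiver'.split()) splits that one sentence apart, which is great. However, I have over 400,000 sentences to split. Does that make more sense? Thanks again! – glongo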
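One possible way to apply the same split to every cleaned post at once, without typing any of them by hand, is sketched below. It assumes clean_posts_list holds one cleaned string per post, as in the question; the two sample strings are made up.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for clean_posts_list: one cleaned string per post.
clean_posts_list = [
    "please follow instructions",
    "please notice move receiver",
]

# Split each phrase lazily and feed all the words into a single Counter.
counts = Counter(chain.from_iterable(p.split() for p in clean_posts_list))
print(counts.most_common(2))
```

Because the generator splits one post at a time, this does not build an intermediate list of every word before counting.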
