Python: best/efficient way to find a list of words in a text?

I have a list of about 300 words and a huge amount of text that I want to scan to find out how many times each word appears.

I am using Python's re module:

for word in list_word: 
    search = re.compile(r"""(\s|,)(%s).?(\s|,|\.|\))""" % word) 
    occurrences = search.subn("", text)[1]

but I am wondering whether there is a more efficient or more elegant way to do this?

2010-07-30 Mermoz

You can use word boundaries instead of checking for surrounding spaces and punctuation: '\bWORD\b' – mpen 2010-07-30 14:20:51
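The word-boundary suggestion can be combined with a single alternation so that the text is scanned once rather than once per word. This is only a sketch: it assumes every word starts and ends with a word character (otherwise `\b` will not behave as intended), and `count_words` and the sample data are illustrative names, not part of the original question.

```python
import re
from collections import Counter

def count_words(list_word, text):
    """Count whole-word occurrences of each word in list_word in one pass."""
    # Escape each word so regex metacharacters are matched literally;
    # try longer alternatives first.
    alternatives = sorted(map(re.escape, list_word), key=len, reverse=True)
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(alternatives))
    counts = Counter(m.group(0) for m in pattern.finditer(text))
    # Report zero for words that never appear.
    return {word: counts[word] for word in list_word}

text = "this and that, then this (that) and the other."
print(count_words(["this", "that", "the"], text))
# {'this': 2, 'that': 2, 'the': 1}
```

Note that "then" and "other" are not counted as "the", which is the point of the `\b` anchors.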

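If you want to go beyond word frequency and look at text classification, you may want to check out the following: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ – monkut 2010-07-30 14:30:49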

How **huge** can the text be if you are holding it in memory? – FMc 2010-07-30 17:16:13

If you have a large amount of text, I would not use regular expressions in this case, but simply split the text:

words = {"this": 0, "that": 0}
for w in text.split():
    if w in words:
        words[w] += 1

words will then give you the frequency of each word.
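The same single-pass idea can also be written with `collections.Counter` (available from Python 2.7 and 3.1). This is a sketch that assumes whitespace-delimited tokens and case-sensitive matching; `count_wanted` and `wanted` are illustrative names standing in for the 300-word list.

```python
from collections import Counter

def count_wanted(text, wanted):
    """Count only the tokens that appear in the wanted list, in one pass."""
    wanted = set(wanted)  # set membership tests are O(1) on average
    counts = Counter(w for w in text.split() if w in wanted)
    # Include a zero entry for wanted words that never occur.
    return {w: counts[w] for w in wanted}

sample = "this is that and this is not that this"
result = count_wanted(sample, ["this", "that", "missing"])
print(result["this"], result["that"], result["missing"])
# 3 2 0
```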

2010-07-30 14:25:40

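Definitely more efficient, since it scans the text only once. The snippet above seems to be missing the check that the word is one of the 300 "important" words. – pdbartlett 2010-07-30 14:28:12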

@pdbartlett 'if w in words' does that check. – Wilduck 2010-07-30 14:41:42

Splitting on whitespace will not always give perfect results. If you need more sophisticated splitting, have a look at NLTK, suggested below. – 2010-07-30 20:40:46
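To illustrate the limitation of whitespace splitting, the sketch below compares it with a simple regex tokenizer. The pattern is an assumption chosen for the example (ASCII letters, digits, and apostrophes only, lowercased); real text may need a proper tokenizer such as the ones NLTK provides.

```python
import re

text = "Stop! The word, the (word) and the WORD."

# Plain whitespace splitting keeps punctuation attached to the tokens.
print(text.split())
# ['Stop!', 'The', 'word,', 'the', '(word)', 'and', 'the', 'WORD.']

# Keep runs of ASCII letters, digits, and apostrophes, after lowercasing.
tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
print(tokens)
# ['stop', 'the', 'word', 'the', 'word', 'and', 'the', 'word']
```

With the whitespace split, a lookup for "word" finds nothing, because every occurrence carries punctuation or different casing.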

Googling "python frequency" gave me this page as the first result: http://www.daniweb.com/code/snippet216747.html

This seems to be what you are looking for.

2010-07-30 14:22:24

It is un-Pythonic with all those regexes. Splitting into separate words is best done with str.split() rather than a custom regex. – 2010-07-30 14:36:52

You are right: if Python's string functions are sufficient, they should be used instead of regexes. – 2010-07-30 16:36:51

You could also split the text into words and search the resulting list.

2010-07-30 14:23:04

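Regular expressions are probably not what you want. Python has a number of built-in string operations that are much faster, and I believe .count() does what you need.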

http://docs.python.org/library/stdtypes.html#string-methods
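One caveat worth checking before relying on `.count()`: on a string it counts non-overlapping substring matches, not whole words. A small sketch with made-up sample text:

```python
text = "the other theory, then the end"

# str.count counts non-overlapping substring matches, so "the" is also
# found inside "other", "theory", and "then".
substring_hits = text.count("the")

# Splitting first and counting list items matches whole tokens only.
word_hits = text.split().count("the")

print(substring_hits, word_hits)
# 5 2
```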

2010-07-30 14:24:01 chimeracoder

Try removing all punctuation from the text, then split on whitespace. After that, simply do

for word in list_word: 
    occurence = strippedText.count(word)

Or, if you are using Python 3.0, I think you can do:

occurences = {word: strippedText.count(word) for word in list_word}
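The stripping step is not spelled out above; one possible way to do it in Python 3 is `str.translate` with `string.punctuation`. This is a sketch under that assumption: `strip_and_split` is a hypothetical helper, and it handles ASCII punctuation only.

```python
import string

def strip_and_split(text):
    """Remove ASCII punctuation, then split on whitespace."""
    # Python 3: build a table that deletes every punctuation character.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

list_word = ["word1", "word2"]
strippedText = strip_and_split("word1, (word2) word1. word10!")
occurrences = {word: strippedText.count(word) for word in list_word}
print(occurrences)
# {'word1': 2, 'word2': 1}
```

Because `strippedText` is a list here, `count` matches whole tokens, so "word10" is not counted as "word1". Each `count` call still walks the entire list, so 300 words mean 300 passes over the text.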

2010-07-30 14:27:18 jacobangel

In 2.6 <= python < 3.0 you can do 'occurences = dict((word, strippedText.count(word)) for word in list_word)' – Wilduck 2010-07-30 14:44:55

If Python is not a must, you can use awk

$ cat file 
word1 
word2 
word3 
word4 

$ cat file1 
blah1 blah2 word1 word4 blah3 word2 
junk1 junk2 word2 word1 junk3 
blah4 blah5 word3 word6 end 

$ awk 'FNR==NR{w[$1];next} {for(i=1;i<=NF;i++) a[$i]++}END{for(i in w){ if(i in a) print i,a[i] } } ' file file1 
word1 2 
word2 2 
word3 1 
word4 1

2010-07-30 14:41:57 ghostdog74

It sounds like the Natural Language Toolkit might have what you need.

http://www.nltk.org/

2010-07-30 15:20:27 Glenjamin

The 'nltk.FreqDist' class. – 2010-07-30 20:38:44

Maybe you can adapt this multisearch generator function of mine.

from itertools import islice

testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."

def multis(search_sequence, text, start=0):
    """Multisearch for the given search sequence values in text, starting from position start.

    Yields tuples of (text before the found item, found item)."""
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            if x:
                yield (x, ch)
            else:
                yield ch  # separator with no text before it
            x = ''
        else:
            x += ch
    else:
        if x:
            yield x  # trailing text after the last separator

# Split off the first two sentences at the dot/question mark/exclamation mark.
two_sentences = list(islice(multis('.?!', testline), 2))  # must save the generator's results
print("result of split:", two_sentences)

print('\n'.join(sentence.strip() + sep for sentence, sep in two_sentences))

2010-07-30 15:56:07
