Python - Word occurrence count

I am trying to write a function that counts the occurrences of whole word(s), case-insensitively, in a text.

Example:

>>> text = """Antoine is my name and I like python. 
Oh ! your name is antoine? And you like Python! 
Yes is is true, I like PYTHON 
and his name__ is John O'connor""" 

assert(2 == Occs("Antoine", text)) 
assert(2 == Occs("ANTOINE", text)) 
assert(0 == Occs("antoin", text)) 
assert(1 == Occs("true", text))  
assert(0 == Occs("connor", text)) 
assert(1 == Occs("you like Python", text)) 
assert(1 == Occs("Name", text))

Here is a basic attempt:

def Occs(word,text): 
    return text.lower().count(word.lower())

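This one does not work, because it is not word-based.
The function has to be fast, since the text can be very large.

Should I split the text into an array?
Is there a simple way to write this function?

Edit (Python 2.3.4)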


2012-01-05 Pierre de LESPINAY

Regular expressions? http://docs.python.org/howto/regex.html – Li0liQ 2012-01-05 12:51:33

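How many queries do you have? If you have many of them, I suggest splitting the lowercased text into words (O(n)), sorting them, and searching the resulting list (binary search plus iterating over the adjacent records). – 2012-01-05 12:56:24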
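A rough sketch of the sort-and-binary-search idea from the comment above, for illustration only. The helper names `build_index` and `count_in_index` are made up here, the tokenizer (runs of letters and apostrophes) is an assumption, and it targets a modern Python rather than 2.3.4. Two binary searches stand in for "binary search plus iterating over adjacent records".

```python
import re
from bisect import bisect_left, bisect_right

def build_index(text):
    """Lowercase the text, split it into words and return them sorted."""
    # Assumption: a word is a run of letters and apostrophes.
    return sorted(re.findall(r"[a-z']+", text.lower()))

def count_in_index(index, word):
    """Count one word with two binary searches over the sorted list."""
    word = word.lower()
    return bisect_right(index, word) - bisect_left(index, word)

index = build_index("Antoine is my name. Is your name antoine? I like PYTHON")
print(count_in_index(index, "ANTOINE"))  # 2
print(count_in_index(index, "antoin"))   # 0
```

The index is built once and each later query costs O(log n). It only answers single-word queries, so a phrase such as "you like Python" would need a different approach.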

Why on earth do you have to be tied to Python 2.3? – jsbueno 2012-01-05 16:00:12

from collections import Counter 
import re 

Counter(re.findall(r"\w+", text))

or, for a case-insensitive version:

Counter(w.lower() for w in re.findall(r"\w+", text))

In Python < 2.7, use defaultdict (available since Python 2.5) instead of Counter:

import re
from collections import defaultdict

freq = defaultdict(int)
for w in re.findall(r"\w+", text):
    freq[w.lower()] += 1
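To answer the original question with this table, you would look up one word after counting. A small sketch, wrapping the loop above in a hypothetical helper named `word_frequencies`:

```python
import re
from collections import defaultdict

def word_frequencies(text):
    """Map each lowercased word to the number of times it occurs."""
    freq = defaultdict(int)
    for w in re.findall(r"\w+", text):
        freq[w.lower()] += 1
    return freq

freq = word_frequencies("I like python. You like Python! I like PYTHON")
print(freq["python"])         # 3
# .get() avoids inserting a new key for a word that never occurred.
print(freq.get("antoin", 0))  # 0
```

Note that `\w+` splits "O'connor" into "O" and "connor", so "connor" would be counted once, unlike the assertion in the question, and multi-word phrases cannot be looked up this way.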


2012-01-05 12:52:41

For the case-insensitive version, why not use the 're.IGNORECASE' flag? http://docs.python.org/library/re.html#re.IGNORECASE – 2012-01-05 13:12:57

@DaveWebb: 'IGNORECASE' ignores case while matching, but it does not lowercase the results of 'findall'. – 2012-01-05 13:14:12

The question asks to count one specific word rather than all words; I think 'IGNORECASE' makes more sense in that case. – 2012-01-05 13:26:03
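For counting one specific word with 'IGNORECASE', something along these lines might do. This is a sketch, not the answerer's code: the function name `count_word` and the choice of boundary characters (word characters plus the apostrophe) are assumptions picked to satisfy the assertions in the question.

```python
import re

def count_word(word, text):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # re.escape keeps characters such as "." or "?" in `word` literal.
    # The lookarounds reject a match glued to a letter, digit, underscore
    # or apostrophe, so "connor" is not found inside "O'connor".
    pattern = r"(?<![\w'])" + re.escape(word) + r"(?![\w'])"
    return len(re.findall(pattern, text, re.IGNORECASE))

text = "Antoine is my name. Oh! your name is antoine? his name__ is John O'connor"
print(count_word("ANTOINE", text))  # 2
print(count_word("connor", text))   # 0
```

A plain `\b` would not be enough here, because it treats the apostrophe as a boundary and would therefore count "connor" in "O'connor".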

See this question.

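One realization is that if your file is line-oriented, reading it line by line and calling a plain split() on each line will not be very expensive. This of course assumes that words cannot span line breaks in any way (no hyphenation).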
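A minimal sketch of the line-by-line idea, assuming the input is any iterable of lines (an open file, or `io.StringIO` as below). The name `count_word_by_line` is hypothetical, and stripping `string.punctuation` from each token is an assumption about what should count as the same word.

```python
import io
import string

def count_word_by_line(word, lines):
    """Count `word` (case-insensitive) in an iterable of lines."""
    word = word.lower()
    total = 0
    for line in lines:
        for token in line.lower().split():
            # Strip leading/trailing punctuation so "Python!" matches "python".
            if token.strip(string.punctuation) == word:
                total += 1
    return total

sample = io.StringIO(u"I like python.\nYou like Python!\nYes, I like PYTHON\n")
print(count_word_by_line("python", sample))  # 3
```

Only one line is held in memory at a time. Because underscores are stripped as punctuation, "name__" would count as "name" here, and multi-word phrases are not handled.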


2012-01-05 12:54:58 unwind

Thanks, but it is not specifically line-oriented – 2012-01-05 16:14:23

Here is a non-Pythonic way - I am assuming this is a homework question, anyway...

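def count(word, text):
    result = 0
    text = text.lower()
    word = word.lower()
    index = text.find(word, 0)
    while index >= 0:
        result += 1
        # Resume just past the start of the last hit; restarting at
        # `index` itself would find the same hit forever.
        index = text.find(word, index + 1)
    return result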

Of course, for really large files this will be slow, mainly because of the text.lower() call. But you can always come up with a case-insensitive find and fix that!

Why did I do it this way? Because I think it best captures what you are trying to do: go through text and count how many times you find word.

Also, this approach solves some nasty problems with punctuation: split would leave it in place and you would not get a match, would you?


2012-01-05 12:59:30

Wouldn't this match NumberOfOccurencesOfWordInText("antoin", text)? It shouldn't. Anyway, +1 for the lower() performance issue. – 2012-01-05 16:17:39

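@Glide Right, my bad. Still, the technique remains workable; you only need to check each match for word boundaries (start and end). There is no simple way to do that. You just have to scan the text. Consider building a specialized scanner at runtime that walks the text looking for the word. Like 'grep'. – 2012-01-06 08:27:10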
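One way the boundary check described in this comment could look, as a sketch only. The function name `count_whole_word` and the definition of a word character (letters, digits, underscore, apostrophe) are assumptions chosen to match the assertions in the question.

```python
def count_whole_word(word, text):
    """Count case-insensitive occurrences of `word` that stand alone in `text`."""
    def is_word_char(ch):
        # Assumption: letters, digits, "_" and "'" are part of a word.
        return ch.isalnum() or ch in "_'"

    text = text.lower()
    word = word.lower()
    if not word:
        return 0
    count = 0
    index = text.find(word)
    while index >= 0:
        end = index + len(word)
        # A hit counts only if it is not glued to a word character on either side.
        before_ok = index == 0 or not is_word_char(text[index - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            count += 1
        index = text.find(word, index + 1)
    return count

text = "Antoine is my name. Oh! your name is antoine? his name__ is John O'connor"
print(count_whole_word("antoin", text))   # 0
print(count_whole_word("Antoine", text))  # 2
```

This keeps the find-based scan and still pays for the lower() call on the whole text.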

+1 for word boundaries, I think that is the key – 2012-01-06 08:39:11

Thanks for your help.
Here is my solution:

import re 

starte = "(?<![a-z])((?<!')|(?<=''))" 
ende = "(?![a-z])((?!')|(?=''))" 

def NumberOfOccurencesOfWordInText(word, text): 
    """Returns the nb. of occurences of whole word(s) (case insensitive) in a text""" 
    pattern = (re.match('[a-z]', word, re.I) != None) * starte\ 
       + word\ 
       + (re.match('[a-z]', word[-1], re.I) != None) * ende 
    return len(re.findall(pattern, text, re.IGNORECASE))


2012-01-06 06:41:40

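Works for me, and it lets 'word' contain quotes and spaces. Have you found any other solution? Isn't this just the kind of overly risky regular expression? – olanod 2012-06-25 18:26:19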

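I was given exactly the same problem to solve, so I browsed a lot of related questions. That is why I wanted to share my solution here. Although my solution takes a while to execute, its internal processing time is better than I expected. I may be wrong. Anyway, here is the solution: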

def CountOccurencesInText(word, text):
    """Number of occurences of word (case insensitive) in text"""

    acceptedChar = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
                    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '-', ' ')

    # Replace separators with a space and drop the remaining punctuation.
    for x in ",!?;_\n«»():\".":
        if x == "\n" or x == "«" or x == "»" or x == "(" or x == ")" or x == "\"" or x == ":" or x == ".":
            text = text.replace(x, " ")
        else:
            text = text.replace(x, "")

    # This specifically handles the input: I am attaching my c.v. to this e-mail.
    if len(word) == 32:
        for x in ".":
            word = word.replace(x, " ")

    punc_Removed_Text = ""
    text = text.lower()

    for i in range(len(text)):
        # Slicing yields "" at the end of the text instead of raising IndexError.
        nextChar = text[i+1:i+2]

        if text[i] in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]

        # This specifically handles the input: Do I have to take that as a 'yes'
        elif text[i] == '\'' and text[i-1] == 's':
            punc_Removed_Text = punc_Removed_Text + text[i]

        elif text[i] == '\'' and text[i-1] in acceptedChar and nextChar in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]

        elif text[i] == '\'' and text[i-1] == " " and nextChar in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]

        elif text[i] == '\'' and text[i-1] in acceptedChar and nextChar == " ":
            punc_Removed_Text = punc_Removed_Text + text[i]

    frequency = 0
    splitedText = punc_Removed_Text.split(word.lower())

    # Each gap between two pieces is one occurrence; count it only when it is
    # delimited by spaces or by the start/end of the text.
    for y in range(0, len(splitedText)-1, 1):
        element = splitedText[y]
        following = splitedText[y+1]

        if len(element) == 0:
            # Nothing lies between this occurrence and the start of the text
            # (or the previous occurrence).
            if len(following) == 0 or following[0] == " ":
                frequency += 1

        elif len(following) == 0:
            # Nothing follows this occurrence before the end of the text
            # (or the next occurrence).
            if element[len(element)-1] == " ":
                frequency += 1

        elif element[len(element)-1] == " " and following[0] == " ":
            frequency += 1
    return frequency

Here is the profile:

128006 function calls in 7.831 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.831    7.831 :0(exec)
    32800    0.062    0.000    0.062    0.000 :0(len)
    11200    0.047    0.000    0.047    0.000 :0(lower)
        1    0.000    0.000    0.000    0.000 :0(print)
    72800    0.359    0.000    0.359    0.000 :0(replace)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
     5600    0.078    0.000    0.078    0.000 :0(split)
        1    0.000    0.000    7.831    7.831 <string>:1(<module>)
        1    0.000    0.000    7.831    7.831 ideone-gg.py:225(doit)
     5600    7.285    0.001    7.831    0.001 ideone-gg.py:3(CountOccurencesInText)
        1    0.000    0.000    7.831    7.831 profile:0(doit())
        0    0.000             0.000          profile:0(profiler)


2014-02-05 13:08:49 johnshumon
