Removing stopwords with Python - fast and efficient

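I have about 6 million documents, each of which has a fairly large set of stopwords to be removed.

The technique I learned is to remove them with a compiled re pattern. But now I am getting an OverflowError.

I process my stopwords as follows: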

states_string =r'\b(' + '|'.join(states) + r')\b' 
states_pattern = re.compile(states_string)

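states is, obviously, a list of strings such as ['NY', 'CA', ...]. I cannot paste them all here because that would go far beyond the size limit of a post!

The error I get is: OverflowError: regular expression code size limit exceeded.

Evidently the string I am compiling into a pattern is too long.

Does anyone have a suggestion for how to handle this, or for an alternative approach?

One I know of is: [word for word in words if not word in stopwords] but this iterates over every word, so it is not ideal.

Note that the stopword list has a length of 2500.

Source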

2014-05-05 redrubia

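Can you provide a simple example? –

I can provide the states example, but nothing long enough to show the number of stopwords I am using. Try the approach described here to reproduce the error: http://stackoverflow.com/questions/1998261/pythons-regular-expression-source-string-length. Putting all the stopwords together would be too long! – redrubia

How big are the files? – dawg

As I see it, you have 3 options: split the pattern into smaller regexes, use something like a Python set, or shell out (to awk or sed). Suppose you have a document full of words and a list of stopwords, and you want a document containing the words minus the stopwords.

Regex: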

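import re

stopwords_regex_list = []
chunk_size = 100  # can tweak depending on size
for i in range(0, len(stopwords), chunk_size):
    stopwords_slice = stopwords[i:i + chunk_size]
    # raw strings so that \b is a word boundary, not a backspace character;
    # re.escape protects stopwords that contain regex metacharacters
    pattern = r'\b(' + '|'.join(map(re.escape, stopwords_slice)) + r')\b'
    stopwords_regex_list.append(re.compile(pattern))

# read and write once, after all the chunked patterns have been compiled
with open('document') as doc:
    words = doc.read()  # can read only a certain size if the files are massive
with open('regex_document', 'w') as regex_doc:
    for regex in stopwords_regex_list:
        words = regex.sub('', words)
    regex_doc.write(words)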

Set:

stopwords_set = set(stopwords)
with open('document') as doc:
    words = doc.read()
with open('set_document', 'w') as set_doc:
    for word in words.split(' '):
        if word not in stopwords_set:
            set_doc.write(word + ' ')
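
For a quick check that does not touch any files, the same idea can be sketched on an in-memory string. The function name `remove_stopwords` and the sample data are made up for illustration; the sketch assumes that words are separated by whitespace and that matching is case-sensitive.

```python
def remove_stopwords(text, stopwords):
    """Return text without the whitespace-separated tokens found in stopwords."""
    stopword_set = set(stopwords)  # set membership is O(1) on average
    kept = [word for word in text.split() if word not in stopword_set]
    return ' '.join(kept)


sample = 'the cat sat by the door'
print(remove_stopwords(sample, ['the', 'by']))
```

Each token costs one hash lookup, so the work grows with the length of the document rather than with the size of the stopword list.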

Sed:

import subprocess

with open('document') as doc:
    with open('sed_script', 'w') as sed_script:
        # one substitution per stopword; \< and \> are GNU sed word boundaries
        sed_script.writelines([r's/\<{}\>//g'.format(word) + '\n' for word in stopwords])
    with open('sed_document', 'w') as sed_doc:
        subprocess.call(['sed', '-f', 'sed_script'], stdout=sed_doc, stdin=doc)

I am not a sed expert, so there may well be a better way to do this. You may need to code up each approach and see which one works best for you.
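
One further variation, which combines the regex and set options and is not part of the three listed above, is to keep the regex tiny and let a set decide what to drop. This is only a sketch: `strip_stopwords` and `WORD_RE` are illustrative names, and it assumes that every stopword is a single `\w+` token matched case-sensitively.

```python
import re

WORD_RE = re.compile(r'\w+')  # small fixed pattern, independent of the stopword count


def strip_stopwords(text, stopwords):
    """Replace every word that appears in stopwords with an empty string."""
    stopword_set = set(stopwords)

    def replace(match):
        word = match.group(0)
        return '' if word in stopword_set else word

    return WORD_RE.sub(replace, text)


print(strip_stopwords('Flights from NY to CA, daily.', ['NY', 'CA']))
```

Because the pattern never grows with the stopword list, the code size limit should not come into play, and like the regex version it leaves the surrounding whitespace and punctuation in place.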

Source

2014-05-06 04:55:10

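Thanks for your answer. Since I was ultimately looking at word frequencies, I ended up doing something slightly different: I called FreqDist from nltk on the list of texts, and then removed the words that count as stopwords from the resulting dictionary. FreqDist is quite fast, and removing the words after it has been created means I do not have to check every word against the long list of stopwords. However, since I remove stopwords often, your suggestions are great – redrubia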
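
The approach described in this comment might look roughly like the following. To stay self-contained, the sketch uses `collections.Counter` from the standard library as a stand-in for nltk's `FreqDist` (an assumption: the same count-then-delete idea should carry over). `word_frequencies` and the sample data are invented for illustration.

```python
from collections import Counter


def word_frequencies(tokens, stopwords):
    """Count all tokens first, then drop the stopword entries from the counts."""
    counts = Counter(tokens)
    for word in stopwords:
        counts.pop(word, None)  # one dict operation per stopword, not per token
    return counts


tokens = 'the cat and the dog and the bird'.split()
freqs = word_frequencies(tokens, ['the', 'and'])
print(sorted(freqs.items()))
```

The deletion step touches each stopword once, so its cost depends on the length of the stopword list and not on the number of tokens in the corpus.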

I ran the following, and it worked fine:

>>> states = ['AL', 'AK', 'AS', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FM', 'FL', 'GA', 'GU', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MH', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'MP', 'OH', 'OK', 'OR', 'PW', 'PA', 'PR', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VI', 'VA', 'WA', 'WV', 'WI', 'WY', 'AE', 'AA', 'AP'] 
>>> states_string = r'\b(' + '|'.join(states) + r')\b' 
>>> states_pattern = re.compile(states_string) 
>>> states_pattern 
<_sre.SRE_Pattern object at 0x00000000034D3C40>

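This is the best I can do with the information you have given so far. Please post the entire array in your question; otherwise we cannot know whether you used anything other than this 50-state-code array to generate the list.

PS: credit where credit is due: the array I used here is largely based on this gist comment.

Source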

2014-05-06 01:05:09 Joeytje50

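Tried to create a gist of the whole array, but it is too big and it won't post! – redrubia

Just tried adding it to the post, no luck! Still too big – redrubia

@redrubia I think I have an impression of the size of the array you are using. I think Raymond Hettinger's answer already covers your question well. Thanks for at least trying. – Joeytje50

This appears to be a hard limit enforced by Python's regular expression engine: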

~/py27 $ ack -C3 'regular expression code size' 
Modules/_sre.c 
2756-  if (value == (unsigned long)-1 && PyErr_Occurred()) { 
2757-   if (PyErr_ExceptionMatches(PyExc_OverflowError)) { 
2758-    PyErr_SetString(PyExc_OverflowError, 
2759:        "regular expression code size limit exceeded"); 
2760-   } 
2761-   break; 
2762-  } 
2763-  self->code[i] = (SRE_CODE) value; 
2764-  if ((unsigned long) self->code[i] != value) { 
2765-   PyErr_SetString(PyExc_OverflowError, 
2766:       "regular expression code size limit exceeded"); 
2767-   break; 
2768-  } 
2769- }

To work around the limit, you may need an alternative engine. I recommend using Python to generate a sed script. Here is a rough idea to get you started:

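stopwords = '''
the an of by
for but is why'''.split()

# emit one sed command per stopword; print() with a single argument
# behaves the same on Python 2 and Python 3
print('#!/bin/sed -f')
for word in stopwords:
    print('/%s/ d' % word)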
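
Note that the `d` command deletes every line that matches the pattern. If only the words themselves should be removed, a substitution command can be generated instead. A sketch, assuming GNU sed (for the `\<` and `\>` word-boundary anchors) and stopwords that contain no regex metacharacters or `/`; `build_sed_script` is a made-up helper name.

```python
def build_sed_script(stopwords):
    """Return sed script text with one word-level substitution per stopword."""
    lines = ['#!/bin/sed -f']
    for word in stopwords:
        # \< and \> are GNU sed word-boundary anchors (assumed to be available)
        lines.append(r's/\<%s\>//g' % word)
    return '\n'.join(lines) + '\n'


print(build_sed_script(['the', 'an', 'of']))
```

Saved to a file, the result could then be applied with `sed -f` followed by the script file and the document.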

Source

2014-05-06 01:10:56
