How do I use rstrip to remove trailing characters?

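I am trying to loop through a bunch of documents and put the words of each document into a list for that document. I am doing it like this. stoplist is just a list of words that I want to ignore by default.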

texts = [[word for word in document.lower().split() if word not in stoplist] 
     for document in documents]

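What I get back is a list of documents, each of which is a list of words. Some words still contain trailing punctuation or other anomalies. I thought I could do this, but it doesn't seem to work correctly:

texts = [[word.rstrip() for word in document.lower().split() if word not in stoplist] 
     for document in documents]

Or

texts = [[word.rstrip('.,:!?:') for word in document.lower().split() if word not in stoplist] 
     for document in documents]

My other problem is this. I may see words like the following, where I want to keep the word but dump the trailing digits/special characters.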

agency[15] 
assignment[72], 
you&#8217;ll 
america&#8217;s

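So, to clean up most of the remaining noise, I think I should keep removing characters from the end of the string until I reach a character in a-zA-Z, or, if the string contains more special characters than alphabetic characters, toss it. You can see in my last two examples that the string ends with an alphabetic character. In those cases I should discard the word because of the number of special characters (more than the alphabetic characters). I think I should only search the end of the string, because I would like to keep hyphenated words intact if possible.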

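Basically, I want to remove all trailing punctuation from each word, and possibly add a subroutine that handles the cases I just described. I am not sure how to do that, or whether it is the best approach.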


2010-10-14 Nathan

>>> a = ['agency[15]','assignment72,','you&#8217;11','america&#8217;s'] 
>>> import re 
>>> b = re.compile(r'\w+') 
>>> for item in a: 
...  print(b.search(item).group(0)) 
... 
agency 
assignment72 
you 
america 
>>> b = re.compile(r'[a-z]+') 
>>> for item in a: 
...  print(b.search(item).group(0)) 
... 
agency 
assignment 
you 
america 
>>>
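One caveat with the loop above: `search` returns `None` when an item contains no matching characters at all (for example a bare `[15]`), and calling `.group(0)` on that raises an `AttributeError`. The following is a minimal sketch of a guard; the helper name `leading_word` and the extra `'[15]'` item are made up for illustration.

```python
import re

WORD_RE = re.compile(r'[a-z]+')  # same pattern as in the session above


def leading_word(item, default=''):
    """Return the first run of lowercase letters in item, or default if there is none."""
    match = WORD_RE.search(item)
    return match.group(0) if match else default


# The last item is a hypothetical token with no letters at all.
a = ['agency[15]', 'assignment72,', 'you&#8217;11', 'america&#8217;s', '[15]']
print([leading_word(item) for item in a])
# ['agency', 'assignment', 'you', 'america', '']
```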

Update

>>> a = "I-have-hyphens-yo!" 
>>> re.findall('[a-z]+',a) 
['have', 'hyphens', 'yo'] 
>>> re.findall('[a-z-]+',a) 
['-have-hyphens-yo'] 
>>> re.findall('[a-zA-Z-]+',a) 
['I-have-hyphens-yo'] 
>>> re.findall(r'\w+',a) 
['I', 'have', 'hyphens', 'yo'] 
>>>
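As the second result shows, `[a-z-]+` also picks up a stray leading hyphen. A slightly stricter pattern that only allows hyphens between runs of letters might look like the sketch below, applied to the list comprehension from the question. The sample `documents` and `stoplist` values are invented for illustration.

```python
import re

# One run of letters, optionally followed by more runs joined by single hyphens.
HYPHEN_WORD_RE = re.compile(r'[a-z]+(?:-[a-z]+)*')

# Hypothetical sample data, standing in for the asker's real inputs.
documents = ["Self-paced counter-intelligence training -- agency[15], yo!"]
stoplist = {'yo'}

texts = [[word for word in HYPHEN_WORD_RE.findall(document.lower())
          if word not in stoplist]
         for document in documents]
print(texts)
# [['self-paced', 'counter-intelligence', 'training', 'agency']]
```

Because the stoplist check runs on the already cleaned word, a token such as `yo!` is still filtered out.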


2010-10-14 20:43:58 Robus

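What about words with hyphens? I would like to keep those intact if possible. Examples might be self-paced, counter-intelligence, etc. – Nathan 2010-10-14 20:51:09

Updated with uppercase letters/hyphens – Robus 2010-10-14 20:55:56

This works perfectly, thanks! – Nathan 2010-10-14 21:02:31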

Maybe try re.findall instead, with a pattern like [a-z]+:

import re 
word_re = re.compile(r'[a-z]+') 
# Option 1: finditer yields match objects, so the text comes from group(0)
texts = [[match.group(0) for match in word_re.finditer(document.lower()) if match.group(0) not in stoplist] 
      for document in documents] 

# Option 2: findall returns the matched strings directly
texts = [[word for word in word_re.findall(document.lower()) if word not in stoplist] 
      for document in documents]

You can then easily adjust the regular expression to get the words you want. An alternative version uses re.split:

import re 
word_re = re.compile(r'[^a-z]+') 
texts = [[word for word in word_re.split(document.lower()) if word and word not in stoplist] 
      for document in documents]
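If you would rather stay with `rstrip`, as in the question, note that `rstrip()` with no argument only removes whitespace, which `split()` has already discarded, so it changes nothing. Passing a set of characters strips any of them from the end. Below is a rough sketch of the heuristic described in the question, under the assumption that "special characters" means anything other than letters and hyphens; the helper name `clean_word` is made up.

```python
import string

# Characters to peel off the end of a word: all ASCII punctuation and digits.
TRAILING_JUNK = string.punctuation + string.digits


def clean_word(word):
    """Strip trailing punctuation/digits; return None for mostly non-alphabetic words."""
    stripped = word.rstrip(TRAILING_JUNK)
    letters = sum(ch.isalpha() for ch in stripped)
    # Hyphens are not counted as noise, so hyphenated words survive.
    others = sum(not ch.isalpha() and ch != '-' for ch in stripped)
    if letters == 0 or others > letters:
        return None
    return stripped


words = ['agency[15]', 'assignment[72],', 'you&#8217;ll', 'america&#8217;s', 'self-paced.']
print([clean_word(w) for w in words])
# ['agency', 'assignment', None, 'america&#8217;s', 'self-paced']
```

With this exact rule `america&#8217;s` is kept, because it has 8 letters against 7 other characters; where to put the threshold is a judgment call.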


2010-10-14 20:37:34

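I got an error in the first example, "AttributeError: 'str' object has no attribute 'group'", and "UnboundLocalError: local variable 'word' referenced before assignment" in your second example. – Nathan 2010-10-14 20:54:41

Sorry, I have corrected the examples; they should run fine now. – 2010-10-15 15:57:41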
