Python: removing odd apostrophes and other odd characters that are not in string.punctuation

This is my string:

mystring = "How’s it going?"

This is what I have done:

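import string
exclude = set(string.punctuation)

def strip_punctuations(mystring):
    # Drop every ASCII punctuation character listed in string.punctuation
    new_string = ''.join(ch for ch in mystring if ch not in exclude)
    # Manually remove the UTF-8 bytes of the right single quotation mark (U+2019)
    new_string = new_string.replace("\xe2\x80\x99", "")
    # Manually remove doubled non-breaking spaces (U+00A0, UTF-8 encoded)
    new_string = new_string.replace("\xc2\xa0\xc2\xa0", "")
    return new_string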

Output:

If I leave out the line that replaces "\xe2\x80\x99", this is the output:

'How\xe2\x80\x99s it going'

I realized that exclude does not contain the odd apostrophe:

print set(exclude) 
set(['!', '#', '"', '%', '$', "'", '&', ')', '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', '<', '?', '>', '@', '[', ']', '\\', '_', '^', '`', '{', '}', '|', '~'])

How can I make sure all such characters are removed, without having to replace each one manually in the future?

Source

2016-06-21 jxn

Python 2, I assume? –

Yep, Python 2.7. – jxn

You should not be handling these strings as UTF-8 byte strings. Decode them first. – Daniel
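To illustrate the comment above, here is a minimal sketch of the decode-first approach. It assumes the input arrives as UTF-8 bytes, and the function name strip_unicode_punctuation is made up for this example. After decoding, it keeps only the characters whose Unicode category is neither punctuation ("P...") nor symbol ("S..."); the symbol categories are included because several entries of string.punctuation, such as $ and +, are classified as symbols rather than punctuation.

```python
import unicodedata


def strip_unicode_punctuation(raw_bytes, encoding="utf-8"):
    """Decode bytes to text, then drop punctuation and symbol characters.

    Hypothetical helper for illustration; assumes the input is UTF-8 bytes.
    """
    text = raw_bytes.decode(encoding)
    # unicodedata.category returns codes such as "Po", "Pf", "Sm", "Lu".
    # Keep a character only if its category starts with neither "P" nor "S".
    return u"".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "S"))
    )


print(strip_unicode_punctuation(b"How\xe2\x80\x99s it going?"))  # Hows it going
```

Non-breaking spaces (U+00A0) belong to the separator category "Zs", so this sketch leaves them in place; they would need separate handling.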

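If you are processing long texts such as news articles or scraped web pages, you can use the "goose" or "NLTK" Python libraries. Neither comes preinstalled. Here are the libraries: goose, NLTK

You can browse their documentation to learn how to use them.

If you do not want to use these libraries, you may need to build your own "exclude" list by hand.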
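A hand-built exclude list could look roughly like the sketch below. The set of extra characters (curly quotes and the non-breaking space) is only an assumption about what might show up in the text, and the names EXTRA_CHARS and strip_custom are invented for the example. It works on already-decoded text and uses the dictionary form of translate, where mapping a code point to None deletes it.

```python
import string

# Assumed extras: curly single/double quotes and the non-breaking space.
# Extend this string with whatever other characters appear in your data.
EXTRA_CHARS = u"\u2018\u2019\u201c\u201d\u00a0"

# Map every unwanted code point to None so that translate() deletes it.
DELETE_TABLE = {ord(ch): None for ch in string.punctuation + EXTRA_CHARS}


def strip_custom(text):
    """Remove ASCII punctuation plus the hand-picked extra characters."""
    return text.translate(DELETE_TABLE)


print(strip_custom(u"How\u2019s it going?"))  # Hows it going
```

The drawback is the one raised in the question: any character missing from EXTRA_CHARS passes through untouched.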

Source

2016-06-21 17:19:11

import re 

toReplace = "how's it going?" 
regex = re.compile('[!#%$\"&)\'(+*-/.;:=<?>@\[\]_^`\{\}|~"\\\\"]') 
newVal = regex.sub('', toReplace) 
print(newVal)

The regular expression matches every character in the set and replaces it with an empty string.
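The character class above lists only ASCII punctuation, so the curly apostrophe from the question would still get through. One possible variation, sketched here under the assumption that the text has already been decoded to Unicode, is to invert the class and remove everything that is neither a word character nor whitespace. The name strip_non_word is made up for the example.

```python
import re

# Match any character that is neither a word character (\w) nor whitespace (\s).
# re.UNICODE makes \w and \s Unicode-aware on Python 2; it is the default
# behaviour for text patterns on Python 3.
NON_WORD = re.compile(r"[^\w\s]", re.UNICODE)


def strip_non_word(text):
    """Remove every character that is not a letter, digit, underscore or whitespace."""
    return NON_WORD.sub(u"", text)


print(strip_non_word(u"How\u2019s it going?"))  # Hows it going
```

Note that \w includes the underscore and that \s matches the non-breaking space, so both survive this pattern.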

Source

2016-06-21 17:32:04 Brunaldo
