How do I remove special characters, except spaces, from a file in Python?

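I have a huge text corpus (one entry per line). I want to remove the special characters while keeping the spaces and the line structure of the text. How can I remove special characters, except spaces, from a file in Python?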

hello? there A-Z-R_T(,**), world, welcome to python. 
this **should? the next line#followed- [email protected] an#other %million^ %%like $this.

should become

hello there A Z R T world welcome to python 
this should be the next line followed by another million like this

Source

2017-04-12 pythonlearn

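Just create a list of the characters you want to keep (A-Z, a-z, 0-9, and so on). Then use a for loop to go through every character in the string and replace any character that is not in the list with a space. – Wright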
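The loop-based approach described in this comment could be sketched as follows. The names `ALLOWED` and `keep_allowed` are hypothetical, and collapsing the repeated spaces is an added assumption so that the result resembles the desired output.

```python
import string

# Characters to keep: ASCII letters, digits, and the space itself.
ALLOWED = set(string.ascii_letters + string.digits + " ")

def keep_allowed(line):
    """Replace every character outside ALLOWED with a space."""
    replaced = "".join(ch if ch in ALLOWED else " " for ch in line)
    # Collapse the runs of spaces left behind by adjacent special characters.
    return " ".join(replaced.split())

print(keep_allowed("hello? there A-Z-R_T(,**), world, welcome to python."))
# hello there A Z R T world welcome to python
```

A set is used for `ALLOWED` because membership tests on a set take constant time, which matters when every character of a large corpus is checked.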

Will that be efficient for a huge corpus with millions of lines of text? – pythonlearn

You can use this pattern with a regex:

import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python. 
this **should? the next line#followed- [email protected] an#other %million^ %%like $this.''' 

for k in a.split("\n"): 
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k)) 
    # Or: 
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k)) 
    # print(final)

Output:

hello there A Z R T world welcome to python 
this should the next line followed by an other million like this

Edit:

Alternatively, you can store the resulting lines in a list:

final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")] 
print(final)

Output:

['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
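For a corpus too large to hold in memory, the same substitution could be applied while streaming the input one line at a time. This is a sketch rather than part of the answer above: `clean_lines` and `NON_ALNUM` are hypothetical names, and stripping the trailing space is an added choice.

```python
import re

# Compile once and reuse the same pattern object for every line.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")

def clean_lines(lines):
    """Yield each input line with runs of non-alphanumerics replaced by one space."""
    for line in lines:
        yield NON_ALNUM.sub(" ", line).strip()

sample = ["hello? there A-Z-R_T(,**), world, welcome to python.\n",
          "an#other %million^ %%like $this.\n"]
for cleaned in clean_lines(sample):
    print(cleaned)
# hello there A Z R T world welcome to python
# an other million like this
```

Because a file object iterates line by line, passing an open file (for example a hypothetical `corpus.txt`) to `clean_lines` processes it without loading the whole file, and writing each result followed by a newline keeps one cleaned line per input line.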

Source

2017-04-12 01:57:25

This does the job, but how do I stop it from putting all the lines into one single, very long line? – pythonlearn

I have updated my answer. Check it now. –

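I think NFN Neil's answer is great... but I would like to add a simpler regex that removes all non-word characters. Note, however, that it treats the underscore as part of a word: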

import re
# string holds the first line of the sample input
print(re.sub(r'\W+', ' ', string))
# hello there A Z R_T world welcome to python
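If the underscore should be removed as well, one option is to add it to the character class explicitly. A small sketch, assuming Python 3, where `\W` is Unicode-aware for `str` patterns and therefore keeps accented letters that `[^a-zA-Z0-9]` would drop; the variable `line` is illustrative.

```python
import re

line = "hello? there A-Z-R_T(,**), world, welcome to python."

# \W matches any non-word character; adding _ to the class removes underscores too.
print(re.sub(r"[\W_]+", " ", line).strip())
# hello there A Z R T world welcome to python
```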

Source

2017-04-12 02:02:31 Eliethesaiyan

Create a dictionary that maps the special characters to None:

d = {ord(c): None for c in special_characters}  # str.translate looks up code points, so key by ord(c)

Use the dictionary as a translation table. Read the whole text into a variable and call str.translate on the entire text.
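A runnable sketch of the translation-table idea, assuming Python 3. The contents of `special_characters` are an assumption (here `string.punctuation`), and this variant maps each character to a space rather than to `None`, so that words separated only by punctuation are not joined together.

```python
import string

# Assumed set of special characters: ASCII punctuation.
special_characters = string.punctuation

# str.maketrans builds a table keyed by code point, which str.translate expects.
table = str.maketrans(special_characters, " " * len(special_characters))

text = "hello? there A-Z-R_T(,**), world, welcome to python."
translated = text.translate(table)
# Collapse the repeated spaces left by adjacent punctuation.
print(" ".join(translated.split()))
# hello there A Z R T world welcome to python
```

Note that `" ".join(translated.split())` would also merge lines if applied to a multi-line string, so for a corpus it is safer to apply that last step per line.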

Source

2017-04-12 03:08:40 wwii
