Removing certain characters from a long string in Python

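I'm working on a project that takes the source code of a web page and boils it down to just the words displayed on the page. I can already strip all the HTML tags and everything between script tags, but I can't figure out how to remove all the characters that start with a backslash. A page will contain \t, \n, and \x**, where each * seems to be any lowercase letter or digit.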

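How would I write code that replaces all of these parts of the string with a space? I'm working in Python.

For example, this is the string from a web page: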

\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0

It should become:

Apple - Wikipedia, the free encyclopedia Language:English sAsturianuAz rbaycanca Basa Banyumasan

2012-06-09 fnsjdnfksjdb

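Could you explain what you mean? – varunl

Please post a short example with the expected output. –

If it is specifically Wikipedia content you are interested in, you would be better off using the database dumps that Wikipedia provides: https://en.wikipedia.org/wiki/Wikipedia:Database_download –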

s = repr('''\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0''') 
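import re  # required for re.sub; missing from the original snippet
# Assumes Python 2 byte-string semantics, where repr() spells non-ASCII bytes as literal \xNN text.
s = re.sub(r'\\[tn]', '', s)          # drop literal \t and \n escape sequences
s = re.sub(r'\\x[0-9a-f]{2}', '', s)  # drop literal \xNN escape sequences
print(s)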

2012-06-09 20:07:30

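Why would you want to know the number of substitutions in this case? That is the only difference between sub and subn –

This looks close, but when I type it in, the string doesn't close after r'\\[tn]', and the r also turns the same color as the string. – fnsjdnfksjdb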

Write a regex that matches all the required patterns, then replace them with a single space.
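
One way to read this suggestion is sketched below. It is not the answerer's own code, and it assumes the page text holds the escape sequences as literal characters (a backslash followed by t, n, or x and two hex digits), as in the question's sample. The names ESCAPES and strip_escapes are made up for illustration.

```python
import re

# One pattern for every unwanted sequence, matched as literal text:
# backslash + t or n, or backslash + x + two hex digits.
# The trailing + turns a whole run of them into a single match.
ESCAPES = re.compile(r'(?:\\[tn]|\\x[0-9a-fA-F]{2})+')

def strip_escapes(text):
    """Replace each run of literal escape sequences with one space."""
    return ESCAPES.sub(' ', text).strip()

# Raw string literal, so the backslashes stay in the text as characters.
sample = r'\n\t\tApple - Wikipedia\n\t\tLanguage:English\xd8\xa7Aragon\xc3\xa9s'
print(strip_escapes(sample))
```

Because a run of adjacent escapes is matched as one unit, no second pass is needed to collapse repeated spaces.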

2012-06-09 19:56:53 varunl

Assuming that the plain-text words contain at least three characters:

' '.join(re.findall(r'\w{3,}', s)) # where s represents the string

Or:

' '.join(re.findall(r'(?:\w{3,}|-(?=\s))', s)) # in order to preserve the dash char
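
A small runnable sketch of both patterns follows, under two assumptions that are not in the answer: the code runs on Python 3, and re.ASCII is passed so that \w matches only ASCII letters, digits, and underscore (otherwise characters such as Ø count as word characters). The shortened sample string is illustrative.

```python
import re

# Shortened sample with real control characters and two non-ASCII characters.
s = "\n\t\tApple - Wikipedia, the free encyclopedia\n\tLanguage:English\xd8\xa7"

# Keep only runs of three or more ASCII word characters.
words = ' '.join(re.findall(r'\w{3,}', s, flags=re.ASCII))

# Same, but also keep a dash that is followed by whitespace.
with_dash = ' '.join(re.findall(r'(?:\w{3,}|-(?=\s))', s, flags=re.ASCII))

print(words)
print(with_dash)
```

Note that this approach also drops punctuation such as the comma and colon, along with any word shorter than three characters.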

2012-06-09 20:25:47 Vidul

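Wikipedia uses UTF-8 encoded strings. To convert to plain ASCII, you have to:

convert from UTF-8 to Unicode
convert from Unicode to ASCII, replacing the unencodable characters
convert the replacement characters to spaces
collapse runs of whitespace (tabs, newlines, etc.) into a single space
strip leading and trailing whitespace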

s = "\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba" 

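import re
# Raw string avoids the invalid-escape warning for \s on Python 3.
whitespaces = re.compile(r'\s+')
def utf8_to_ascii(s, ws=whitespaces):
    s = s.encode("utf8")                     # str -> bytes; each non-ASCII character becomes bytes >= 0x80
    s = s.decode("ascii", errors="replace")  # each non-ASCII byte becomes U+FFFD
    s = s.replace(u"\ufffd", " ")            # replacement characters -> spaces
    s = ws.sub(" ", s)                       # collapse whitespace runs into one space
    return s.strip()                         # drop leading and trailing whitespace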

s = utf8_to_ascii(s)

This finally results in the string

Apple - Wikipedia, the free encyclopedia Language:English Aragon sAsturianuAz rbaycanca B n-l m-g Basa Banyumasan
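
If the page content is available as real bytes rather than as a str, the listed steps can be followed more literally. The sketch below is a variation under that assumption, not the answerer's code; the function name utf8_bytes_to_ascii and the short sample are hypothetical.

```python
import re

def utf8_bytes_to_ascii(raw):
    """Decode UTF-8 bytes, then blank out everything outside ASCII."""
    # UTF-8 -> Unicode; invalid byte sequences become U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # Every non-ASCII character (including U+FFFD) -> space.
    text = re.sub(r"[^\x00-\x7f]", " ", text)
    # Collapse tabs, newlines, and repeated spaces into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = b"\n\tLanguage:English\nAragon\xc3\xa9s Az\xc9\x99rbaycanca\n"
print(utf8_bytes_to_ascii(raw))
```

Decoding first means each accented letter is replaced by one space rather than one per byte, though after whitespace collapsing the visible result is the same.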

2012-06-09 21:08:34

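Assuming a default ASCII encoding, we can do this quite nicely in one line, without evil regular expressions ;), by iterating over the string and dropping characters based on their code point, using ord(i) < 128 or whatever criterion we choose: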

>>> ' '.join(''.join([i if ord(i) < 128 else ' ' for i in mystring]).split()) 
#Output: 
Apple - Wikipedia, the free encyclopedia Language:English Aragon sAsturianuAz rbaycanca B n-l m-g Basa Banyumasan

Or we can specify the allowed characters and test membership with the in operator, for example using the built-in string.ascii_letters:

>>> import string 
>>> ' '.join(''.join([i if i in string.ascii_letters else ' ' for i in mystring]).split()) 
#Output: 
Apple Wikipedia the free encyclopedia Language English Aragon sAsturianuAz rbaycanca B n l m g Basa Banyumasan

This also removes punctuation (but we can easily avoid that by adding those characters back to our check string if we want, e.g. check = string.ascii_letters + ',.-:').
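
As a sketch of that last variation, the check string can be passed as a parameter. The function name keep_allowed and the shortened sample are illustrative and not part of the answer.

```python
import string

def keep_allowed(text, allowed=string.ascii_letters + ',.-:'):
    """Replace every character not in `allowed` with a space, then
    collapse the resulting whitespace runs via split/join."""
    return ' '.join(''.join(ch if ch in allowed else ' ' for ch in text).split())

mystring = "\n\tApple - Wikipedia, the free encyclopedia\n\tLanguage:English\xd8\xa7Aragon\xc3\xa9s"
print(keep_allowed(mystring))
```

Digits are not in this check string, so they are dropped too; append string.digits to the allowed characters if they should be kept.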

2012-06-10 09:37:13 fraxel
