2014-03-29 71 views
0

Python: cleaning up single characters

I want to remove all single-character words from a text.

For example, I want to remove all of the bold characters in the text below (a, ?, d, *, etc.) and return the cleaned text.

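Lorem Ipsum is simply a dummy text | of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it d to make * type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.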

+1

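What about punctuation before or after a character, as in 'end of sentence.a start of a new one'? And what should happen to the whitespace around the character?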

+0

Every single-character word has whitespace both before and after it.

+0

But when you remove a single character, should the whitespace around it be removed as well?

Answers

1

Use a regular expression:

re.sub(r'((?:^|(?<=\s))\S\s|\s\S(?:$|(?=\s)))', '', inputtext) 

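This removes either of two things: a single non-whitespace character that sits at the start of the text or is preceded by whitespace, together with the whitespace character that follows it; or a whitespace character followed by a single non-whitespace character that sits at the end of the text or is followed by whitespace.

In both cases one whitespace character next to the single character is removed along with it, so no doubled spaces are left behind.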
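To make the two alternatives easier to read, here is a minimal sketch of the same pattern written with `re.VERBOSE`. The names `SINGLE_CHAR` and `strip_single_chars` are made up for this illustration; the pattern itself is unchanged from the one-liner above.

```python
import re

# Hypothetical names; the pattern is the one-liner above, spelled out.
SINGLE_CHAR = re.compile(r"""
    (?:^|(?<=\s))   # at the start of the text, or just after whitespace
    \S\s            # one non-whitespace character plus the whitespace after it
  |
    \s\S            # one whitespace character plus one non-whitespace character
    (?:$|(?=\s))    # at the end of the text, or just before whitespace
""", re.VERBOSE)


def strip_single_chars(text):
    """Remove single-character words along with one adjacent whitespace character."""
    return SINGLE_CHAR.sub("", text)


print(strip_single_chars("scrambled it d to make * type specimen book"))
# prints: scrambled it to make type specimen book
```

Note that a text consisting of nothing but one character (for example `"a"`) has no adjacent whitespace, so neither alternative matches and it is returned unchanged.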

Demo:

>>> import re 
>>> inputtext = '''\
... Lorem Ipsum is simply a dummy ? text | of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it d to make * type specimen book. It has survived not only five centuries, but also the leap into [ electronic typesetting, remaining essentially unchanged. 
... ''' 
>>> re.sub(r'((?:^|(?<=\s))\S\s|\s\S(?:$|(?=\s)))', '', inputtext) 
"Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took galley of type and scrambled it to make type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.\n" 
+0

Thanks, Martijn Pieters. This solved my problem.
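For comparison, if preserving the original whitespace does not matter, a regex-free variant is also possible. This is a different technique from the answer above, sketched here under that assumption; `drop_single_char_tokens` is a made-up name.

```python
def drop_single_char_tokens(text):
    """Keep only whitespace-separated tokens longer than one character."""
    # str.split() with no arguments splits on any run of whitespace,
    # so the result is re-joined with single spaces.
    return " ".join(word for word in text.split() if len(word) > 1)


print(drop_single_char_tokens("Lorem Ipsum is simply a dummy ? text | of the printing"))
# prints: Lorem Ipsum is simply dummy text of the printing
```

Unlike the `re.sub` call, this collapses runs of whitespace into single spaces and drops leading and trailing whitespace, including a trailing newline.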