How can I preserve repeated punctuation marks when parsing a Python string?

I need to process a small amount of text (i.e. strings in Python).

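I want to remove certain punctuation (such as '.', ',', ':', ';') but keep punctuation that signals emotion (such as '...', '?', '??', '???', '!', '!!', '!!!').

I also want to remove uninformative words such as 'a', 'an', 'the'. The biggest challenge so far is how to parse "I've" or "we've" so that I end up with "I have" and "we have". The apostrophe is what makes this difficult for me.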

What is the best/simplest way to do this in Python?

For example:

"I've got an A mark!!! Such a relief... I should've partied more."

The result I want:

['I', 'have', 'got', 'A', 'mark', '!!!', 'Such', 'relief', '...', 'I', 'should', 'have', 'partied', 'more']


2016-02-12 Oleksandra

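What have you *tried* in order to do this? –

Yes! I have tried several regular expressions, but each one achieves one goal or another, never all of them. – Oleksandra

Then post them and explain what goes wrong; maybe someone can help fix them. –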

This can get complicated, depending on how many rules you want to apply.

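You can use \b in a regular expression to match the start or end of a word. With it you can also isolate punctuation from the surrounding words and then check whether an isolated token is a single character from a list such as [.;:].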
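To illustrate the \b idea on its own, here is a minimal sketch. The sample string and the variable names are made up for this illustration and are not part of the question.

```python
import re

# Hypothetical sample text, chosen to contain a lone ':' and repeated marks.
sample = "Wait: it worked!!! Such a relief..."

# Insert a space at every word boundary, so runs of punctuation
# become separate whitespace-delimited tokens.
spaced = re.sub(r"\b", " ", sample)
tokens = spaced.split()

# Drop only tokens that consist of exactly one character from [.;:].
kept = [t for t in tokens if not re.fullmatch(r"[.;:]", t)]

print(tokens)
print(kept)
```

The lone ':' is dropped because it is a single character from the list, while '!!!' and '...' survive because they are longer than one character.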

These ideas are used in this code:

import re

def tokenise(txt):
    # Expand the contraction "'ve" (e.g. "I've" -> "I have")
    txt = re.sub(r"(?i)(\w)'ve\b", r'\1 have', txt)
    # Separate punctuation from words by adding a space at each word boundary
    txt = re.sub(r'\b', ' ', txt)
    # Remove isolated, single-character punctuation,
    # and articles (a, an, the); a capital "A" is not matched, so it is kept
    txt = re.sub(r'(^|\s)([.;:]|[Aa]n|a|[Tt]he)($|\s)', r'\1\3', txt)
    # Split on whitespace; str.split() skips empty strings and returns a list
    return txt.split()

# Example use
txt = "I've got an A mark!!! Such a relief... I should've partied more."
words = tokenise(txt)
print(','.join(words))

Output:

I,have,got,A,mark,!!!,Such,relief,...,I,should,have,partied,more

See it run on eval.in
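As a possible variation (not the code above, just a sketch under my own assumptions), the same rules can be applied by extracting tokens with re.findall and filtering them against small lookup sets. The names tokenise_findall, DROP_PUNCT and ARTICLES are hypothetical, and the comma is added to the dropped punctuation because the question lists it.

```python
import re

# Hypothetical rule sets for this sketch.
DROP_PUNCT = {".", ",", ":", ";"}             # single marks to discard
ARTICLES = {"a", "an", "An", "the", "The"}    # capital "A" is kept on purpose

def tokenise_findall(txt):
    # Expand "'ve" as before (e.g. "should've" -> "should have").
    txt = re.sub(r"(?i)(\w)'ve\b", r"\1 have", txt)
    # A token is either a run of word characters or a run of
    # characters that are neither word characters nor whitespace.
    tokens = re.findall(r"\w+|[^\w\s]+", txt)
    # Keep everything that is not a dropped mark or an article.
    return [t for t in tokens if t not in DROP_PUNCT and t not in ARTICLES]

example = "I've got an A mark!!! Such a relief... I should've partied more."
print(tokenise_findall(example))
```

Because each token is tested individually, neighbouring removable tokens such as ". ;" are all dropped. One limitation of this sketch: other contractions (for example "don't") are split at the apostrophe, and mixed runs such as "?!" stay together as one token.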


2016-02-12 20:43:00 trincot
