Regex construction to get sentences from text - Python

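A sentence is a sequence of characters that:

is terminated by (but does not include) one of the characters ! ? . or the end of the file,
excludes whitespace at both ends, and
is not empty.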

I have a file containing the following text:

this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n

By the definition above, it contains four "sentences":

Sentence 1: this is the\nfirst sentence
Sentence 2: Isn't\nit
Sentence 3: Yes
Sentence 4: This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file

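Notes:

The sentences do not include their terminators.
The last sentence is not terminated by a character; it ends at the end of the file.
Sentences can span multiple lines of the file.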

This is what I have so far: (.*\n+), and I don't know how to improve it.
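For illustration, here is a small sketch of what that pattern returns on the sample text. It assumes the file content has been read into a string named `s` (a name chosen here for the example). Because `.` does not match a newline by default, each match is one physical line rather than one sentence.

```python
import re

# Sample text from the question, written as a Python string literal.
s = ("this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\n"
     "last bit :) is also a sentence, but \n"
     "without a terminator other than the end of the file\n")

# '.' does not match '\n' unless re.DOTALL is set, so every match is a
# single line of text plus the newline(s) that follow it.
chunks = re.findall(r"(.*\n+)", s)
for chunk in chunks:
    print(repr(chunk))
```

The pattern never looks at `!`, `?` or `.` as terminators, so it splits on line breaks instead of sentence boundaries.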

Please help me with a regular expression that parses the text above and returns a list. Thanks in advance for your help.

Source

2017-03-07 Tunji

Something like https://regex101.com/r/dXXyTt/2 –

Do you need to use a regex? 'nltk' has a solid sentence tokenizer built in. –

I only read about nltk today, so it is new to me. I will look into it more, but a regex will do for now. Thanks Wiktor, it works – Tunji

The following will not work for every input, but it works for your specific input. You can tweak the expression further:

([^!?.]+)[!?.\s]*(?![!?.])

See the regex demo.

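Details:

([^!?.]+) - capturing group 1: one or more characters other than !, ? and .
[!?.\s]* - zero or more !, ?, . or whitespace characters
(?![!?.]) - a negative lookahead: not followed by !, ? or .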
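To make the breakdown above concrete, here is a minimal sketch that prints, for each match, the text captured by group 1 and the run of terminators and whitespace consumed after it. The short input string and the name `parts` are made up for this illustration.

```python
import re

rx = re.compile(r"([^!?.]+)[!?.\s]*(?![!?.])")
s = "Yes ! !! This is last"  # hypothetical short input

parts = []
for m in rx.finditer(s):
    sentence = m.group(1)                 # text matched by ([^!?.]+)
    consumed = m.group(0)[len(sentence):]  # what [!?.\s]* swallowed
    parts.append((sentence, consumed))

for sentence, consumed in parts:
    print(repr(sentence), repr(consumed))
```

Note that group 1 keeps any whitespace that comes before the first terminator, which is why the first capture ends with a space.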

In Python, use it with re.findall, which returns only the strings captured by the capturing group:

import re
rx = r"([^!?.]+)[!?.\s]*(?![!?.])"
s = "this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n"
sents = re.findall(rx, s)
print(sents)
# => ['this is the\nfirst sentence',
#     "Isn't\nit",
#     'Yes ',
#     'This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n'
#    ]

See the Python demo.
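The captures above still carry trailing whitespace ('Yes ' and the final '\n'), while the question asks for sentences with whitespace excluded at both ends. One possible follow-up, sketched here as an assumption rather than as part of the original answer, is to strip each capture and drop any that become empty. The name `sentences` is chosen for this example.

```python
import re

rx = r"([^!?.]+)[!?.\s]*(?![!?.])"
s = ("this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\n"
     "last bit :) is also a sentence, but \n"
     "without a terminator other than the end of the file\n")

# Trim whitespace at both ends and keep only non-empty results.
# Newlines inside a sentence are left untouched.
sentences = [m.strip() for m in re.findall(rx, s) if m.strip()]
for sentence in sentences:
    print(repr(sentence))
```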

Source

2017-03-07 19:45:39

Try this:

re.split('(\!\s\!+)|\.|\?',s) 
['this is the\nfirst sentence', " Isn't\nit", ' Yes ', ' This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n']

Source

2017-03-07 20:04:50

Hi, I ran it and it returned seven sentences – Tunji
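A likely reason for the seven items: when the pattern passed to re.split contains a capturing group, the text matched by that group is returned as well, and None appears wherever the group did not take part in the split. The sketch below reproduces this and shows one non-capturing alternative. It assumes `s` holds the sample text, and the names `with_group` and `sentences` are chosen for this example.

```python
import re

s = ("this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\n"
     "last bit :) is also a sentence, but \n"
     "without a terminator other than the end of the file\n")

# The capturing group (!\s!+) makes re.split return the group's text
# (or None) between the pieces, so the list has more than four items.
with_group = re.split(r"(!\s!+)|\.|\?", s)
print(len(with_group))

# Alternative without a capturing group: split on runs of terminators,
# then trim whitespace and discard empty pieces.
sentences = [p.strip() for p in re.split(r"[!?.]+", s) if p.strip()]
for sentence in sentences:
    print(repr(sentence))
```

The second form also handles the whitespace-only piece between "!" and "!!", because it is empty after stripping and gets filtered out.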
