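Regular expression to find all sentences of a text?

I've been teaching myself regular expressions in Python, and I decided to print out all the sentences of a text. I've been tinkering with the regular expression for the past 3 hours, to no avail.

I just tried the following, but it doesn't do anything.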

p = open('anan.txt') 
process = p.read() 
regexMatch = re.findall('^[A-Z].+\s+[.!?]$',process,re.I) 
print regexMatch 
p.close()

My input file looks like this:

OMG is this a question ! Is this a sentence ? My. 
name is.

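This prints no output. But when I remove "My. name is.", it prints "OMG is this a question" and "Is this a sentence" together, as if it only reads the first line.

What is the best regular expression that can find all the sentences in a text file, regardless of whether a sentence carries over onto a new line, and that also reads the entire text? Thanks.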


2010-08-23 sarevok

Maybe this can help: http://stackoverflow.com/questions/587345/python-regular-expression-matching-a-multiline-block-of-text – Arslan 2010-08-23 15:46:09

I can't believe no one has chimed in with this: reliable sentence boundary detection is definitely not possible with a regular expression. That holds even for sophisticated tools such as the Natural Language Toolkit's nltk.tokenize.sent_tokenize (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize-module.html). – twneale 2010-08-23 16:38:26
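
To make that caveat concrete, here is a small sketch of a typical failure. The sample text and the pattern are illustrative assumptions, not taken from the question; the point is only that a period inside an abbreviation looks exactly like a sentence end to a regex.

```python
import re

# Illustrative input (an assumption, not from the thread): abbreviations contain periods.
text = "Dr. Smith arrived at 3 p.m. He left."

# A simple "uppercase letter ... terminator" pattern, similar in spirit to those in the answers.
pattern = re.compile(r'[A-Z][^.!?]*[.!?]')

# The abbreviations "Dr." and "p.m." are cut as if they ended sentences.
fragments = pattern.findall(text)
print(fragments)  # ['Dr.', 'Smith arrived at 3 p.', 'He left.']
```

For text like this, a regex is at best a rough first pass.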

Something like this works:

>>> import re
## pattern: Uppercase, then anything that is not in (.!?), then one of them 
>>> pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M) 
>>> pat.findall('OMG is this a question ! Is this a sentence ? My. name is.') 
['OMG is this a question !', 'Is this a sentence ?', 'My.']

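Note how "name is." is not in the result, because it does not start with an uppercase letter.

Your problem comes from using the ^ and $ anchors. Without re.MULTILINE they match only at the start and end of the entire text, not of each line.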
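
A minimal sketch of that anchoring behaviour, using the asker's pattern on a string that mimics the input file (the exact file contents, including the position of the newline, are an assumption):

```python
import re

# The asker's pattern, unchanged.
asker_pattern = r'^[A-Z].+\s+[.!?]$'

# Two lines, like the input file: "." cannot cross the newline, and ^/$ only
# anchor to the start and end of the whole string, so nothing matches.
two_lines = "OMG is this a question ! Is this a sentence ? My. \nname is."
whole_text = re.findall(asker_pattern, two_lines, re.I)
print(whole_text)  # []

# With the trailing "My. name is." removed, the whole remaining line is one match.
one_line = re.findall(asker_pattern, "OMG is this a question ! Is this a sentence ?", re.I)
print(one_line)  # ['OMG is this a question ! Is this a sentence ?']
```

This reproduces both observations from the question: no output for the full file, and a single combined match once the last part is removed.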


2010-08-23 15:38:51

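Thanks a lot. Since I have to process txt files, I adapted it to re.findall. Is there a way to keep the '\n' character from appearing in the results? I mean, in sentences that wrap onto another line, a \n shows up between the words. – sarevok 2010-08-23 16:07:42

@sarevok: You can remove the \n before splitting, with text.replace('\n', ''). – 2010-08-23 17:20:14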
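
A short sketch of that suggestion. One deliberate deviation, which is my assumption rather than what the comment says: the newline is replaced with a space instead of an empty string, so that the words on either side of the line break do not fuse together.

```python
import re

# Illustrative text with sentences that wrap across lines.
text = "Is this a\nsentence ? My\nname is."

# Flatten the line breaks first, then match sentences on the single-line text.
flattened = text.replace('\n', ' ')
sentences = re.findall(r'[A-Z][^.!?]*[.!?]', flattened)
print(sentences)  # ['Is this a sentence ?', 'My name is.']
```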

Thanks again :) – sarevok 2010-08-26 12:04:30

I tried it in Notepad++ and got this:

.*$

And activate the multiline option:

re.MULTILINE

Cheers


2010-08-23 15:37:43 Arslan

Try it the other way around: split the text at the sentence boundaries.

lines = re.split(r'\s*[!?.]\s*', text)

If that doesn't work, add a \ before the . character.
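
A quick sketch of what the split approach returns on an illustrative string. Two things worth knowing before relying on it: the terminating punctuation is discarded, and a terminator at the very end of the text leaves an empty string as the last item.

```python
import re

text = "OMG is this a question ! Is this a sentence ? My name is."

# Split on a terminator plus any surrounding whitespace.
parts = re.split(r'\s*[!?.]\s*', text)
print(parts)  # ['OMG is this a question', 'Is this a sentence', 'My name is', '']

# Drop the empty trailing piece.
sentences = [part for part in parts if part]
print(sentences)  # ['OMG is this a question', 'Is this a sentence', 'My name is']
```

Inside a character class such as [!?.] the dot is already literal, so escaping it is optional here.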


2010-08-23 15:38:31

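There are two problems with your regular expression:

Your expression is anchored by ^ and $, which are the "start of line" and "end of line" anchors respectively. This means your pattern is trying to match an entire line of your text.
You are searching for \s+ before the punctuation mark, which specifies one or more whitespace characters. If there is no whitespace before the punctuation, the expression will not match.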
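
The second point can be checked in isolation with a small sketch; the two test strings are illustrative and differ only in the space before the question mark.

```python
import re

# The asker's pattern, compiled without any flags.
pattern = re.compile(r'^[A-Z].+\s+[.!?]$')

# A space before the punctuation satisfies \s+ ...
spaced = pattern.findall("Is this a sentence ?")
print(spaced)  # ['Is this a sentence ?']

# ... but ordinary punctuation, with no space in front, never matches.
unspaced = pattern.findall("Is this a sentence?")
print(unspaced)  # []
```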


2010-08-23 15:39:08

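Upvoted for actually explaining the two things that were wrong, rather than just handing out a fixed regex. – cincodenada 2010-08-23 15:58:44

Edit: now it works with multi-line sentences too.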

>>> t = "OMG is this a question ! Is this a sentence ? My\n name is." 
>>> re.findall(r"[A-Z].*?[\.!?]", t, re.MULTILINE | re.DOTALL) 
['OMG is this a question !', 'Is this a sentence ?', 'My\n name is.']

Only one thing is left to explain: re.DOTALL makes . match newline characters as well, as described in the Python re documentation.
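
A minimal side-by-side sketch of the flag's effect, using a shortened version of the string above:

```python
import re

t = "My\n name is."

# Without re.DOTALL, "." stops at the newline, so the sentence is never completed.
without_dotall = re.findall(r"[A-Z].*?[.!?]", t)
print(without_dotall)  # []

# With re.DOTALL, "." also matches "\n" and the sentence spans both lines.
with_dotall = re.findall(r"[A-Z].*?[.!?]", t, re.DOTALL)
print(with_dotall)  # ['My\n name is.']
```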


2010-08-23 15:39:39 cji

You can try:

p = open('a') 
process = p.read() 
print process 
regexMatch = re.findall('[^.!?]+[.!?]',process) 
print regexMatch 
p.close()

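The regular expression used here is [^.!?]+[.!?], which tries to match one or more characters that are not sentence delimiters, followed by a sentence delimiter.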
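
Here is a sketch of the same pattern in Python 3, run on a string instead of a file. The whitespace clean-up step is my addition, not part of the answer: because the pattern does not require an uppercase first letter, each match after the first starts with the whitespace (or newline) that followed the previous terminator.

```python
import re

# Stand-in for the file contents from the question (newline position assumed).
text = "OMG is this a question ! Is this a sentence ? My. \nname is."

raw = re.findall(r'[^.!?]+[.!?]', text)
print(raw)  # ['OMG is this a question !', ' Is this a sentence ?', ' My.', ' \nname is.']

# Collapse runs of whitespace, including newlines, and trim both ends.
sentences = [' '.join(match.split()) for match in raw]
print(sentences)  # ['OMG is this a question !', 'Is this a sentence ?', 'My.', 'name is.']
```

Unlike the uppercase-anchored patterns, this one also keeps "name is.".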


2010-08-23 15:40:43 codaddict

Thanks to cji and Jochen Ritzel.

sentence=re.compile(r"[A-Z].*?[\.!?] ", re.MULTILINE | re.DOTALL)

I think this is the best one; it just adds a space at the end.

SampleReport='I image from 08/25 through 12. The patient image 1.2, 23, 34, 45 and 64 from serise 34. image look good to have a tumor in this area. It has been resected during the interval between scans. The'

If you use

pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M) 
pat.findall(SampleReport)

the result will be:

['I image from 08/25 through 12.', 
'The patient image 1.', 
'It has been resected during the interval between scans.']

The flaw is that it cannot handle numbers like 1.2. But this one works perfectly:

sentence.findall(SampleReport)

Result:

['I image from 08/25 through 12. ', 
'The patient image 1.2, 23, 34, 45 and 64 from serise 34. ', 
'It has been resected during the interval between scans. ']
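
One limitation of requiring a literal trailing space is that a final sentence with nothing after it is skipped, and every match carries the space along. As a possible variant (my own sketch, not from the answers above), a lookahead can require whitespace or the end of the text without consuming it. The sample string is illustrative.

```python
import re

text = "Image 1.2 looks fine. It was resected."

# Pattern from the answer: the terminator must be followed by a literal space.
with_space = re.compile(r"[A-Z].*?[.!?] ", re.DOTALL)
print(with_space.findall(text))  # ['Image 1.2 looks fine. ']

# Hypothetical variant: look ahead for whitespace or end of text instead.
with_lookahead = re.compile(r"[A-Z].*?[.!?](?=\s|$)", re.DOTALL)
print(with_lookahead.findall(text))  # ['Image 1.2 looks fine.', 'It was resected.']
```

Both versions still treat "1.2" correctly, because the period there is followed by a digit rather than whitespace.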


2018-03-07 20:51:25
