Matching a location with a regular expression | Python

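I scraped several articles from a website and am now trying to extract the location of each news item. The location is written in capital letters, either as the capital city alone (e.g. "BRUSSELS —") or, in some cases, together with the country (e.g. "BRUSSELS, Belgium —").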

Here is a sample of the articles:

|[<p>Advertisement , By MILAN SCHREUER and  ALISSA J. RUBIN OCT. 5, 2016 
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] 
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016 
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Monday and wounded]

This is the regex I am using:

import re

with open("Training_News_6.csv") as text_open:
    text_read = text_open.read()
# Raw string so that \w and \s reach the regex engine unchanged
pattern = r"[A-Z]{1,}\w+\s\—"
result = re.findall(pattern, text_read)
print(result)

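I use the dash (—) in the pattern because it is the recurring marker that follows the location.

This regex does extract "BRUSSELS —", but for "KABUL, Afghanistan —" it only extracts the last part, i.e. "Afghanistan —". In that second case I want to extract the whole location: the capital and the country. Any ideas?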
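The behaviour described above can be reproduced on a shortened copy of the sample. This is only a small sketch: the `sample` string is an abbreviated version of the two datelines, and the pattern is the one from the question written as a raw string. The comma after "KABUL" is matched by neither `\w` nor `\s`, so the match can only begin at "Afghanistan".

```python
import re

# Two shortened datelines taken from the sample above.
sample = (
    ", BRUSSELS — A man wounded two police officers with a knife\n"
    ", KABUL, Afghanistan — A Taliban suicide bomber killed at least three people"
)

# The pattern from the question, written as a raw string.
pattern = r"[A-Z]{1,}\w+\s—"

matches = re.findall(pattern, sample)
print(matches)  # ['BRUSSELS —', 'Afghanistan —']
```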

Source

2016-11-24 M.Huntz

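Try r'([A-Z]+)(?:\W+\w+)?\s*—'. See https://regex101.com/r/ATgIeZ/1

When I run it, it only matches the uppercase part when the capital is followed by a comma and the country, which I also want to extract.

Just move the ')' a bit further: https://regex101.com/r/ATgIeZ/2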
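To illustrate what moving the parenthesis changes, here is a minimal sketch. It assumes the second link points to the pattern with the capturing group closed after the optional part; the `dateline` string is a shortened piece of the sample. `re.findall` returns the contents of group 1, so the position of `)` decides how much comes back.

```python
import re

dateline = ", KABUL, Afghanistan — A Taliban suicide bomber"

# First suggestion: only the capital sits inside the capturing group.
capital_only = re.findall(r"([A-Z]+)(?:\W+\w+)?\s*—", dateline)
# Second suggestion: the closing parenthesis moved past the optional part.
with_country = re.findall(r"([A-Z]+(?:\W+\w+)?)\s*—", dateline)

print(capital_only)  # ['KABUL']
print(with_country)  # ['KABUL, Afghanistan']
```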

You can use

([A-Z]+(?:\W+\w+)?)\s*—

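See the regex demo.

Details:

([A-Z]+(?:\W+\w+)?) - capturing group 1 (its contents are what re.findall returns), which matches:
- [A-Z]+ - 1 or more ASCII uppercase letters
- (?:\W+\w+)? - an optional (because of the ? quantifier) non-capturing group: 1 or more non-word characters (\W+) followed by 1 or more word characters (\w+)
\s* - 0 or more whitespace characters
— - a literal — symbol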
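The same breakdown can also be kept next to the pattern itself by using `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the regex. This is just an alternative way of writing the pattern above; the name `location_rx` is made up for the example.

```python
import re

location_rx = re.compile(
    r"""
    (               # start of capturing group 1
      [A-Z]+        # one or more ASCII uppercase letters
      (?:\W+\w+)?   # optionally: non-word characters, then word characters
    )               # end of capturing group 1
    \s*             # zero or more whitespace characters
    —               # a literal em dash
    """,
    re.VERBOSE,
)

print(location_rx.findall("KABUL, Afghanistan — A Taliban suicide bomber"))
# ['KABUL, Afghanistan']
```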

Python demo:

import re 
rx = r"([A-Z]+(?:\W+\w+)?)\s*—" 
s = "|[<p>Advertisement , By MILAN SCHREUER and  ALISSA J. RUBIN OCT. 5, 2016 \n, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] \n[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016 \n, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo" 
print(re.findall(rx, s)) # => ['BRUSSELS', 'KABUL, Afghanistan']
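One caveat: the optional group admits only one extra word, so a dateline with a two-word city plus a country (say "NEW DELHI, India —") would come back as "DELHI, India". If such datelines occur in the data, a possible extension is sketched below. It assumes the city is one or more uppercase words separated by single spaces and the country is made of capitalised words; `multiword_rx` and the last two test strings are hypothetical, not taken from the scraped articles.

```python
import re

# Assumed extension: space-separated uppercase words for the city, then an
# optional ", Country" part made of capitalised words.
multiword_rx = re.compile(
    r"([A-Z]+(?: [A-Z]+)*(?:, [A-Z]\w*(?: [A-Z]\w*)*)?)\s*—"
)

tests = [
    ", BRUSSELS — A man wounded",              # from the sample
    ", KABUL, Afghanistan — A Taliban",        # from the sample
    ", NEW DELHI, India — Example text",       # hypothetical dateline
    ", SAN JUAN, Puerto Rico — Example text",  # hypothetical dateline
]
locations = [multiword_rx.findall(t) for t in tests]
print(locations)
# [['BRUSSELS'], ['KABUL, Afghanistan'], ['NEW DELHI, India'], ['SAN JUAN, Puerto Rico']]
```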

Source

2016-11-24 21:39:50

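One thing you can do is add , and \s to your first character class, and then strip all whitespace and commas from the left:

,[A-Z,\s]{1,}\w+\s\—

Or even simpler, something like this:

,(.+)\—

$1 will be your match, including some extra symbols. Another possible option:

,\s*([A-Za-z]*[,\s]*[A-Za-z]*)\s\—

or a simplified version:

,\s*([A-Za-z,\s]*)\s\—

Again, $1 is your match.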
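In Python, $1 corresponds to group 1, which is what `re.findall` returns when the pattern has a single capturing group. A minimal sketch of two of these variants, run on a shortened copy of the sample (the string `s` and the variable names are made up for the example):

```python
import re

s = (
    "OCT. 5, 2016 \n, BRUSSELS — A man wounded two police officers\n"
    "FEB. 8, 2016 \n, KABUL, Afghanistan — A Taliban suicide bomber"
)

# Simplified variant from this answer; group 1 plays the role of $1.
anchored = re.findall(r",\s*([A-Za-z,\s]*)\s—", s)
print(anchored)  # ['BRUSSELS', 'KABUL, Afghanistan']

# The broader ,(.+)— variant keeps surrounding whitespace, so strip it afterwards.
loose = [m.strip(" ,") for m in re.findall(r",(.+)—", s)]
print(loose)  # ['BRUSSELS', 'KABUL, Afghanistan']
```

Note that `.+` is greedy, so if a line contained a second — further along, the broader variant would capture everything up to the last one; the character-class version is less prone to that.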

Source

2016-11-24 21:33:15
