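I am trying to extract dates from several articles. When I test my regular expression, the pattern matches only part of the information I am interested in, as you can see here: https://regex101.com/r/ATgIeZ/2
This is a sample of the text file: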
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] 3004
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo JULY 14, 2034
This is the pattern and the code I am using for the extraction:
import re
text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
pattern = ("[A-Z]+\.*\s(\d+)\,\s(\d+){4}")
result = re.findall(pattern,text_read)
print(result)
The output from Anaconda is:
[('5', '6'), ('7', '5'), ('1', '6'), .....]
The expected output is:
OCT. 5, 2016, FEB. 8, 2016, JULY 14, 2034 .....
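The groups in parentheses capture only the digits. What is the expected output? (Also, the regex in your regex tester is different from the one in your code.)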
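When a pattern contains capture groups, `re.findall` returns only those groups, and a repeated group such as `(\d+){4}` keeps just its last repetition. That is why each tuple ends in a single digit. Below is a minimal sketch of one possible fix: drop the capture groups so that `findall` returns the whole match. It assumes every date is an upper-case month name or abbreviation, an optional period, a one- or two-digit day, a comma, and a four-digit year. The `sample` string (a shortened excerpt of the text above) and the helper name `extract_dates` are illustrative and not part of the original code.

```python
import re

# Shortened excerpt of the sample text from the question (illustrative only).
sample = (
    "[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016\n"
    "[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016\n"
    ", KABUL, Afghanistan ... killed at least three people on Mo JULY 14, 2034"
)

# Pattern from the question: two capture groups, the second repeated four
# times, so findall yields (day, last digit of the year) tuples.
original_pattern = r"[A-Z]+\.*\s(\d+)\,\s(\d+){4}"

# Assumed date shape: MONTH[.] D[D], YYYY. There are no capture groups,
# so findall returns each full match as a string.
date_pattern = re.compile(r"\b[A-Z]+\.?\s\d{1,2},\s\d{4}\b")


def extract_dates(text):
    """Return every substring of text shaped like 'OCT. 5, 2016'."""
    return date_pattern.findall(text)


print(re.findall(original_pattern, sample))
# [('5', '6'), ('8', '6'), ('14', '4')]
print(extract_dates(sample))
# ['OCT. 5, 2016', 'FEB. 8, 2016', 'JULY 14, 2034']
```

To run this on the real file, pass the `text_read` string from the question to `extract_dates`. If the month, day, and year are also needed separately, one option is to wrap the whole expression in an outer group and keep the inner pieces as non-repeated groups.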