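So... this is what you asked for. But it is very ugly, and the examples you provided are very specific. I suspect it will fail on a real data file.
When faced with this kind of parsing job, one way to approach the problem is to run the input data through some preliminary cleanup first, simplifying and rationalizing the text as much as possible. For example, handling the different flavors of integer lists is annoying and makes the regular expressions more complicated. If you can remove the unnecessary commas between integers and drop the terminal "or"/"and", the regular expressions can be much simpler. Once that cleanup is done, you can sometimes apply one or more regular expressions to extract the bits you need. If only a small number of outliers fail to match the main regular expressions, they can be handled with specific lookups or hard-coded special-case rules.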
import re

lines = [
    "if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99",
    "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period.",
    "if magic code is 4542 it is not valid in type.",
    "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number.",
]

# Integer list after "magic code is" / "type(s) is":
# "N or N", "N, N, ... or N", or a single N.
mcs_rgx = re.compile(r'magic code is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
types_rgx = re.compile(r'types? is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
# Preferred remainder: the clause starting with "type has ..." or "type must ...".
rest_rgx1 = re.compile(r'(type (has|must).+)')
# Fallback remainder: everything after the last digit in the line.
rest_rgx2 = re.compile(r'.+\d(.+)')
nums_rgx = re.compile(r'\d+')

for line in lines:
    # Magic codes.
    m = mcs_rgx.search(line)
    if m:
        mcs_text = m.group(1)
        mcs = [int(n) for n in nums_rgx.findall(mcs_text)]
    else:
        mcs = []

    # Types.
    m = types_rgx.search(line)
    if m:
        types_text = m.group(1)
        types = [int(n) for n in nums_rgx.findall(types_text)]
    else:
        types = []

    # The rest of the rule: try the specific pattern first, then the fallback.
    m = rest_rgx1.search(line)
    if m:
        rest = [m.group(1)]
    else:
        m = rest_rgx2.search(line)
        if m:
            rest = [m.group(1)]
        else:
            rest = ['']

    print(mcs, types, rest)
Output:
[543] [5642, 912342, 7425] ['type has to have a period. EX: 02-15-99']
[722, 43, 643256] [43234, 5356, 2112] ['type has to start with period.']
[4542] [] [' it is not valid in type.']
[532] [43] ['type must begin with law number.']
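The cleanup idea mentioned above could look roughly like the following. This is only a sketch under assumptions, not part of the code that produced the output shown: `normalize`, `extract`, and `LIST_SEP_RGX` are hypothetical names, and the separator pattern assumes that "and"/"or" between two integers always means they belong to the same list, which may not hold for real data.

```python
import re

# Hypothetical cleanup pattern (an assumption, not from the original answer):
# matches the separator between two integers, i.e. an optional comma,
# whitespace, and an optional "or"/"and", but only when a digit follows.
LIST_SEP_RGX = re.compile(r'(?<=\d),?\s+(?:(?:or|and)\s+)?(?=\d)', re.IGNORECASE)

def normalize(line):
    """Collapse '1, 2, or 3' and '1 and 2' style lists into '1 2 3'."""
    return LIST_SEP_RGX.sub(' ', line)

# After cleanup, a single simple alternative covers every list shape.
MCS_RGX = re.compile(r'magic code is (\d+(?: \d+)*)', re.IGNORECASE)
TYPES_RGX = re.compile(r'types? is (\d+(?: \d+)*)', re.IGNORECASE)

def extract(rgx, line):
    """Return the integers captured by rgx in line, or [] if there is no match."""
    m = rgx.search(line)
    return [int(n) for n in m.group(1).split()] if m else []

line = ("If Magic Code is 722, 43, or 643256 and types is 43234, 5356, "
        "and 2112, type has to start with period.")
clean = normalize(line)
print(clean)
# If Magic Code is 722 43 643256 and types is 43234 5356 2112, type has to start with period.
print(extract(MCS_RGX, clean), extract(TYPES_RGX, clean))
# [722, 43, 643256] [43234, 5356, 2112]
```

The trade-off is that the list-handling logic lives in one small substitution instead of being repeated inside every extraction regex. Lines the simplified patterns still miss would be the candidates for the lookups or special-case rules.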
Source
2016-11-24 03:14:20
FMc
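What part of the string are you trying to grab the rules from? – idjaw
The results I want are in the post. –
*The text has a lot of variations* - it looks like you want natural language recognition, not regular expressions. –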