2016-11-24 185 views
1

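Regex to parse a string

I'm struggling to parse some text correctly. The text has a lot of variation. Ideally I'd like to do this in Python, but any language would work.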

Example strings:

  • "if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99"
  • "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period."
  • "if magic code is 4542 it is not valid in type."
  • "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number."

As a result, I would like:

  • [543] [5642, 912342, 7425][type has to have a period.]
  • [722, 43, 643256][43234, 5356, and 2112][type has to start with period.]
  • [4542][it is not valid in type.]
  • [532][43][the type must begin with law number.]

There are some other variations, but you get the idea. Sorry, I'm not very good with regex.

+0

What are the rules for the part of the string you are looking to grab? – idjaw

+0

The results I want are in the post. –

+2

*The text has a lot of variation* - it looks like you want natural language recognition, not regex. –

Answers

1

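Well... this is what you asked for. It is quite ugly, though, and the examples you provided are very specific. I suspect it will fail on a real data file.

When faced with this kind of parsing job, one way to tackle it is to run the input data through some preliminary cleanup, simplifying and rationalizing the text as much as possible. For example, handling the different flavors of integer lists is annoying and makes the regexes more complicated. If you can remove the unnecessary commas between integers and drop the terminal "or"/"and", the regexes can be much simpler. Once that cleanup is done, you can sometimes apply one or more regexes to extract the bits you need. In some cases, the handful of outliers that don't fit the main regexes can be handled with specific lookups or hard-coded special-case rules.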

import re

lines = [
    "if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99",
    "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period.",
    "if magic code is 4542 it is not valid in type.",
    "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number.",
]

# Each list pattern tries, in order: "N or N", "N, N, ... or N", then a single N.
mcs_rgx = re.compile(r'magic code is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
types_rgx = re.compile(r'types? is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
# Preferred pattern for the trailing rule text.
rest_rgx1 = re.compile(r'(type (has|must).+)')
# Fallback: whatever follows the last digit that still has text after it.
rest_rgx2 = re.compile(r'.+\d(.+)')
nums_rgx = re.compile(r'\d+')

for line in lines:

    # Magic codes: grab the list text, then pull the integers out of it.
    m = mcs_rgx.search(line)
    if m:
        mcs_text = m.group(1)
        mcs = list(map(int, nums_rgx.findall(mcs_text)))
    else:
        mcs = []

    # Types: same approach; some lines have no type list at all.
    m = types_rgx.search(line)
    if m:
        types_text = m.group(1)
        types = list(map(int, nums_rgx.findall(types_text)))
    else:
        types = []

    # Rule text: try the specific pattern first, then the fallback.
    m = rest_rgx1.search(line)
    if m:
        rest = [m.group(1)]
    else:
        m = rest_rgx2.search(line)
        if m:
            rest = [m.group(1)]
        else:
            rest = ['']

    print(mcs, types, rest)

Output:

[543] [5642, 912342, 7425] ['type has to have a period. EX: 02-15-99'] 
[722, 43, 643256] [43234, 5356, 2112] ['type has to start with period.'] 
[4542] [] [' it is not valid in type.'] 
[532] [43] ['type must begin with law number.'] 
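
The preliminary cleanup described in this answer could look roughly like the following. This is only a sketch and not part of the code above: the names `normalize`, `INT_LIST` and `extract` are made up for illustration, and it assumes the integer lists only ever use commas, "or" and "and" as separators.

```python
import re

def normalize(line):
    """Flatten integer lists such as '722, 43, or 643256' into '722 43 643256'."""
    # Drop an "or"/"and" (with its optional comma) that sits between two integers.
    text = re.sub(r'(?<=\d),?\s+(?:or|and)\s+(?=\d)', ' ', line, flags=re.IGNORECASE)
    # Drop any remaining comma that sits between two integers.
    text = re.sub(r'(?<=\d),\s*(?=\d)', ' ', text)
    return text

# After cleanup, every list is just integers separated by single spaces.
INT_LIST = r'(\d+(?: \d+)*)'
mcs_rgx = re.compile(r'magic code is ' + INT_LIST, re.IGNORECASE)
types_rgx = re.compile(r'types? is ' + INT_LIST, re.IGNORECASE)

def extract(line):
    """Return [magic_codes, types] as two lists of ints (empty if absent)."""
    text = normalize(line)
    result = []
    for rgx in (mcs_rgx, types_rgx):
        m = rgx.search(text)
        result.append([int(n) for n in m.group(1).split()] if m else [])
    return result
```

With the lists flattened first, a single short pattern replaces the three-way alternation, at the cost of a separate pass over the text. Lines that still fail to match would be candidates for the special-case handling mentioned above.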
+0

How did you get good at regex? Thanks for the post. Do you have any advice for someone learning regex? –

+0

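@Lukeallthingsspatial Mostly practice. Jeffrey Friedl's "Mastering Regular Expressions" is very good, but there are plenty of online tutorials too. One key thing to keep in mind: the goal is not to learn regex; rather, it is to become skilled at text parsing. That is a broader endeavor, and regex is just one tool within it. With text parsing, you need to focus on your leverage points: which aspects of the text are the commonalities (the "rules", so to speak) that will let you break the text down into simpler parts. Good luck! – FMc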

0

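Here is a regex solution plus some cleanup after the fact. It works for all of your examples, but as noted in the comments, if your sentences vary much more than this, you should explore options other than regex.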

import re 

sentences = ["if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99", 
      "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period.", 
      "if magic code is 4542 it is not valid in type.", 
      "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number."] 

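# Raw strings keep \s and \d from being treated as (invalid) string escapes.
pat = r'(?i)^if\smagic\scode\sis\s(\d+(?:,?\s(?:\d+|or))*)(?:.*types?\sis\s(\d+(?:,?\s(?:\d+|or|and))*,)(.*\.)|(.*\.))'

find_ints = lambda s: [int(d) for d in re.findall(r'\d+', s)]

# Keep only the groups that actually matched (unused alternatives are None).
matches = [[g for g in re.match(pat,s).groups() if g] for s in sentences]

# All groups but the last are integer lists; the last is the rule text.
results = [[find_ints(m) for m in match[:-1]]+[[match[-1].strip()]] for match in matches]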

If you need it printed nicely, like in your example:

for r in results: 
    print(*r, sep='')
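
Because the pattern in this answer is dense, it may help to see it spread out with `re.VERBOSE` and commented group by group. The following is a sketch of the same pattern, not a different technique; the names `PAT_VERBOSE` and `parse` are invented here, and returning `None` for a sentence that does not match is an added assumption (the list comprehension above would raise an `AttributeError` instead).

```python
import re

# Same pattern, one piece per line. In VERBOSE mode literal whitespace is
# ignored, which is fine here because the pattern only uses \s for spaces.
PAT_VERBOSE = re.compile(r"""
    ^if\smagic\scode\sis\s
    (\d+(?:,?\s(?:\d+|or))*)            # group 1: magic code list
    (?:
        .*types?\sis\s
        (\d+(?:,?\s(?:\d+|or|and))*,)   # group 2: type list, up to its trailing comma
        (.*\.)                          # group 3: rule text, up to the last period
      |
        (.*\.)                          # group 4: rule text when there is no type list
    )
""", re.IGNORECASE | re.VERBOSE)

def parse(sentence):
    """Return [codes, (types,) [rule_text]] or None if the sentence does not match."""
    m = PAT_VERBOSE.match(sentence)
    if m is None:
        return None
    groups = [g for g in m.groups() if g]   # drop the alternatives that did not match
    int_lists = [[int(d) for d in re.findall(r'\d+', g)] for g in groups[:-1]]
    return int_lists + [[groups[-1].strip()]]
```

Checking for `None` makes it easier to collect the sentences that fall outside the pattern, which is a quick way to judge whether regex is still a reasonable fit for the real data.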