How to extract text between two different matches?

I have a set of text files from which I need to extract text. The contents look like this:

ITEM A blah blah blah ITEM B bloo bloo bloo ITEM A blee blee blee ITEM B

Here is the working code I have so far:

import re

finda = r'(Item\sA)'
findb = r'(Item\sB)'
match_a = re.finditer(finda, usefile, 2)  # the "2" is a flag to say ignore case
match_b = re.finditer(findb, usefile, 2)

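I know that I can use methods such as span, start, and end to find the positions of the matched text. But I need to do this many times, so what I need is:

Start writing at Item A and stop at Item B.
If the first iteration is shorter than 50 characters, discard it and move on to the next one.
Once you find a block that starts with Item A, ends with Item B, and is longer than 50 characters, write it to a file.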

Thanks a ton in advance! I have been spinning my wheels.


2010-06-22 dandyjuan

Why not simply:

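import re

# fname is the output path; subject holds the input text
with open(fname, 'w') as file:
    for match in re.finditer(r'Item A(.+?)Item B', subject, re.I):
        s = match.group(1)  # text between the two delimiters
        if len(s) > 50:
            file.write(s)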

Note: instead of using the actual numeric value of the flag, use the flag constants provided by re.
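As a minimal sketch of the approach above, here it is wrapped in a function so it can be tried without writing a file. The function name extract_blocks, the min_len parameter, and the sample string are made up for illustration. It also adds re.S, on the assumption that the text between the delimiters may span several lines; drop that flag if it never does.

```python
import re

def extract_blocks(text, min_len=50):
    """Return each stretch of text between 'Item A' and 'Item B' longer than min_len."""
    # re.I ignores case; re.S (an assumption here) lets '.' match newlines too.
    pattern = re.compile(r'Item A(.+?)Item B', re.I | re.S)
    return [m.group(1) for m in pattern.finditer(text) if len(m.group(1)) > min_len]

# Hypothetical sample: one block that is too short, one that is long enough.
sample = "ITEM A short ITEM B filler ITEM A " + "x" * 60 + " ITEM B"
blocks = extract_blocks(sample)
print(len(blocks))   # only the long block is kept
print(int(re.I))     # the numeric value behind the named flag
```

The named constants can be combined with |, which is harder to read when written as bare numbers.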


2010-06-22 17:35:32 SilentGhost

You should use a lookahead assertion for the end delimiter, to allow the start and end delimiters to overlap. – Gumbo 2010-06-22 17:46:18
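To illustrate the lookahead point, here is a small sketch that assumes a hypothetical marker SEC acting as both the start and the end delimiter. When the end delimiter is consumed by the match, it is no longer available to start the next one; a lookahead leaves it in place.

```python
import re

# Hypothetical input where the same marker closes one block and opens the next.
text = "SEC one SEC two SEC three SEC"

# The closing SEC is consumed, so the block after it cannot start there.
consuming = re.findall(r'SEC(.+?)SEC', text)

# The closing SEC is only asserted, so it can open the next block.
lookahead = re.findall(r'SEC(.+?)(?=SEC)', text)

print(consuming)
print(lookahead)
```

With distinct delimiters such as Item A and Item B the two forms find the same blocks; the difference only shows up when the delimiters can overlap.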

Thanks! Once I understood what all of this meant, I was able to get it working. – dandyjuan 2010-06-22 18:25:16

This can be done in a single regular expression:

import re

with open("output.txt", "w") as f:
    for match in re.finditer(r"(?<=Item\sA)(?:(?!Item\sB).){50,}(?=Item\sB)", subject, re.I):
        f.write(match.group() + "\n")

It matches what is between Item A and Item B. Or did you want to match the delimiters, too?

The regex explained:

(?<=Item\sA) # assert that we start our match right after "Item A" 
(?:   # start repeated group (non-capturing) 
    (?!Item\sB) # assert that we're not running into "Item B" 
    .   # then match any character 
){50,}   # repeat this at least 50 times 
(?=Item\sB) # then assert that "Item B" follows next (without making it part of the match)
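The free-spacing form above can also be used directly if the pattern is compiled with re.X (re.VERBOSE). The following is only a sketch: the sample string is made up, and re.S is added on the assumption that a block may contain newlines.

```python
import re

pattern = re.compile(r"""
    (?<=Item\sA)        # start right after "Item A"
    (?:                 # non-capturing group, repeated
        (?!Item\sB)     # do not run into "Item B"
        .               # any single character
    ){50,}              # at least 50 characters
    (?=Item\sB)         # "Item B" must follow, but is not part of the match
""", re.I | re.S | re.X)

# Hypothetical sample: the first block is too short, the second is long enough.
sample = "ITEM A short ITEM B ITEM A " + "y" * 60 + " ITEM B"
matches = pattern.findall(sample)
print(len(matches))
```

Note that {50,} keeps blocks of 50 characters or more, whereas a len(s) > 50 check keeps only blocks of 51 or more; use {51,} if the stricter limit is wanted.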


2010-06-22 17:36:22

This is great code, but it's complicated and hard to figure out. – vy32 2010-06-22 17:40:34

@vy32: I agree, so I have added a free-spacing version of the regex to explain it better. – 2010-06-22 17:45:27
