Python regex for parsing MediaWiki sections

I have text similar to the following:

==Mainsection1== 
Some text here 
===Subsection1.1=== 
Other text here 

==Mainsection2== 
Text goes here 
===Subsecttion2.1=== 
Other text goes here.

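In the text above, main sections 1 and 2 have different names, which can be anything the user wants. The same goes for the subsections.

What I want the regex to do is capture the text of each main section, including its subsections (if any). Yes, this comes from a wiki page. All main section names begin with == and end with ==. All subsection names are wrapped in more than two = signs.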

import re

regex = re.compile(r'==(.*)==([^=]*)', re.MULTILINE)
regex.findall(text)

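However, the above returns every section separately. That is, it returns a main section perfectly well, but it treats each subsection as a section of its own.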
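A minimal reproduction of the behaviour described above may help. The sample string is the text from the question; the variable names `text` and `matches` are only illustrative.

```python
import re

# Sample text from the question, written as an explicit string.
text = (
    "==Mainsection1==\n"
    "Some text here\n"
    "===Subsection1.1===\n"
    "Other text here\n"
    "\n"
    "==Mainsection2==\n"
    "Text goes here\n"
    "===Subsecttion2.1===\n"
    "Other text goes here.\n"
)

regex = re.compile(r'==(.*)==([^=]*)', re.MULTILINE)
matches = regex.findall(text)

# Each subsection heading is matched as if it were a section of its own,
# with one leftover '=' on each side of its captured title.
for title, body in matches:
    print(repr(title), repr(body))
```

This yields four matches rather than two: the greedy `(.*)` swallows the extra `=` of a subsection heading, and `([^=]*)` stops the body at the first `=` it meets, so a main section's body never reaches past its first subsection.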

I hope someone can help me with this, as it has been bugging me for some time.

Edit: the result should be:

[('Mainsection1', 'Some text here\n===Subsection1.1===\nOther text here\n'), ('Mainsection2', 'Text goes here\n===Subsecttion2.1===\nOther text goes here.\n')]

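Edit 2:
I have rewritten my code so that it no longer uses a regex. I came to the conclusion that parsing it myself is easy enough, and it makes the code more readable to me.

So here is my code: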

def createTokensFromText(text):
    sections = []
    cur_section = None
    cur_lines = []

    for line in text.split('\n'):
        line = line.strip()
        # A main section heading starts with exactly two equals signs.
        if line.startswith('==') and not line.startswith('==='):
            if cur_section:
                sections.append((cur_section, '\n'.join(cur_lines)))
                cur_lines = []
            cur_section = line
            continue
        # Everything else, subsection headings included, belongs to the current section.
        if cur_section:
            cur_lines.append(line)

    # Flush the last section.
    if cur_section:
        sections.append((cur_section, '\n'.join(cur_lines)))
    return sections
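The function above keeps the `==` markers in each section name. A variant that strips them, so the result has the shape requested in the first edit, might look like the sketch below. The name `parse_sections` is hypothetical, and the sketch assumes every heading sits on a line of its own.

```python
def parse_sections(text):
    """Group lines under level-2 headings; deeper headings stay in the body."""
    sections = []
    name, lines = None, []
    for line in text.split('\n'):
        stripped = line.strip()
        # Exactly two leading '=' and something between the markers.
        is_main = (stripped.startswith('==')
                   and not stripped.startswith('===')
                   and stripped.endswith('==')
                   and len(stripped) > 4)
        if is_main:
            if name is not None:
                sections.append((name, '\n'.join(lines)))
            name, lines = stripped[2:-2].strip(), []
        elif name is not None:
            lines.append(stripped)
    if name is not None:
        sections.append((name, '\n'.join(lines)))
    return sections


sample = (
    "==Mainsection1==\nSome text here\n===Subsection1.1===\nOther text here\n\n"
    "==Mainsection2==\nText goes here\n===Subsecttion2.1===\nOther text goes here.\n"
)
sections = parse_sections(sample)
print(sections)
```

Text that appears before the first main heading is dropped, as it is in the original function.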

Thanks for all the help, everyone!

All the answers provided helped me a lot!

Source

2011-10-17 Fox Mulder

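Maybe you would be better off using a pre-existing MediaWiki markup parser? At first glance at https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Alternative_parsers, mwlib looks the most promising. – slowdog

This is not a good job for a regex. You would be better off using a real parser (such as PLY or PyParsing), or better still, a library someone else has already written. – jathanism

This may not be a great fit for a regex, but it is certainly doable. The question is how close your particular syntax is to what the available wiki parsers expect, and why you are deviating from the "standard", or at least popular, syntax. –

First off, you should know that I know a bit about Python but have never formally programmed in it... Codepad says this works, so here goes! :D Sorry the expression is so complicated: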

(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))

I believe this does what you asked for! Over on Codepad, this code:

import re

wikiText = """==Mainsection1==
Some text here
===Subsection1.1===
Other text here

==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here. """

outputArray = re.findall(r'(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))', wikiText)
print(outputArray)

produces the following result:

[('Mainsection1', '\nSome text here\n===Subsection1.1===\nOther text here\n\n'), ('Mainsection2', '\nText goes here\n===Subsecttion2.1===\nOther text goes here. ')]

EDIT: To break it down, the expression basically says:

01 (?<!=)      # First, look behind to assert that there is not an equals sign
02 ==          # Match two equals signs
03 ([^=]+)     # Capture one or more characters that are not an equals sign
04 ==          # Match two equals signs
05 (?!=)       # Then verify that there are no equals signs following this
06 (           # Start a capturing group
07 [\s\S]*?    # Match zero or more of ANY character (even CrLf), but BE LAZY
08 (?=         # Look ahead to verify that either...
09   $         #   this is the end of the source text
10   |         #   -OR-
11   (?<!=)    #   when I look behind there is no equals sign
12   ==        #   then there are two equals signs
13   [^=]+     #   then one or more characters that are not equals signs
14   ==        #   then two equals signs
15   (?!=)     #   then verify that there are no equals signs following this
16 )           # End look-ahead group
17 )           # End capturing group

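Lines 03 and 06 specify the capturing groups for the main section title and the main section content, respectively.

Line 07 begs a lot of explanation if you are not very fluent in regular expressions...

The \s and \S inside the character class [] match anything that is whitespace or is not whitespace (in other words, any character at all). An alternative is the . operator, but depending on your compile options (or your ability to specify them), it may or may not match CrLf (carriage return/line feed). Since you want to match across multiple lines, this is the simplest way to make sure it matches.
The *? at the end means it will match zero or more instances of the "anything" character class, but be lazy about it. A "lazy" quantifier (sometimes called "reluctant") is the opposite of the default "greedy" quantifier (the one without the trailing ?). It will not consume a source character unless the source that follows cannot be matched by the expression after the lazy quantifier. In other words, this consumes characters until it finds either the end of the source text or another main section, which is specified as one or more characters that are not equals signs (whitespace included) enclosed by two, and only two, equals signs on each side. Without the lazy operator, it would try to consume the entire source text and then "backtrack" until it could match one of the things that follow it in the expression (end of source or a section header).

Line 08 is a "look-ahead": it specifies that the expression inside it must be matchable at that point, but must not be consumed.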
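The difference between the lazy and the greedy quantifier can be checked with a small sketch. The sample string and the names `source`, `heading`, `lazy` and `greedy` are made up for illustration; only the pattern comes from the answer.

```python
import re

source = "==A==\nbody of A\n===A.1===\nmore\n==B==\nbody of B\n"

# Look-ahead target: a main heading with exactly two '=' on each side.
heading = r'(?<!=)==[^=]+==(?!=)'
prefix = r'(?<!=)==([^=]+)==(?!=)'

# Lazy: stop at the first position where the look-ahead succeeds.
lazy = re.search(prefix + r'([\s\S]*?(?=$|' + heading + r'))', source)

# Greedy: grab everything first, then give back only as much as needed.
greedy = re.search(prefix + r'([\s\S]*(?=$|' + heading + r'))', source)

print(repr(lazy.group(2)))    # body of section A only, subsection included
print(repr(greedy.group(2)))  # everything up to the end of the source
```

With the lazy form the first match ends just before `==B==`. With the greedy form the `$` alternative is already satisfied at the end of the text, so the first section swallows every section after it.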

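END EDIT

As far as I know, it has to be this complicated in order to exclude the subsections correctly... If you want the section name and the section content captured in named groups, you can try this: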

(?<!=)==(?P<SectionName>[^=]+)==(?!=)(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))
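As a sketch of how the named groups might be used, the pattern can be compiled once and iterated with `re.finditer`. The names `SECTION_RE`, `wiki_text` and `sections` are illustrative; the group names are the ones in the expression above.

```python
import re

SECTION_RE = re.compile(
    r'(?<!=)==(?P<SectionName>[^=]+)==(?!=)'
    r'(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))'
)

wiki_text = (
    "==Mainsection1==\nSome text here\n===Subsection1.1===\nOther text here\n\n"
    "==Mainsection2==\nText goes here\n===Subsecttion2.1===\nOther text goes here. "
)

# Map each main section name to its content, subsections included.
sections = {
    m.group('SectionName'): m.group('SectionContent')
    for m in SECTION_RE.finditer(wiki_text)
}

for name, content in sections.items():
    print(name, repr(content))
```

A dictionary assumes section names are unique; if they can repeat, a list of `(name, content)` tuples keeps every match.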

I can break it down for you if you like! Just ask! EDIT (see the edit above) END EDIT

Source

2011-10-17 16:54:31

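Thank you very much for this too. Could you break down the regex for me? –

Thank you very much! A very clean explanation –

@Fox If my answer or any of the others helped you, please select the best one by clicking the check mark below the voting arrows near the top of that answer. It helps encourage more good and useful answers in the future :D –

The problem here is that ==(.*)== matches ==(=Subsection=)==, so the first thing to do is make sure there is no = inside the title: ==([^=]*)==([^=]*).

Then we need to make sure there is no = before the start of the match. Otherwise the first = of the three is simply skipped and the subsection title matches anyway. This will do the trick: (?<!=)==([^=]*)==([^=]*), where (?<!=) means "match only if not preceded by ...".

We can do the same at the end of the title, which gives the final result: (?<!=)==([^=]*)==(?!=)([^=]*).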

>>> re.findall('(?<!=)==([^=]*)==(?!=)([^=]*)', x,re.MULTILINE) 
[('Mainsection1', '\nSome text here\n'), 
('Mainsection2', '\nText goes here\n')]

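You can also drop the check at the end of the title and replace it with a newline. That may be better if you are sure every title is followed by a newline.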

>>> re.findall('(?<!=)==([^=]*)==\n([^=]*)', x,re.MULTILINE) 
[('Mainsection1', 'Some text here\n'), ('Mainsection2', 'Text goes here\n')]

Edit:

section = re.compile(r"(?<!=)==([^=]*)==(?!=)")

result = []
mo = section.search(x)
previous_end = 0
previous_section = None
while mo is not None:
    start = mo.start()
    if previous_section:
        result.append((previous_section, x[previous_end:start]))
    previous_section = mo.group(0)
    previous_end = mo.end()
    mo = section.search(x, previous_end)
result.append((previous_section, x[previous_end:]))
print(result)

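It is simpler than it looks: we repeatedly search for one section title after another, and each time we find one we add the previous title to the result, together with the text between the end of that previous title and the start of the new one. Adapt it to your style and your needs. The result is: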

[('==Mainsection1==', 
    ' \nSome text here \n===Subsection1.1=== \nOther text here \n\n'), 
('==Mainsection2==', 
    ' \nText goes here \n===Subsecttion2.1=== \nOther text goes here. ')]
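The same idea can also be sketched with `re.split`, which is a different technique from the search loop above. When the pattern contains a capturing group, `re.split` includes the captured titles in the list it returns. The names `HEADING_RE` and `split_sections` are hypothetical, and unlike the loop this sketch uses `[^=]+` and returns the bare title without the `==` markers.

```python
import re

HEADING_RE = re.compile(r'(?<!=)==([^=]+)==(?!=)')


def split_sections(text):
    """Return (title, body) pairs for every main section in text."""
    parts = HEADING_RE.split(text)
    # parts[0] is whatever precedes the first heading;
    # after that the list alternates title, body, title, body, ...
    return list(zip(parts[1::2], parts[2::2]))


x = (
    "intro\n==Mainsection1==\nSome text here\n===Subsection1.1===\n"
    "Other text here\n\n==Mainsection2==\nText goes here\n"
)
result = split_sections(x)
print(result)
```

Text before the first heading ends up in `parts[0]` and is ignored here; keep it if your pages have a lead paragraph.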

Source

2011-10-17 14:55:54 madjar

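Well, we are getting there, but I am still missing something. I also need the text from the subsections, so the result would be: [('Mainsection1', 'Some text here\n===Subsection1.1=== Other text here\n') Thanks so far –

I am afraid we have reached the limit of my regex knowledge. What I would do is use the first part of my regex ('(?<!=)==([^=]*)==(?!=)') to detect the titles and then take the text between the matches. Do you want me to elaborate on that idea, or is it not an option? – madjar

That could work, so if you are willing, please go ahead! Thanks a lot so far –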
