Python regex - getting items from an org-mode file

I have the following org-mode syntax:

** Hardware [0/1] 
- [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6] 
- [X] Introduction to Networking - Charles Severance 
- [ ] A Tour of C++ - Bjarne Stroustrup 
- [ ] C++ How to Program - Paul Deitel 
- [X] Computer Systems - Randal Bryant 
- [ ] The C programming language - Brian Kernighan 
- [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2

I want to extract the items of a section, for example:

getitems "Hardware"

should give me:

- [ ] adapt a programmable motor to a tripod to be used for panning

If I ask for "Reading - Health", I should get:

- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2

Right now I'm using the following pattern:

pattern = re.compile("\*\* "+ head + " (.+?)\*?$", re.DOTALL)

The output when asking for "Reading - Technology" is:

- [X] Introduction to Networking - Charles Severance
- [ ] A Tour of C++ - Bjarne Stroustrup
- [ ] C++ How to Program - Paul Deitel
- [X] Computer Systems - Randal Bryant
- [ ] The C programming language - Brian Kernighan
- [ ] Beginning Linux Programming -Matthew and Stones
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2

I also tried:

pattern = re.compile("\*\* "+ head + " (.+?)[\*|\z]", re.DOTALL)

This works fine for every section except the last one.

Output when asking for "Reading - Health":

- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett

As you can see, it doesn't match the last line.

I'm using Python 2.7 and findall.

Source

2017-03-01 daleonpz

'\*\*Reading - Health(.*?)(?:\*\*.|$)' – JazZ

You can achieve it with:

import re 

string = """ 
** Hardware [0/1] 
- [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6] 
- [X] Introduction to Networking - Charles Severance 
- [ ] A Tour of C++ - Bjarne Stroustrup 
- [ ] C++ How to Program - Paul Deitel 
- [X] Computer Systems - Randal Bryant 
- [ ] The C programming language - Brian Kernighan 
- [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2 
""" 

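def getitems(section):
    # Match the "** <section> ..." heading line, then capture everything
    # up to (but not including) the next line that starts with "**".
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    items = rx.search(string)
    # search() returns None when the section is not found
    return items.group('block') if items else None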

items = getitems('Reading - Technology') 
print(items)

See it working on ideone.com.

The heart of the code is the (condensed) expression:

^\*{2}.+[\n\r]  # match the beginning of the line, followed by two stars, anything else in between and a newline 
(?P<block>   # open group "block" 
    (?:    # non-capturing group 
     (?!^\*{2}) # a neg. lookahead, making sure no ** follows at the beginning of a line 
     [\s\S]  # any character... 
    )+    # ...at least once 
)     # close group "block"

In the actual code, your search string is inserted after the **. See a demo for Reading - Technology on regex101.com.
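The annotated expression can also be compiled directly with re.VERBOSE, which keeps the comments inside the pattern. The following is only a sketch under a few assumptions: build_pattern and SAMPLE are illustrative names that do not appear in the answer, and the sample text is a shortened copy of the question's data. Because whitespace is ignored in verbose mode, the literal space after the stars is written as [ ].

```python
import re

# Shortened sample in the same shape as the question's org-mode text (illustrative).
SAMPLE = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Convict Conditioning 1 and 2
"""


def build_pattern(section):
    """Hypothetical helper: build the section regex in verbose form."""
    return re.compile(
        r"""
        ^\*{2}[ ]            # two stars at the start of a line, then a space
        """ + re.escape(section) + r"""
        .+[\n\r]             # rest of the heading line and its newline
        (?P<block>           # open group "block"
            (?:
                (?!^\*{2})   # stop as soon as a line starts with two stars
                [\s\S]       # otherwise consume any character, newlines included
            )+
        )
        """,
        re.MULTILINE | re.VERBOSE,
    )


match = build_pattern("Reading - Health").search(SAMPLE)
block = match.group("block") if match else None
```

re.escape also escapes the spaces in the section name, so they survive verbose mode as literal spaces.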

As a follow-up, you can also return only the selected (checked) values, like so:

def getitems(section, selected=None):
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    match = rx.search(string)
    if match is None:
        # section heading not found
        return None
    items = match.group('block')
    if selected:
        # Return only the text of the checked ("- [X]") items, as a list
        rxi = re.compile(r'^- \[X\] (.+)', re.MULTILINE)
        return rxi.findall(items)
    return items

items = getitems('Reading - Health', selected=True) 
print(items)

Source

2017-03-01 22:01:55 Jan

Thanks, it improved the overall code. Also, regex101.com is a great site – daleonpz

@daleonpz: Added a version that returns only the selected values. – Jan

Not sure you need a regex for the whole match. I'd just use a regex to match the ** line, and then return lines until you see the next ** line.

Something like:

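import re

# `head` is the section title; `my_file` is an open file (or any iterable of lines)
pattern = re.compile(r"\*\* " + re.escape(head))

start = False
output = []
for line in my_file:
    if pattern.match(line):
        start = True  # found the requested heading
        continue
    if start:
        if line.startswith("**"):  # the next heading ends the section
            break
        output.append(line)

# now `output` should have the lines you want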

Source

2017-03-01 21:15:46 turbulencetoo

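Regex is great at matching structured data, like a line that always has a particular format. When you have to match a bunch of arbitrary text in between the lines you care about, it gets really complicated to use, which is why I'd usually avoid the approach you're attempting. – turbulencetoo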

At a glance, the 'pattern.match' in my answer could also just be a 'line.startswith("** " + head)' – turbulencetoo
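The regex-free variant mentioned in the comment above could be sketched roughly as follows. The function name get_section_lines and the sample ORG_LINES are made up for illustration; the sketch assumes every heading starts with "** " at column 0 and strips trailing whitespace from the returned lines, which the loop in the answer does not do.

```python
def get_section_lines(lines, head):
    """Return the lines under the heading "** <head>", up to the next "**" heading."""
    prefix = "** " + head
    collecting = False
    output = []
    for line in lines:
        if line.startswith("**"):
            if collecting:
                break  # the next heading ends the requested section
            # Start collecting only if this heading is the requested one
            collecting = line.startswith(prefix)
            continue
        if collecting:
            output.append(line.rstrip())  # drop trailing whitespace and newline
    return output


# Illustrative input; with a real file, pass the open file object instead.
ORG_LINES = [
    "** Hardware [0/1]\n",
    "- [ ] adapt a programmable motor to a tripod to be used for panning\n",
    "** Reading - Health [3/4]\n",
    "- [ ] Patrick McKeown - The Oxygen Advantage\n",
    "- [X] Convict Conditioning 1 and 2\n",
]

health = get_section_lines(ORG_LINES, "Reading - Health")
```

Note that a prefix test matches the first heading that begins with the given text, so passing just "Reading" would select "Reading - Technology" in the question's file.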

If you are sure that the character * doesn't appear in your items, you can use:

re.compile(r"\*\* "+head+r" \[\d+/\d+\]\n([^*]+)\*?")
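Wrapped in a function, the pattern above might be used like this. This is a sketch with assumptions: getitems_no_star and SAMPLE are hypothetical names, re.escape is added so that special characters in the heading are taken literally, and [ \t]* is added before the newline to tolerate trailing spaces after the [n/m] counter.

```python
import re

# Shortened sample in the same shape as the question's org-mode text (illustrative).
SAMPLE = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Convict Conditioning 1 and 2
"""


def getitems_no_star(text, head):
    """Return the block under "** <head> [n/m]", assuming items never contain '*'."""
    pattern = re.compile(
        r"\*\* " + re.escape(head) + r" \[\d+/\d+\][ \t]*\n([^*]+)\*?"
    )
    match = pattern.search(text)
    # [^*]+ stops at the first '*', i.e. at the next heading or at the end of the text
    return match.group(1) if match else None


hardware = getitems_no_star(SAMPLE, "Hardware")
```

The captured block is cut short if any item contains a literal *, which is exactly the limitation stated above.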

Source

2017-03-01 21:26:32

Thanks, it works :) – daleonpz
