2017-03-01 23 views

Python regex - get items from an org-mode file

I have the following org-mode syntax:

** Hardware [0/1] 
- [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6] 
- [X] Introduction to Networking - Charles Severance 
- [ ] A Tour of C++ - Bjarne Stroustrup 
- [ ] C++ How to Program - Paul Deitel 
- [X] Computer Systems - Randal Bryant 
- [ ] The C programming language - Brian Kernighan 
- [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2 

and I want to extract the items of a section, for example:

getitems "Hardware" 

I should get:

- [ ] adapt a programmable motor to a tripod to be used for panning 

If I ask for "Reading - Health", I should get:

- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2 

I am currently using the following pattern:

pattern = re.compile("\*\* "+ head + " (.+?)\*?$", re.DOTALL) 

The output when asking for "Reading - Technology" is:

- [X] Introduction to Networking - Charles Severance
- [ ] A Tour of C++ - Bjarne Stroustrup
- [ ] C++ How to Program - Paul Deitel
- [X] Computer Systems - Randal Bryant
- [ ] The C programming language - Brian Kernighan
- [ ] Beginning Linux Programming -Matthew and Stones
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2

I also tried:

pattern = re.compile("\*\* "+ head + " (.+?)[\*|\z]", re.DOTALL) 

This works fine for every section except the last one.

The output when asking for "Reading - Health" is:

- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 

As you can see, it does not match the last line.

I am using Python 2.7 and findall.

'\*\* Reading - Health (.*?)(?:\*\*.|$)' – JazZ

Answers

You can achieve it with:

import re 

string = """ 
** Hardware [0/1] 
- [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6] 
- [X] Introduction to Networking - Charles Severance 
- [ ] A Tour of C++ - Bjarne Stroustrup 
- [ ] C++ How to Program - Paul Deitel 
- [X] Computer Systems - Randal Bryant 
- [ ] The C programming language - Brian Kernighan 
- [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2 
""" 

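def getitems(section):
    # match the "** <section> ..." heading, then capture everything up to the next "**" line
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    match = rx.search(string)
    # search() returns None when the section heading is not found
    return match.group('block') if match else None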

items = getitems('Reading - Technology') 
print(items) 

See it working on ideone.com.


The heart of the code is the (condensed) expression:

^\*{2}.+[\n\r]  # match the beginning of the line, followed by two stars, anything else in between and a newline 
(?P<block>   # open group "block" 
    (?:    # non-capturing group 
     (?!^\*{2}) # a neg. lookahead, making sure no ** follows at the beginning of a line 
     [\s\S]  # any character... 
    )+    # ...at least once 
)     # close group "block" 

In the actual code, your search string is inserted after the **. See a demo for Reading - Technology on regex101.com.
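To make the behaviour concrete, here is a minimal sketch of the same expression wrapped in a function that takes the text as an argument and returns the items as a list. The names `get_block_items` and `ORG_TEXT` are made up for this illustration, and the sample text is a shortened copy of the data from the question. It assumes every heading is followed by at least one character (such as the `[3/4]` counter), because the expression uses `.+` after the section name.

```python
import re

# Shortened sample data (assumption: same layout as in the question)
ORG_TEXT = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Convict Conditioning 1 and 2
"""

def get_block_items(text, section):
    """Return the item lines under `** section`, or [] if the heading is absent."""
    rx = re.compile(
        r'^\*{2} ' + re.escape(section) + r'.+[\n\r]'
        r'(?P<block>(?:(?!^\*{2})[\s\S])+)',
        re.MULTILINE,
    )
    match = rx.search(text)
    if match is None:
        return []
    # split the captured block into non-empty, stripped lines
    return [line.strip() for line in match.group('block').splitlines() if line.strip()]

print(get_block_items(ORG_TEXT, 'Reading - Health'))
```

Because the negative lookahead only stops at a line that starts with `**`, the last section runs to the end of the text, which is the case the patterns in the question missed.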


As a follow-up, you can also return only the selected values, like so:

def getitems(section, selected=None):
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    match = rx.search(string)
    if match is None:
        # section heading not found
        return None
    items = match.group('block')
    if selected:
        # keep only the checked entries; leading indentation is optional
        rxi = re.compile(r'^[ \t]*- \[X\] (.+)', re.MULTILINE)
        return rxi.findall(items)
    return items

items = getitems('Reading - Health', selected=True) 
print(items) 
Thanks, it improved the overall code. Also, regex101.com is a great site. – daleonpz

@daleonpz: Added a version that returns only the selected values. – Jan

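I'm not sure you need a regex for the whole match. I would just use a regex to match the ** line, then return lines until the next ** line is seen.

Something like: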

import re

# `head` is the section name; `my_file` is any iterable of lines (e.g. an open file)
pattern = re.compile(r"\*\* " + re.escape(head))

start = False
output = []
for line in my_file:
    if pattern.match(line):
        start = True
        continue
    elif start and line.startswith("**"):  # the next heading ends the section
        break

    if start:
        output.append(line)

# now `output` should have the lines you want
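For reference, here is a self-contained variant of the loop wrapped in a function. It is only a sketch: the name `items_under` and the sample data are invented for this example, and it uses `str.startswith` in place of a compiled pattern. It stops at a `**` line only after the requested heading has been seen, so headings that come before the requested one are skipped instead of ending the loop.

```python
def items_under(lines, head):
    """Collect the lines between the `** head` heading and the next `**` heading."""
    collecting = False
    output = []
    for line in lines:
        if line.startswith("** " + head):
            collecting = True      # found the requested heading
            continue
        if collecting:
            if line.startswith("**"):
                break              # next heading: the section is over
            output.append(line.rstrip("\n"))
    return output

# Sample data (assumption: same layout as in the question)
sample = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
"""

print(items_under(sample.splitlines(True), "Reading - Health"))
```

Any iterable of lines works as the first argument, so an open file object can be passed directly.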
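Regexes are great at matching structured data, like a line that always has a specific format. When you have to match a bunch of arbitrary text between the lines you care about, they get very complicated to use, which is why I would usually avoid the approach you are attempting. – turbulencetoo

Looking at it again, the 'pattern.match' in my answer could also just be a 'line.startswith("** " + head)'. – turbulencetoo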

If you are sure that the character * does not appear in your items, you can use:

re.compile(r"\*\* "+head+r" \[\d+/\d+\]\n([^*]+)\*?") 
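A short sketch of how this pattern behaves with findall, including the limitation mentioned above. The function name `items_without_stars` and both sample texts are made up for this illustration. The `re.escape` call is an addition not present in the one-liner, and the pattern assumes there is no trailing whitespace after the `[n/m]` counter.

```python
import re

def items_without_stars(text, head):
    # [^*]+ consumes everything up to the next "*" or the end of the text
    pattern = re.compile(r"\*\* " + re.escape(head) + r" \[\d+/\d+\]\n([^*]+)\*?")
    return pattern.findall(text)

# Sample data (assumption: same layout as in the question)
text = (
    "** Hardware [0/1]\n"
    "- [ ] adapt a programmable motor\n"
    "** Reading - Health [3/4]\n"
    "- [X] Supple Leopard - Kelly Starrett\n"
    "- [X] Convict Conditioning 1 and 2\n"
)
print(items_without_stars(text, "Reading - Health"))

# Limitation: a "*" inside an item cuts the block short
starred = "** Math [0/2]\n- [ ] C*-algebras\n- [ ] Topology\n"
print(items_without_stars(starred, "Math"))
```

The second call returns only the text before the first `*`, which is why this approach is safe only when items never contain that character.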
Thanks, it works :) – daleonpz