How to parse marked-up text for further processing?

See the updated input and output data at Edit-1.

What I am trying to accomplish is turning

 
+ 1
 + 1.1
  + 1.1.1
   - 1.1.1.1
   - 1.1.1.2
 + 1.2
  - 1.2.1
  - 1.2.2
 - 1.3
+ 2
- 3

into a Python data structure such as

[{'1': [{'1.1': {'1.1.1': ['1.1.1.1', '1.1.1.2']}, '1.2': ['1.2.1', '1.2.2']}, '1.3'], '2': {}}, ['3',]]

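I've looked at many different wiki markup languages, Markdown, reStructuredText and so on, but they are all too complicated for me to understand how they work, because they have to cover a large number of tags and syntax (I only need the "list" parts of most of them, converted to Python instead of HTML of course).

I've also looked at tokenizers, lexers and parsers, but again they are much more complicated than I need and than I can understand.

I have no idea where to start and would appreciate any help on this subject. Thanks

Edit-1: Yes, the character at the beginning of the line matters. From the required output, both before and now, it can be seen that * denotes a root node with children, + denotes a node with children, and - denotes an entry with no children (root or otherwise) that is just extra information pertaining to that node. The * is not important and can be interchanged with + (I can get root status in other ways).

Therefore the new requirement is to use only * to denote a node with or without children, while - cannot have children. I've also changed it so that the key is not the text after the *, since that will no doubt change later to an actual title.

For example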

 
* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root

would give

[{'title': '1', 'children': [{'title': '1.1', 'children': []}, {'title': '1.2', 'children': []}]}, {'title': '2', 'children': [], 'notes': ['Note for 1.2', ]}, {'title': '3', 'children': []}, 'Note for root']

Or, if you have another idea for representing the outline in Python, bring it forward.

Source

2009-07-07 Rigsby

Done and done. I edited both of those things. – Rigsby 2009-07-07 07:10:18

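Edit: thanks to the clarification and change in the spec I've edited my code, still using an explicit Node class as an intermediate step for clarity. The logic is to turn the list of lines into a list of nodes, then turn that list of nodes into a tree (by using their indent attribute appropriately), then print that tree in a readable form (this is just a "debug help" step to check that the tree is well constructed, and it can of course be commented out in the final version of the script, which will just as surely take the lines from a file rather than having them hardcoded for debugging!), and finally build the desired Python structure and print it. Here's the code; as we'll see afterwards, the result is almost what the OP specifies, with one exception. But, the code first: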

from __future__ import print_function  # same print() syntax on Python 2 and 3

import sys

class Node(object):
    def __init__(self, title, indent):
        self.title = title
        self.indent = indent
        self.children = []
        self.notes = []
        self.parent = None

    def __repr__(self):
        return 'Node(%s, %s, %r, %s)' % (
            self.indent, self.parent, self.title, self.notes)

    def aspython(self):
        # The 'notes' key is only present when the node has notes.
        result = dict(title=self.title, children=topython(self.children))
        if self.notes:
            result['notes'] = self.notes
        return result

def print_tree(node):
    # Debugging aid: title, then the subtree, then the node's notes.
    print(' ' * node.indent, node.title)
    for subnode in node.children:
        print_tree(subnode)
    for note in node.notes:
        print(' ' * node.indent, 'Note:', note)

def topython(nodelist):
    return [node.aspython() for node in nodelist]

def lines_to_tree(lines):
    # Pass 1: turn lines into a flat list of nodes; a '-' line becomes
    # a note on the most recently seen node.
    nodes = []
    for line in lines:
        indent = len(line) - len(line.lstrip())
        marker, body = line.strip().split(None, 1)
        if marker == '*':
            nodes.append(Node(body, indent))
        elif marker == '-':
            nodes[-1].notes.append(body)
        else:
            print("Invalid marker %r" % marker, file=sys.stderr)

    # Pass 2: link the nodes into a tree under a dummy root, climbing
    # back up until a node with a smaller indent is found.
    tree = Node('', -1)
    curr = tree
    for node in nodes:
        while node.indent <= curr.indent:
            curr = curr.parent
        node.parent = curr
        curr.children.append(node)
        curr = node

    return tree


data = """\
* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root
""".splitlines()

def main():
    tree = lines_to_tree(data)
    print_tree(tree)
    print()
    alist = topython(tree.children)
    print(alist)

if __name__ == '__main__':
    main()

When run, this emits:

 1
  1.1
  1.2
  Note: Note for 1.2
 2
 3
 Note: Note for root

[{'children': [{'children': [], 'title': '1.1'}, {'notes': ['Note for 1.2'], 'children': [], 'title': '1.2'}], 'title': '1'}, {'children': [], 'title': '2'}, {'notes': ['Note for root'], 'children': [], 'title': '3'}]

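Apart from the ordering of keys (which is immaterial, and which dicts did not guarantee before Python 3.7), this is almost as requested, except that here all notes appear as dict entries with a key of notes and a value that is a list of strings (the notes entry is omitted if the list would be empty, roughly as done in the example in the question).

In the current version of the question, how to represent the notes is slightly unclear: one note appears as a stand-alone string, the others as entries whose value is a string (instead of a list of strings, as I'm using). It's not clear what is supposed to imply that the note must appear as a stand-alone string in one case and as a dict entry in all the others, so the scheme I'm using is more regular. And if a note (if any) is a single string rather than a list, would that mean it's an error if more than one note appears for a node? In that regard the scheme I'm using is also more general (it lets a node have any number of notes from 0 up, instead of just 0 or 1 as apparently implied in the question).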
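To make the difference concrete, here is a minimal sketch of the list-of-strings scheme on plain dicts. The helper names `add_note` and `get_notes` are made up for this illustration and do not appear in the answer's code.

```python
def add_note(node, text):
    """Append a note, creating the 'notes' list on first use."""
    node.setdefault('notes', []).append(text)

def get_notes(node):
    """Return the node's notes, or an empty list when the key is absent."""
    return node.get('notes', [])

plain = {'title': '1.1', 'children': []}
noted = {'title': '1.2', 'children': []}
add_note(noted, 'Note for 1.2')
add_note(noted, 'A second note for 1.2')  # a second note needs no special case
```

A node without notes simply has no notes key, and a reader that goes through `get_notes` never has to distinguish between zero, one or several notes.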

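Having written this much code (the pre-edit answer was about as long, and helped clarify and change the specs) to provide what I hope is 99% of the desired solution, I hope this satisfies the original poster, since the last few tweaks to the code and/or the specs to make them match each other should be easy for him to do!

Source

2009-07-07 05:16:12

I've updated my post to try to clarify things. I now show that the * or - matters, and I fixed the first output ({'1.2.3'} should have been just a string and not a dictionary as I had it). – Rigsby 2009-07-07 07:09:12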

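Since you're dealing with an outline, you can simplify things by using a stack. Basically, you want to create a stack of dicts that corresponds to the depth of the outline. When you parse a new line and the depth of the outline has increased, you push a new dict onto the stack, referenced by the dict that was previously at the top of the stack. When you parse a line with a lower depth, you pop the stack to get back to the parent. And when you encounter a line with the same depth, you add it to the dict at the top of the stack.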
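One possible reading of that description, as a rough sketch rather than a definitive implementation: it assumes the `*` and `-` markers from Edit-1, and the function name `parse_outline`, the `sample` text and the choice to keep the indent width alongside each dict on the stack are all assumptions made for this illustration. Storing the width means any consistent indentation works, not just one space per level.

```python
def parse_outline(text):
    """Parse an indented '*' / '-' outline into nested dicts (sketch)."""
    root = {'title': None, 'children': []}
    # Each stack entry pairs an indent width with the dict opened at that
    # width; the root uses -1 so it is never popped.
    stack = [(-1, root)]
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        indent = len(line) - len(line.lstrip())
        marker, _, body = line.strip().partition(' ')
        if marker not in ('*', '-'):
            raise ValueError('Invalid marker %r' % marker)
        # Same or lower depth: pop until the top of the stack is the parent.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if marker == '*':
            node = {'title': body, 'children': []}
            parent['children'].append(node)
            stack.append((indent, node))  # deeper lines attach to this node
        else:
            # A note belongs to the nearest enclosing node (or to the root).
            parent.setdefault('notes', []).append(body)
    return root

sample = "\n".join([
    "* 1",
    " * 1.1",
    " * 1.2",
    "  - Note for 1.2",
    "* 2",
    "* 3",
    "- Note for root",
])
root = parse_outline(sample)
```

In this sketch an unindented note ends up on the root dict itself. That is a design choice of the sketch; attaching it to the most recent node instead would be an equally defensible reading of the spec.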

Source

2009-07-07 04:59:28 ealdent

To get really fancy, you could use the contents of the item and an re match to make sure the next item starts with it plus a dot plus number(s). – Kurt 2009-07-07 05:03:30
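A minimal sketch of that check, assuming the items are dotted numbers as in the first example of the question; the function name `is_direct_child` is hypothetical.

```python
import re

def is_direct_child(parent_label, child_label):
    """True if child_label is parent_label plus a dot plus digits."""
    # re.escape keeps the dots in the parent label literal.
    pattern = re.escape(parent_label) + r'\.\d+'
    return re.fullmatch(pattern, child_label) is not None
```

This only makes sense while the titles are numbering schemes such as 1.1.1; once they become real titles, as Edit-1 anticipates, indentation is the only reliable signal.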

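Stacks are an incredibly useful data structure when parsing trees. You just keep the path from the root to the last added node on the stack at all times, so that you can find the correct parent from the length of the indent. Something like this should work for parsing your last example: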

import re

# Leading spaces (one per outline level), then the marker, then the content.
line_tokens = re.compile(r'( *)(\*|-) (.*)')

def parse_tree(data):
    stack = [{'title': 'Root node', 'children': []}]
    for line in data.splitlines():
        match = line_tokens.match(line)
        if match is None:
            continue  # Skip blank or malformed lines
        indent, symbol, content = match.groups()
        while len(indent) + 1 < len(stack):
            stack.pop()  # Remove everything up to current parent
        if symbol == '-':
            stack[-1].setdefault('notes', []).append(content)
        elif symbol == '*':
            node = {'title': content, 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node)  # Add as the current deepest node
    return stack[0]

Source

2009-07-07 09:47:20

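The syntax you're using is very similar to YAML. It has some differences, but it's quite easy to learn; its main focus is on being human-readable (and writable).

Take a look at the YAML website. There are some Python bindings, documentation and other material there.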

http://www.yaml.org

Source

2009-07-21 13:45:08
