2011-06-17 57 views
12

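Parsing org-mode files with Javascript

For a while now I have wanted to write a parser for org-mode in Javascript. Parsing the outline, for instance, was no problem (I did that in a couple of minutes), but parsing the actual content is much harder, and I am having trouble with overlapping lists.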

* This is a heading 
    P1 Start a paragraph here but since it is the first indentation level 
the paragraph may have a lower indentation on the next line 
    or a greater one for that matter. 

    + LI1.1 I am beginning a list here 
    + LI1.2 Here begins another list item 
    which continues here 
     and also here 
    P2 but is broken here (this line becomes a paragraph 
    outside of the first list). 
    + LI2.1 P1 Second list item. 
    - LI2.1.1 Inner list with a simple item 
    - LI2.1.2 P1 and with an item containing several paragraphs. 
     Here is the second line in the item, and now 

     LI2.1.2 P2 I begin a new paragraph still in the same item. 
     The indentation can be only higher 
    LI2.1 P2 but if the indentation is lower, it breaks the item, 
    (and the whole list), and this is a paragraph in the LI2.1 
    list item 

    - LI 2.2.1 You get the picture 
    P3 Just plain text outside of the list. 

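(In the example above, the PX and LIX.Y markers are only there to show explicitly where a new block starts; they would not be present in the actual document. P stands for paragraph and LI for list item. In the HTML world, PX would be the start of a <p> tag. The numbering is just there to help keep track of the nesting and of the changes of list.)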

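I would like to know how to go about parsing this kind of hierarchical blocks with significant whitespace. Apparently I can parse line by line, without any backtracking or anything, so it should be pretty simple, but for some reason I have not managed to do it. I tried to get inspiration from Markdown parsers, or anything else that should have similar overlapping features, but (for the ones I looked at) they seem very hacky and full of regular expressions. I hope I can write something cleaner: the org-mode "syntax" is pretty huge when you think about it, it will grow bit by bit, and I would like the whole thing to stay maintainable and to allow new features to be plugged in easily.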

Could anyone with experience parsing this kind of thing give me some pointers?

+0

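AFAIK, there is no easy way to parse this. These wiki-like formats are a pain in the @$$ to handle. Are you going to write the parser by hand, or write/translate a grammar and let a parser generator create the parser for you? –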

+0

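Well, your comment assumes I am hand-coding it, which is what I have started doing, without finding the right approach. Maybe writing a grammar would be easier, but I do not know how to handle significant whitespace. I am new to parsing, so I have hit a wall with everything I have tried so far. :) – glmxndr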

+0

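Is there a formal grammar written out anywhere? The problem, it seems to me, is that you do not have a token that ends a statement. Several languages use whitespace formatting instead of semicolons and braces, but I cannot think of any that lets the lines after the first one take any degree of indentation, as in your P1 example. – Samsdram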

Answers

7

As I said in my comment, parsing this is going to be a pain, as it is with many wiki-like languages.

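If you would rather write a grammar and let a parser generator create the parser for you than write the parser by hand, there are several options. To name a few: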

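I know ANTLR can do this, but it will not be trivial, and on top of that you will need to get to grips with the tool (which takes a bit of time!). I have not spent much time with the other two tools, but I doubt they are up to the job with such a nasty language.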

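Going for a hand-written parser will give you a quick start, but debugging, enhancing or rewriting it will be hard. Writing a grammar and letting a parser generator create the parser makes debugging, enhancing and rewriting the parser easier (you do it through the grammar), but you will need to spend (quite) some time learning to use the tool.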

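Of course, a hand-written parser, if written properly, will (most likely) be faster than a generated one. But the difference will probably only be noticeable with large amounts of source.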

Sorry, I do not have a general strategy for how to handle this with a hand-written parser.

Good luck!

+0

+1 for jison. You just need to use a good old lex/yacc port for the parsing. – Raynos

+0

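@Raynos, OK, but that does not quite answer how to handle the significant whitespace, especially when LI items have the peculiarity that their first line and their subsequent lines are indented differently (see the example in the question). – glmxndr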

+0

@subtenante [Buy the dragon book](http://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools). Then read it. Then solve your problem with your amazing compiler knowledge. – Raynos
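
On the significant-whitespace question raised in the comments above: one common way to make indentation digestible for a lex/yacc-style grammar is to have the lexer turn changes of indentation into explicit INDENT and DEDENT tokens, so that the grammar itself never looks at spaces. The sketch below only illustrates that idea. The function name `tokenizeIndentation` and the token shapes are made up for this example; they are not part of jison, ANTLR or any other tool mentioned here.

```javascript
// Hypothetical helper (not from any library): converts raw lines into a flat
// token stream in which indentation changes appear as INDENT / DEDENT tokens.
function tokenizeIndentation(text) {
  var lines = text.split(/\r?\n/);
  var stack = [0];   // indentation widths of the currently open blocks
  var tokens = [];

  for (var i = 0; i < lines.length; i++) {
    var content = lines[i].replace(/^ +/, "");    // strip leading spaces only
    var width = lines[i].length - content.length; // number of stripped spaces

    if (content === "") {
      // Blank lines are reported but never open or close a block.
      tokens.push({ type: "BLANK" });
      continue;
    }
    // Close every block that is indented deeper than the current line.
    while (width < stack[stack.length - 1]) {
      stack.pop();
      tokens.push({ type: "DEDENT" });
    }
    // Open a new block if the line is deeper than the innermost open block.
    // (A partial dedent to an unseen width therefore yields DEDENT + INDENT.)
    if (width > stack[stack.length - 1]) {
      stack.push(width);
      tokens.push({ type: "INDENT" });
    }
    tokens.push({ type: "LINE", text: content });
  }
  // Close whatever is still open at the end of the input.
  while (stack.length > 1) {
    stack.pop();
    tokens.push({ type: "DEDENT" });
  }
  return tokens;
}
```

Bullet handling is deliberately left out. A grammar built on top of such tokens would still need its own rule for list items whose continuation lines are indented differently from their first line, which is exactly the difficulty pointed out above.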

10

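I like parsers and compiler theory, so I wrote a small parser (by hand) that is able to parse your example snippet into an XML DOM Document object. It could be modified to produce other kinds of tree structures, such as a custom AST (abstract syntax tree).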

I tried to keep the code easy to read, so that you can see how such a parser works.

Ask me if you need more explanation, or if you would like me to modify it a bit.

With your example snippet as input, the statement result = new OrgModeParser().parse(input); result.xml returns:

<org-mode-document indentLevel="-1"> 
    <section indentLevel="0"> 
     <header indentLevel="0">This is a heading</header> 
      <paragraph indentLevel="1">P1 Start a paragraph here but since it is the first indentation level the paragraph may have a lower indentation on the next line or a greater one for that matter.</paragraph> 
      <list indentLevel="1"> 
       <list-item indentLevel="1"> 
        <paragraph indentLevel="2">LI1.1 I am beginning a list here</paragraph> 
       </list-item> 
       <list-item indentLevel="1"> 
        <paragraph indentLevel="2">LI1.2 Here begins another list item which continues here and also here</paragraph> 
       </list-item> 
      </list> 
     <paragraph indentLevel="1">P2 but is broken here (this line becomes a paragraph outside of the first list).</paragraph> 
     <list indentLevel="1"> 
      <list-item indentLevel="1"> 
       <paragraph indentLevel="2">LI2.1 P1 Second list item.</paragraph> 
       <list indentLevel="2"> 
        <list-item indentLevel="2"> 
         <paragraph indentLevel="3">LI2.1.1 Inner list with a simple item</paragraph> 
        </list-item> 
        <list-item indentLevel="2"> 
         <paragraph indentLevel="3">LI2.1.2 P1 and with an item containing several paragraphs. Here is the second line in the item, and now</paragraph> 
         <paragraph indentLevel="3">LI2.1.2 P2 I begin a new paragraph still in the same item. The indentation can be only higher</paragraph> 
        </list-item> 
       </list> 
       <paragraph indentLevel="2">LI2.1 P2 but if the indentation is lower, it breaks the item, (and the whole list), and this is a paragraph in the LI2.1 list item</paragraph> 
       <list indentLevel="2"> 
        <list-item indentLevel="2"> 
         <paragraph indentLevel="3">LI2.2.1 You get the picture</paragraph> 
        </list-item> 
       </list> 
      </list-item> 
     </list> 
     <paragraph indentLevel="1">P3 Just plain text outside of the list.</paragraph> 
    </section> 
</org-mode-document> 

Code:

/* 
* File: orgmodparser.js 
* Basic usage: var object = new OrgModeParser().parse(input); 
* Works on: JScript and JScript.Net. 
* - For other JavaScript platforms, just replace or override the .createRoot() method 
*/ 

var OrgModeParser = function (options) { 
    // Copy every option (e.g. INDENT_WIDTH or createRoot) onto the instance 
    if (typeof options == "object") { 
     for (var i in options) { 
      this[i] = options[i]; 
     } 
    } 
}; 

OrgModeParser.prototype = { 

    "INDENT_WIDTH" : 2, // Two spaces 
    "LINE_SEPARATOR" : "\r\n", 

    /* 
    * Each line in the input will be matched against this regexp. 
    * Only spaces are allowed as indentation characters. 
    * The symbols '*', '+' and '-' will be recognized, but only if they are followed by at least one space. 
    * Add other symbols in this regexp if you want the parser to recognize them. 
    */ 
    "re" : /^( *)([\+\-\*] +)?(.*)/, 

    // This function must return a valid XML DOM document object 
    createRoot : function() { 
     var err, progIDs = ["Msxml2.DOMDocument.6.0", "Msxml2.DOMDocument.5.0", "Msxml2.DOMDocument.4.0", "Msxml2.DOMDocument.3.0", "Msxml2.DOMDocument.2.0", "Msxml2.DOMDocument.1.0", "Msxml2.DOMDocument"]; 
     for (var i = 0; i < progIDs.length; i++) { 
      try { 
       return new ActiveXObject(progIDs[i]); 
      } 
      catch (err) { 
      } 
     } 
     alert("Org-mode parser - Error - Failed to instantiate root object"); 
     return null; 
    }, 

    parse : function (text) { 

     function createNode (tagName, text) { 
      var node = root.createElement(tagName); 
      node.setAttribute("indentLevel", level); 
      if (text) { 
       var textNode = root.createTextNode(text); 
       node.appendChild(textNode); 
      } 
      return node; 
     } 

     function getContainer() { 
      if (lastNode.tagName == "section") { return lastNode; } 
      var anc = lastNode.parentNode; 
      while (anc) { 
       if (modifier == "+" || modifier == "-") { 
        if (anc.getAttribute("indentLevel") == level && anc.tagName == "list") { return anc; } 
       } 
       if (anc.getAttribute("indentLevel") < level && anc.tagName != "paragraph") { return anc; } 
       anc = anc.parentNode; 
      } 
      alert("Org-mode parser - Internal error at line: "+i);return null; 
     } 

     if (typeof text != "string") { alert("Org-mode - Type error - Input must be of type 'string'"); return null; } 

     var body; 
     var content;  // The text of the current line, without its indentation and modifier 
     var lastNode; // The node being processed 
     var indent;  // The indentation of the current line 
     var isAfterDubbleLineBreak; // Indicates if the current line follows a double line break 
     var line;  // The current line being processed 
     var level;  // The current indentation level; given by indent.length/this.INDENT_WIDTH. Not to confuse with the nesting level 
     var lines;  // Array. Empty lines are included. 
     var match; 
     var modifier; // This can be "*", "+", "-" or "" 
     var root; 

     isAfterDubbleLineBreak = false; 
     level = -1;  // Indentation level is -1 initially; it will be 0 for the first "*"-block 
     lines = text.split(this.LINE_SEPARATOR); 
     root = this.createRoot(); 
     body = root.appendChild(createNode("org-mode-document")); 
     lastNode = body; 

     for (var i = 0; i < lines.length; i++) { 
      line = lines[i]; 
      match = line.match(this.re); 
      if (match === null) { alert("org-mode parse error at line: " + i); return null; } 
      indent = match[1]; 
      level = indent.length/this.INDENT_WIDTH; 
      // match[2] is undefined in standard JavaScript when the line has no modifier, so normalize it to "" 
      modifier = match[2] ? match[2].charAt(0) : ""; 
      content = match[3]; 

      // These conditions tell the parser what to do when encountering a line with a given modifier 
      if (content === "") { dubbleLineBreak(); continue; } 
      else if (modifier == "*") { star(); } 
      else if (modifier == "+") { plus(); } 
      else if (modifier == "-") { minus(); } 
      else { noModifier(); } 
      isAfterDubbleLineBreak = false; 
     } 
     return root; 


     function star() { 
      // The '*' modifier is not allowed on an indented line 
      if (indent) { alert("Org-mode parse error: unexpected '*' symbol at line " + i); return null; } 
      lastNode = body.appendChild(createNode("section")); 
      // The section remains the current node 
      lastNode.appendChild(createNode("header", content)); 
     } 

     function plus() { 
      var container = getContainer(); 
      var tn = container.tagName; 
      if (tn == "section" || tn == "list-item") { 
       lastNode = container.appendChild(createNode("list")); 
       lastNode = lastNode.appendChild(createNode("list-item")); 
       lastNode = lastNode.appendChild(createNode("paragraph", content)); 
      } else if (tn == "list") { 
       lastNode = container.appendChild(createNode("list-item")); 
       lastNode = lastNode.appendChild(createNode("paragraph", content)); 
      } 
      else alert("Org-mode parser - Internal error - Bad container tag name: " + tn); 
      lastNode.setAttribute("indentLevel", Number(lastNode.getAttribute("indentLevel")) + 1); 
     } 

     function minus() { plus(); } 

     function noModifier() { 
      if (lastNode.tagName == "paragraph" && !isAfterDubbleLineBreak && (lastNode.getAttribute("indentLevel") == 1 || level >= lastNode.getAttribute("indentLevel"))) { 
       lastNode.childNodes[0].appendData(" " + content); 
      } else { 
       var container = getContainer(); 
       lastNode = container.appendChild(createNode("paragraph", content)); 
      } 
     } 

     function dubbleLineBreak() { 
      while (lines[i+1] && /^\s*$/.test(lines[i+1])) { i++; } 
      isAfterDubbleLineBreak = true; 
     } 

    } 
}; 
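
The header comment says that other JavaScript platforms only need to replace or override .createRoot(). As a rough illustration of what such a replacement has to provide, here is a small stand-in document written in plain JavaScript. `MiniNode` and `MiniDocument` are hypothetical names invented for this sketch; they implement only the members that parse() actually touches (createElement, createTextNode, appendChild, setAttribute, getAttribute, appendData, tagName, parentNode, childNodes) and are in no way a complete DOM.

```javascript
// Hypothetical stand-in for the XML DOM document returned by createRoot().
// Only the members used by parse() are implemented.
function MiniNode(tagName) {
  this.tagName = tagName;   // undefined for text nodes
  this.parentNode = null;
  this.childNodes = [];
  this.attributes = {};
  this.data = "";           // text content, used by text nodes only
}

MiniNode.prototype.appendChild = function (child) {
  child.parentNode = this;
  this.childNodes.push(child);
  return child;             // like the DOM, hand back the appended node
};

MiniNode.prototype.setAttribute = function (name, value) {
  this.attributes[name] = String(value);   // DOM attribute values are strings
};

MiniNode.prototype.getAttribute = function (name) {
  var has = Object.prototype.hasOwnProperty.call(this.attributes, name);
  return has ? this.attributes[name] : null;
};

MiniNode.prototype.appendData = function (text) {
  this.data += text;        // mirrors appendData() on DOM text nodes
};

function MiniDocument() {
  MiniNode.call(this, "#document");
}
MiniDocument.prototype = Object.create(MiniNode.prototype);

MiniDocument.prototype.createElement = function (tagName) {
  return new MiniNode(tagName);
};

MiniDocument.prototype.createTextNode = function (text) {
  var node = new MiniNode(undefined);
  node.data = text;
  return node;
};
```

Since the constructor copies its options onto the instance, something like new OrgModeParser({ createRoot: function () { return new MiniDocument(); } }) should be enough to plug it in. Two caveats: the .xml property used above is specific to MSXML, so serializing the result would need its own function, and the error paths of the parser call alert(), which would also need a replacement where it is not defined.
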
+0

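Nice, this is a much better starting point since it (almost) works the way I expect :)), but I am very reluctant to rely on the DOM, firstly because I expect to use the parser outside the browser, and secondly because some other features of org-mode cannot be plugged into the DOM model as it stands. (I say almost as I expect, because LI2.1.2 should have two paragraphs: the double newline breaks the paragraph.) – glmxndr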

+0

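Thanks for the feedback. I have fixed the double-newline bug for paragraphs and changed the output type to an XML object, so that it works without a browser. – Luc125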

+0

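Thanks. You are going to find me annoying, but relying on Windows ActiveX is not much better than relying on a browser... :) But I get the gist of your suggestion. – glmxndr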
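
To illustrate the point about not depending on the DOM or on ActiveX at all, the same line-by-line idea can build plain JavaScript objects instead. The following is a minimal sketch, not Luc125's parser: `parseOrgBlocks` is a hypothetical name, and it covers only headings, +/- lists and paragraphs under deliberately simplified rules (a line stays inside a list item only while it is indented deeper than that item's bullet, and a blank line ends the current paragraph). Real org-mode has many more cases.

```javascript
// Hypothetical, simplified block parser: builds plain objects, no DOM needed.
function parseOrgBlocks(text) {
  var doc = { type: "document", children: [] };
  var stack = [{ node: doc, indent: -1 }];   // open containers, innermost last
  var afterBlank = false;                    // true right after a blank line
  var lines = text.split(/\r?\n/);

  function top() { return stack[stack.length - 1]; }

  // Create a container node inside the innermost container and make it current.
  function open(type, indent) {
    var node = { type: type, children: [] };
    top().node.children.push(node);
    stack.push({ node: node, indent: indent });
    return node;
  }

  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    var heading = /^\*+ +(.*)$/.exec(line);      // "* Title" in column 0
    var bullet = /^( *)[+-] +(.*)$/.exec(line);  // "+ item" or "- item"
    var content = line.replace(/^ +/, "");
    var indent = line.length - content.length;

    if (content === "") { afterBlank = true; continue; }

    if (heading) {
      stack.length = 1;                          // close everything but the document
      open("section", -1).title = heading[1];
    } else if (bullet) {
      var column = bullet[1].length;
      // Close deeper lists and items, plus a sibling item at the same column.
      while (top().indent > column ||
             (top().indent === column && top().node.type === "item")) {
        stack.pop();
      }
      if (top().node.type !== "list") { open("list", column); }
      open("item", column);
      top().node.children.push({ type: "paragraph", text: bullet[2] });
    } else {
      // Plain text: leave every list or item whose bullet is not further left.
      while (top().indent >= indent) { stack.pop(); }
      var siblings = top().node.children;
      var last = siblings[siblings.length - 1];
      if (!afterBlank && last && last.type === "paragraph") {
        last.text += " " + content;              // continuation of the paragraph
      } else {
        siblings.push({ type: "paragraph", text: content });
      }
    }
    afterBlank = false;
  }
  return doc;
}
```

Because sections and the document carry an indent of -1, text under a heading may be indented less or more than its first line without leaving the section, which matches the P1 case in the question. The result is an ordinary object tree, so JSON.stringify(parseOrgBlocks(input), null, 2) is enough to inspect it, and new block types can be added as further branches in the loop.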

2

There is a Javascript org-mode parser available here.