How to parse BibTeX with JavaScript and regular expressions

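I am trying to use regular expressions in JavaScript to parse BibTeX files that contain complex entries, but I can't seem to find a proper solution. In the example below, bj is an array containing the fields of a bibliography entry. I had to write a fairly long regular expression to account for values that span multiple lines, missing curly braces ({}), and syntactically wrong commas (for example, the last field should not end with a comma, but some TeX editors don't complain).

This is the entry I use to test my regular expression: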

@inproceedings{Carrel2005, 
    title  = {{Algorithm} for near-optimal autonomous resource management}, 
    author  = {Carrel, Ândrew and Palmer, Phil}, 
    notes  = nonote , 
    booktitle = {8th International Symposium on Artificial {Intelligence, 
       Robotics}, and Automation in Space}, 
    year  = {2005} 
    blahblah = error, 
}

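As you can see, some values are split across two lines and can contain inner curly braces. This is the regular expression I have been trying to improve: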

var txt = "@inproceedings{Carrel2005, \n" + 
      " title  = {{Algorithm} for near-optimal autonomous resource management}, \n" + 
      " author  = {Carrel, Ândrew and Palmer, Phil}, \n" + 
      " notes  = nonote ,\n" + 
      " booktitle = {8th International Symposium on Artificial Intelligence, \n" + 
      "     Robotics and Automation in Space}, \n" + 
      " year  = {2005} \n" + 
      " blahblah = error,\n}"; 

var bj = txt.match(/\w*[\t ]*=[\t ]*(\{[\u0020-\u0080\u00A1-\u00FF\u0300-\u036F\t\r\n]*?}|[a-zA-Z0-9]+)[\t ]*(,(?!\s*}))?/g);

Explanation:

\w*    A word for the field name. 
[\t ]*=[\t ]*  Any number of spaces or tabs after and before the equal sign. 
(    Start of group 1. 
    \{    Option 11: starts by an opening curly brace. 
    [    Start of character class AAA. 
    unicode-set Letters (basic Latin plus some extensions) 
    \t\r\n  ... or whitespace. 
    ]*?    End of character class AAA (with LAZY repetition) 
|     End of option 11, start of option 12: 
    [a-zA-Z0-9]+ One or more characters (no underscore or whitespace allowed). 
)     End of option 12 and group 1. 
[\t ]*   Any number of tabs or spaces. 
(    Start of group 2: 
    ,    A literal comma 
    (?!\s*})  ...if it is not followed by whitespace and closing curly braces. 
)?    End of group 2. ? denotes it is optional.

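I have not been able to match fields that start with multiple curly braces (such as {{Algorithm} for near...), nor to correctly match values that contain the sequence }, somewhere inside them.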
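A minimal reproduction of the first problem, assuming only the title line is fed to the same pattern. The `g` flag is dropped here so that `exec` exposes the capture groups, and `fieldPattern`, `titleLine` and `captured` are names chosen just for this sketch. The lazy `*?` stops at the first `}`, so the captured value ends right after `{{Algorithm}`:

```javascript
// Same pattern as above, without the g flag so exec() returns capture groups.
var fieldPattern = /\w*[\t ]*=[\t ]*(\{[\u0020-\u0080\u00A1-\u00FF\u0300-\u036F\t\r\n]*?}|[a-zA-Z0-9]+)[\t ]*(,(?!\s*}))?/;

// Only the title field of the test entry.
var titleLine = ' title  = {{Algorithm} for near-optimal autonomous resource management}, ';

// Group 1 is the field value as the pattern sees it.
var captured = fieldPattern.exec(titleLine)[1];
console.log(captured); // {{Algorithm}
```

The character class itself accepts `{` and `}`, so nothing in the pattern distinguishes an inner closing brace from the one that ends the value.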

Source

2015-12-11 Carles Araguz

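Trying to write a *parser* with *RegEx* always reminds me of this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Why don't you write a proper parser? – Marc

Reliably matching curly braces '{}' with a regular expression is basically impossible. The problem is that there is no way to store how many opening braces you have seen, so there is no way to know when you are done. You will need to write (or use) a proper parser. That probably won't be too difficult with the help of regular expressions. –

As I mentioned in the comments, arbitrarily nested braces cannot be matched this way, because that requires some state to store how many you have seen. You need a parser; once you add that state, it looks something like this: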

var txt = "@inproceedings{Carrel2005, \n" + 
    " title  = {{Algorithm} for near-optimal autonomous resource management}, \n" + 
    " author  = {Carrel, Ândrew and Palmer, Phil}, \n" + 
    " notes  = nonote ,\n" + 
    " booktitle = {8th International Symposium on Artificial Intelligence, \n" + 
    "     Robotics and Automation in Space}, \n" + 
    " year  = {2005} \n" + 
    " blahblah = error,\n}"; 


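// Parse a single "name = value" field at the start of text.
// Returns the field name, the raw value (delimiters included) and the
// number of characters consumed.
function parseBibTexLine (text) {
    var m = text.match(/^\s*(\S+)\s*=\s*/);
    if (!m) {
        console.log('line: "' + text + '"');
        throw new Error('Unrecognised line format');
    }
    var name = m[1];
    var search = text.slice(m[0].length);
    var re = /[\n\r,{}]/g;
    var braceCount = 0;
    var length = m[0].length;
    // Walk from delimiter to delimiter, tracking the brace depth, and stop
    // at the first delimiter found while the depth is back at zero.
    do {
        m = re.exec(search);
        if (!m) {
            // No delimiter left: the value never ends or a brace is unclosed.
            throw new Error('Unterminated field value');
        }
        if (m[0] === '{') {
            braceCount++;
        } else if (m[0] === '}') {
            if (braceCount === 0) {
                throw new Error('Unexpected closing brace: "}"');
            }
            braceCount--;
        }
    } while (braceCount > 0);
    return {
        field: name,
        value: search.slice(0, re.lastIndex),
        // re.lastIndex already points past the delimiter that ended the value;
        // adding m[0].length also skips the character after it (the comma
        // that follows a closing brace in the sample input).
        length: length + re.lastIndex + m[0].length
    };
}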

// Parse one complete entry: the "@type{key," header followed by its fields.
function parseBibTex (text) {
    var m = text.match(/^\s*@([^{]+){([^,\n]+)[,\n]/);
    if (!m) {
        throw new Error('Unrecognised header format');
    }
    var result = {
        typeName: m[1].trim(),
        citationKey: m[2].trim()
    };
    text = text.slice(m[0].length).trim();
    // Consume one field at a time until the entry's closing brace.
    while (text[0] !== '}') {
        var pair = parseBibTexLine(text);
        result[pair.field] = pair.value;
        text = text.slice(pair.length).trim();
    }
    return result;
}

console.log(parseBibTex(txt));

I certainly haven't tested this in depth, but when I run it on your input I get:

{ typeName: 'inproceedings', 
    citationKey: 'Carrel2005', 
    title: '{{Algorithm} for near-optimal autonomous resource management}', 
    author: '{Carrel, Ândrew and Palmer, Phil}', 
    notes: 'nonote ,', 
    booktitle: '{8th International Symposium on Artificial Intelligence, \n     Robotics and Automation in Space}', 
    year: '{2005}', 
    blahblah: 'error,' }
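The values in this output still carry their delimiters (`'nonote ,'`, `'{2005}'`). One possible post-processing step is sketched below. `cleanValue` is a hypothetical helper, not part of the parser above, and it assumes that a single trailing comma and a single pair of enclosing braces are the only things to remove:

```javascript
// Hypothetical helper (an assumption for illustration, not part of the parser).
// Removes a trailing comma and one pair of enclosing braces from a raw value.
function cleanValue(raw) {
    // Drop surrounding whitespace and one trailing comma, if present.
    var value = raw.trim().replace(/\s*,$/, '');
    if (value[0] !== '{' || value[value.length - 1] !== '}') {
        return value;
    }
    // Strip the outer braces only when the first '{' is closed by the last '}'.
    var depth = 0;
    for (var i = 0; i < value.length; i++) {
        if (value[i] === '{') {
            depth++;
        } else if (value[i] === '}') {
            depth--;
        }
        if (depth === 0 && i < value.length - 1) {
            return value; // the opening brace closed early, e.g. "{a} and {b}"
        }
    }
    // Leave unbalanced values untouched.
    return depth === 0 ? value.slice(1, -1) : value;
}

console.log(cleanValue('nonote ,')); // nonote
console.log(cleanValue('{2005}'));   // 2005
```

Inner braces such as `{Algorithm}` are kept on purpose, since they carry meaning in BibTeX (they protect capitalisation).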

Source

2015-12-11 11:59:41

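Thanks! I ended up using the BibTeX parser here: https://github.com/mikolalysenko/bibtex-parser, but your parser looks just as great. –

If, as was the case for me, the parsers above don't do the job, I suggest taking a look at [my project](https://github.com/digitalheir/bibtex-js/). It is consistent with *Tame the BeaST* – Maarten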
