2015-12-11

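How to parse complex BibTeX items with JavaScript and regular expressions

I am trying to use regular expressions to parse BibTeX files with JavaScript, and I cannot seem to find a proper solution. In the following example, bj is an array containing the sub-items (fields) of a bibliography item. I had to write a rather long regex to account for items whose values can be split across multiple lines, lack curly braces ({}), or have syntactically erroneous commas (e.g. the last field should not end with a comma, but some TeX editors do not complain).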

This is what I use to test my regex:

@inproceedings{Carrel2005, 
    title  = {{Algorithm} for near-optimal autonomous resource management}, 
    author  = {Carrel, Ândrew and Palmer, Phil}, 
    notes  = nonote , 
    booktitle = {8th International Symposium on Artificial {Intelligence, 
       Robotics}, and Automation in Space}, 
    year  = {2005} 
    blahblah = error, 
} 

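As you can see, some values are split across two lines and can contain inner curly braces. I have been trying to improve the regex as follows: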

var txt = "@inproceedings{Carrel2005, \n" + 
      " title  = {{Algorithm} for near-optimal autonomous resource management}, \n" + 
      " author  = {Carrel, Ândrew and Palmer, Phil}, \n" + 
      " notes  = nonote ,\n" + 
      " booktitle = {8th International Symposium on Artificial Intelligence, \n" + 
      "     Robotics and Automation in Space}, \n" + 
      " year  = {2005} \n" + 
      " blahblah = error,\n}"; 

bj = txt.match(/\w*[\t ]*=[\t ]*(\{[\u0020-\u0080\u00A1-\u00FF\u0300-\u036F\t\r\n]*?}|[a-zA-Z0-9]+)[\t ]*(,(?!\s*}))?/g); 

Explanation:

\w*    A word for the field name. 
[\t ]*=[\t ]*  Any number of spaces or tabs after and before the equal sign. 
(    Start of group 1. 
    \{    Option 11: starts by an opening curly brace. 
    [    Start of character class AAA. 
    unicode-set Letters (basic Latin plus some extensions) 
    \t\r\n  ... or whitespace. 
    ]*?    End of character class AAA (with LAZY repetition) 
|     End of option 11, start of option 12: 
    [a-zA-Z0-9]+ One or more characters (no underscore or whitespace allowed). 
)     End of option 12 and group 1. 
[\t ]*   Any number of tabs or spaces. 
(    Start of group 2: 
    ,    A literal comma 
    (?!\s*})  ...if it is not followed by whitespace and closing curly braces. 
)?    End of group 2. ? denotes it is optional. 

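I have not been able to match fields that start with multiple curly braces (such as {{Algorithm} for near...), nor to correctly match those where the sequence }, occurs inside the value.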
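The failure can be reproduced in isolation. The snippet below is a minimal sketch that uses a simplified stand-in for the brace branch of the regex (the pattern and the variable names are assumptions made for illustration, not the full expression above):

```javascript
// Simplified stand-in for the "{...}" branch: a lazy match up to the first "}".
var lazyBraces = /\{[\s\S]*?\}/;

var value = "{{Algorithm} for near-optimal autonomous resource management},";
var matched = value.match(lazyBraces)[0];

// The lazy quantifier stops at the first closing brace it meets,
// so the match ends inside the value instead of at the outer brace.
console.log(matched); // prints "{{Algorithm}"
```

A greedy quantifier would not help either: it would run to the last closing brace in the input, which in a full entry belongs to a later field or to the entry itself.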


Trying to write a *parser* with *RegEx* always reminds me of this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Why don't you write a proper parser? – Marc


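Reliably matching curly braces '{}' with a regular expression is basically impossible. The problem is that there is no way to store how many opening braces you have seen, so there is no way to know when you are done. You will need to write (or use) a proper parser. That probably won't be too difficult with the help of regular expressions –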
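The state this comment refers to can be as small as one counter. The following is a minimal sketch of that idea; findClosingBrace is a hypothetical helper name, and the sketch assumes that every brace counts (escaped braces such as \{ are not treated specially):

```javascript
// Return the index of the "}" that closes the "{" at position `open`,
// or -1 if the braces never balance. (Hypothetical helper, for illustration.)
function findClosingBrace(str, open) {
    var depth = 0; // number of braces currently open
    for (var i = open; i < str.length; i++) {
        if (str[i] === '{') {
            depth++;
        } else if (str[i] === '}') {
            depth--;
            if (depth === 0) {
                return i;
            }
        }
    }
    return -1;
}

var field = "{{Algorithm} for near-optimal autonomous resource management}, ";
var end = findClosingBrace(field, 0);
console.log(field.slice(0, end + 1));
// prints "{{Algorithm} for near-optimal autonomous resource management}"
```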

Answers


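As I mentioned in the comments, a regex cannot match arbitrarily nested braces, because that requires some state to store how many you have seen. You need a parser; once you add that state, it looks something like this: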

var txt = "@inproceedings{Carrel2005, \n" + 
    " title  = {{Algorithm} for near-optimal autonomous resource management}, \n" + 
    " author  = {Carrel, Ândrew and Palmer, Phil}, \n" + 
    " notes  = nonote ,\n" + 
    " booktitle = {8th International Symposium on Artificial Intelligence, \n" + 
    "     Robotics and Automation in Space}, \n" + 
    " year  = {2005} \n" + 
    " blahblah = error,\n}"; 


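function parseBibTexLine (text) {
    // Match the field name and the "=" that follows it.
    var m = text.match(/^\s*(\S+)\s*=\s*/);
    if (!m) {
        console.log('line: "' + text + '"');
        throw new Error('Unrecognised line format');
    }
    var name = m[1];
    var search = text.slice(m[0].length);
    var re = /[\n\r,{}]/g;  // the only characters that can end or nest a value
    var braceCount = 0;     // number of braces currently open
    var length = m[0].length;
    do {
        m = re.exec(search);
        if (!m) {
            // No delimiter left: the value (or a brace) was never closed.
            throw new Error('Unterminated value for field "' + name + '"');
        }
        if (m[0] === '{') {
            braceCount++;
        } else if (m[0] === '}') {
            if (braceCount === 0) {
                throw new Error('Unexpected closing brace: "}"');
            }
            braceCount--;
        }
    } while (braceCount > 0);
    return {
        field: name,
        // Includes the delimiter that ended the scan (e.g. a trailing comma).
        value: search.slice(0, re.lastIndex),
        // Characters consumed, plus one to skip the separator after the value.
        length: length + re.lastIndex + m[0].length
    };
}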

function parseBibTex (text) {
    // Header: "@type{citationKey," (the key ends at a comma or a newline).
    var m = text.match(/^\s*@([^{]+)\{([^,\n]+)[,\n]/);
    if (!m) {
        throw new Error('Unrecognised header format');
    }
    var result = {
        typeName: m[1].trim(),
        citationKey: m[2].trim()
    };
    text = text.slice(m[0].length).trim();
    // Consume one field at a time until the closing brace of the entry.
    while (text[0] !== '}') {
        var pair = parseBibTexLine(text);
        result[pair.field] = pair.value;
        text = text.slice(pair.length).trim();
    }
    return result;
}

console.log(parseBibTex(txt)); 

I certainly have not tested this thoroughly, but when I run it on your input I get:

{ typeName: 'inproceedings', 
    citationKey: 'Carrel2005', 
    title: '{{Algorithm} for near-optimal autonomous resource management}', 
    author: '{Carrel, Ândrew and Palmer, Phil}', 
    notes: 'nonote ,', 
    booktitle: '{8th International Symposium on Artificial Intelligence, \n     Robotics and Automation in Space}', 
    year: '{2005}', 
    blahblah: 'error,' } 
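
Note that the values in the output above still carry their outer braces and, for unbraced values, the trailing comma. A possible post-processing step is sketched below; cleanValue is a hypothetical helper that is not part of the parser, and it assumes that a value starting with "{" is wrapped in one matching pair of braces, which holds for the values the parser returns:

```javascript
// Hypothetical helper: tidy one raw value as returned by the parser above.
function cleanValue(raw) {
    // Drop surrounding whitespace and any trailing commas.
    var value = raw.trim().replace(/[\s,]+$/, '');
    // Remove one layer of outer braces (assumes they form a matching pair).
    if (value[0] === '{' && value[value.length - 1] === '}') {
        value = value.slice(1, -1);
    }
    // Collapse line breaks and runs of whitespace into single spaces.
    return value.replace(/\s+/g, ' ');
}

console.log(cleanValue('nonote ,')); // prints "nonote"
console.log(cleanValue('{2005}'));   // prints "2005"
console.log(cleanValue('{{Algorithm} for near-optimal autonomous resource management}'));
// prints "{Algorithm} for near-optimal autonomous resource management"
```

Inner braces are kept on purpose, since in BibTeX they usually protect capitalisation.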

Thanks! I ended up using the BibTeX parser here: https://github.com/mikolalysenko/bibtex-parser, but your parser looks equally great. –


If, as was the case for me, the parsers above are not enough, I suggest taking a look at [my project](https://github.com/digitalheir/bibtex-js/). It is in line with *Tame the BeaST* – Maarten