Problem with regular expression matching in a lexer

I am building a fairly complex template engine, complex in that it will use constructs typical of programming languages (such as if statements and loops).

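At the moment I am working on the lexer, which I believe is the part that converts a stream of characters into tokens. What I want to do is capture certain structures within an HTML document so that the parser can process them later.

Here is an example of the syntax: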

<head> 

    <title>Template</title> 
    <meta charset="utf-8"> 

</head> 

<body> 

    <h1>{{title}}</h1> 

    <p>This is also being matched.</p> 

    {{#myName}} 
     <p>My name is {{myName}}</p> 
    {{/}} 

    <p>This content too.</p> 

    {{^myName}} 
     <p>I have no name.</p>
    {{/}} 

    <p>No matching here...</p> 

</body>

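I want to scan only what lies between the opening '{{' characters and the closing '}}' characters. So {{title}} should be one match, and {{#myName}} together with the text and content leading up to {{/}} should be a second match.

I'm not particularly good at regular expressions, and I'm fairly sure the problem lies in the pattern I have devised, which is this: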

({{([#\^]?)([a-zA-Z0-9\._]+)}}([\w\W]+){{\/?}})

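As I read it, this matches two { characters, then either # or ^, then any word made up of uppercase or lowercase letters, digits, dots or underscores. It then matches anything after the closing }} characters until {{/}} is met, but the /}} part is optional.

The problem can be seen at the link below: the pattern is matching text that is not inside the {{ and }} blocks. I wonder whether it has to do with the use of \w and \W, because if I specify exactly which characters I want to match in that set, it seems to work.

The regex test is here. I have looked at a shared list entry with a regular expression that captures all text that is not HTML, and I noticed that it uses lookaheads, which I can't get my head around, nor do I understand why they would help me.

Can someone help me by pointing out what is wrong with the regular expression, or am I taking the wrong approach to creating a lexer?

I hope I've provided enough information, and thanks for any help!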

Source

2013-12-15 Mark

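You are matching too much. You should simply turn each individual "{{things}}" into a token. For a start, your example contains nested "{{things}}", which surely cannot be a single token (and, of course, regular expressions are not sufficient to capture that kind of structure). In fact, anything that has structure should be handled in the grammar, not in the lexer. – tripleee

@tripleee I intend to do that, but I wanted to capture them first and break them down afterwards, since I'm not interested in the other content. – Mark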
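The approach described in the comment above, one token per {{...}} tag with nesting left to the parser, could be sketched roughly as follows. This is only an illustration under assumptions: the tokenize() function and the token type names (TEXT, VAR, OPEN, OPEN_INVERTED, CLOSE) are hypothetical and do not come from the thread.

```php
<?php
// Hypothetical helper, not from the thread: split a template into flat tokens.
function tokenize(string $src): array
{
    // Split on {{...}} tags and keep them: each tag becomes its own piece.
    $parts = preg_split(
        '~(\{\{[#^/]?[\w.]*\}\})~',
        $src,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    // Assumed token names, keyed by the sigil that follows "{{".
    $types = ['' => 'VAR', '#' => 'OPEN', '^' => 'OPEN_INVERTED', '/' => 'CLOSE'];

    $tokens = [];
    foreach ($parts as $part) {
        if (preg_match('~^\{\{([#^/]?)([\w.]*)\}\}$~', $part, $m)) {
            $tokens[] = [$types[$m[1]], $m[2]]; // a tag: [type, name]
        } else {
            $tokens[] = ['TEXT', $part];        // anything else is plain text
        }
    }
    return $tokens;
}

$tokens = tokenize('<h1>{{title}}</h1>{{#myName}}<p>{{myName}}</p>{{/}}');
print_r($tokens);
```

For this input the result is a flat list of eight tokens (TEXT, VAR, TEXT, OPEN, TEXT, VAR, TEXT, CLOSE). Matching each OPEN with its CLOSE is then the parser's job, which is where nesting is easiest to handle.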

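Your pattern doesn't work because [\w\W]+ will take all possible characters up to the last {{/}} in the string. Quantifiers (i.e. +, *, {1,3}, ?) are greedy by default. To get a lazy quantifier you must add a ? after it: [\w\W]+?

A pattern that deals with nested structures: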
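A small comparison may make the greedy/lazy difference concrete. This is a sketch on a made-up input string (not the document from the question), using a simplified form of the question's pattern with the braces escaped and the closing tag written as a mandatory {{/}}:

```php
<?php
// Hypothetical sample input, not taken from the question's test page.
$html = '{{#a}}one{{/}} plain text {{#b}}two{{/}}';

// Greedy: [\w\W]+ runs on to the LAST {{/}} in the string.
$greedy = '~\{\{([#^]?)([\w.]+)\}\}([\w\W]+)\{\{/\}\}~';

// Lazy: [\w\W]+? stops at the FIRST {{/}} it can reach.
$lazy = '~\{\{([#^]?)([\w.]+)\}\}([\w\W]+?)\{\{/\}\}~';

preg_match_all($greedy, $html, $g);
preg_match_all($lazy, $html, $l);

print_r($g[0]); // one match spanning the whole string, " plain text " included
print_r($l[0]); // two matches: {{#a}}one{{/}} and {{#b}}two{{/}}
```

The lazy version stops swallowing the text between blocks, but it still pairs an opening tag with the nearest {{/}}, so it does not handle nested blocks correctly.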

$pattern = <<<'LOD'
~
{{
(?|                   # branch reset group: capturing group numbers
                      # are the same in all alternatives
    ([\w.]++)}}       # self-closing tag: capturing group 1: tag name
  |                   # OR
    ([#^][\w.]++)}}   # opening tag: capturing group 1: tag name
    (                 # capturing group 2: content
      (?>             # atomic group: three possible content types
          [^{]++      # all characters except {
        |             # OR
          {(?!{)      # a { not followed by another {
        |             # OR
          (?R)        # another tag is met: attempt the whole pattern again
      )*              # repeat the atomic group 0 or more times
    )                 # close the second capturing group
    {{/}}             # closing tag
)                     # close the branch reset group
~x
LOD;

preg_match_all($pattern, $html, $matches);

var_dump($matches);

To get all the nesting levels, you can use this pattern:

$pattern = <<<'LOD'
~
(?=(                              # open a lookahead and the 1st capturing group
    {{
    (?|
        ([\w.]++)}}
      |
        ([#^][\w.]++)}}
        (                         # ?R was changed to ?1 because I don't want to
          (?>[^{]++|{(?!{)|(?1))* # repeat the whole pattern but only the
        )                         # subpattern in the first capturing group
        {{/}}
    )
)                                 # close the 1st capturing group
)                                 # and the lookahead
~x
LOD;

preg_match_all($pattern, $html, $matches);

var_dump($matches);

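This pattern is simply the first pattern enclosed in a lookahead and a capturing group. This construct makes it possible to capture overlapping substrings.

More detailed information about the regex features used in these two patterns: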
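Why a capture inside a lookahead yields overlapping substrings is perhaps easier to see on a toy case. The following is a minimal sketch with a made-up subject string, unrelated to the template syntax:

```php
<?php
$subject = 'abcd';

// Plain pattern: every match consumes its characters, so matches cannot overlap.
preg_match_all('~\w\w~', $subject, $plain);

// Capture inside a lookahead: the match itself is zero-width, so the engine
// moves forward one character at a time and the captured text may overlap.
preg_match_all('~(?=(\w\w))~', $subject, $overlap);

print_r($plain[0]);   // ab, cd
print_r($overlap[1]); // ab, bc, cd
```

In the template pattern the same mechanism lets an inner block be reported as its own match even though it lies inside an outer block that has already been captured.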

possessive quantifiers: ++

atomic groups: (?>..)

lookahead: (?=..), (?!..)

branch reset group: (?|..|..)

recursion: (?R), (?1)

Source

2013-12-15 18:57:35

Thanks. I will definitely look at those resources. There is a lot to regular expressions. – Mark