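ANTLR4 - Parsing regular expression literals in a JavaScript grammar

I'm using ANTLR4 to generate a lexer for a JavaScript preprocessor (basically, it tokenizes a JavaScript file and extracts every string literal).

I started from a grammar originally written for ANTLR3 and imported the relevant parts (only the lexer rules) into v4.

I have just one problem left: I don't know how to handle corner cases for regular expression literals, like this one: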

log(Math.round(v * 100)/100 + ' msec/sample');

The `/ 100 + ' msec/` part is interpreted as a regular expression literal, because the lexer rule is always active.
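The effect can be reproduced outside ANTLR. The sketch below approximates an always-active regex-literal rule with a plain JavaScript regular expression; the pattern `naiveRegexLiteral` is an illustrative stand-in written for this example, not the grammar's actual rule.

```javascript
// Rough stand-in for an always-active RegularExpressionLiteral rule:
// a slash, a first char that is not '*', '/' or a newline, any further
// non-slash chars, a closing slash, then optional flag letters.
// This pattern is an assumption made for illustration only.
const naiveRegexLiteral = /\/[^*\/\n][^\/\n]*\/[A-Za-z]*/;

const source = "log(Math.round(v * 100)/100 + ' msec/sample');";
const match = source.match(naiveRegexLiteral);

// The division operator is mistaken for the opening slash of a regex
// literal, and the word after the second slash is consumed as flags.
console.log(match[0]); // prints: /100 + ' msec/sample
```

Nothing in the pattern itself can tell the two uses of `/` apart; the decision needs to know what came before the slash.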

What I'd like is to bring in this logic (it is C# code; I need it in JavaScript, and I just don't know how to adapt it):

/// <summary>
/// Indicates whether regular expression (yields true) or division expression recognition (false) in the lexer is enabled.
/// These are mutual exclusive and the decision which is active in the lexer is based on the previous on channel token.
/// When the previous token can be identified as a possible left operand for a division this results in false, otherwise true.
/// </summary>
private bool AreRegularExpressionsEnabled
{
    get
    {
        if (Last == null)
        {
            return true;
        }

        switch (Last.Type)
        {
            // identifier
            case Identifier:
            // literals
            case NULL:
            case TRUE:
            case FALSE:
            case THIS:
            case OctalIntegerLiteral:
            case DecimalLiteral:
            case HexIntegerLiteral:
            case StringLiteral:
            // member access ending
            case RBRACK:
            // function call or nested expression ending
            case RPAREN:
                return false;

            // otherwise OK
            default:
                return true;
        }
    }
}

This logic was used in the old grammar as an inline predicate, like this:

RegularExpressionLiteral 
    : { AreRegularExpressionsEnabled }?=> DIV RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart* 
    ;

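But I don't know how to use this technique in ANTLR4.

The ANTLR4 book has some suggestions for solving this kind of problem at the parser level (chapter 12.2, context-sensitive lexical problems), but I don't want to use a parser. I just want to extract all the tokens, leave everything untouched except the string literals, and keep parsing out of the picture.

Any suggestion would be much appreciated, thanks!

Source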

2016-08-12 A. Chiesa

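This is clearly a problem you cannot solve with lexing alone. Lexing only produces token values for a given input. It has no information about how to handle the RE input. If the meaning of a particular input sequence changes depending on some context, you can only deal with it on the parser side, or manually by adding a semantic phase after lexing.

While your comment is true for the abstract task of lexing, in Antlr3 you could attach small bits of logic to the lexer grammar, just enough to solve my problem. I didn't need a parser in v3. Do I need one now in v4?

You can still use predicates in ANTLR4, but [the syntax is different](http://stackoverflow.com/documentation/antlr4/3271/lexer-rules/11237/actions-and-semantic-predicates#t=201608131645183220069). Also, put the predicate at the end of the rule for performance reasons (or better yet, right after the first '/' delimiter char).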

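I'm posting the final solution here. It was developed by adapting the existing grammar to the new ANTLR4 syntax and accounting for the differences in JavaScript.

I'm only posting the relevant parts, to give others a clue about a working strategy.

The rule was modified as follows: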

RegularExpressionLiteral 
    : DIV {this.isRegExEnabled()}? RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart* 
    ;

isRegExEnabled is a function defined in the @members section at the top of the lexer grammar, as follows:

@members {
// Wrap the inherited nextToken and remember the last token that is not on
// the hidden channel (comments and whitespace are ignored).
EcmaScriptLexer.prototype.nextToken = function() {
    // nextToken takes no parameters, so call it with the receiver only.
    var result = antlr4.Lexer.prototype.nextToken.call(this);
    if (result.channel !== antlr4.Lexer.HIDDEN) {
        this._Last = result;
    }

    return result;
};

// A regex literal may start here unless the last significant token can be
// the left operand of a division.
EcmaScriptLexer.prototype.isRegExEnabled = function() {
    var la = this._Last ? this._Last.type : null;
    return la !== EcmaScriptLexer.Identifier &&
        la !== EcmaScriptLexer.NULL &&
        la !== EcmaScriptLexer.TRUE &&
        la !== EcmaScriptLexer.FALSE &&
        la !== EcmaScriptLexer.THIS &&
        la !== EcmaScriptLexer.OctalIntegerLiteral &&
        la !== EcmaScriptLexer.DecimalLiteral &&
        la !== EcmaScriptLexer.HexIntegerLiteral &&
        la !== EcmaScriptLexer.StringLiteral &&
        la !== EcmaScriptLexer.RBRACK &&
        la !== EcmaScriptLexer.RPAREN;
};
}

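As you can see, two functions are defined. The first overrides the lexer's nextToken method: it wraps the existing nextToken and saves the last token that is not a comment or whitespace for later reference. The semantic predicate then calls isRegExEnabled, which checks whether that last significant token is compatible with a RegEx literal starting at this point. If it is not, the function returns false and the rule does not match.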
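The same strategy can be tried without ANTLR. What follows is a minimal sketch of a hand-written toy tokenizer that remembers the last significant token and uses it to decide whether a `/` may start a regex literal. Every name in it (`T`, `DIVISION_OPERANDS`, `tokenize`) and every pattern is an assumption made for this illustration. It is far cruder than the real grammar: for instance, it lumps keywords in with identifiers, so a regex directly after `return` would be misread.

```javascript
// Toy token kinds; hypothetical names, not the generated lexer's constants.
const T = { Identifier: 'Identifier', Number: 'Number', String: 'String',
            RParen: 'RParen', RBrack: 'RBrack', Regex: 'Regex', Punct: 'Punct' };

// Token kinds that can be the left operand of a division.
const DIVISION_OPERANDS = new Set(
  [T.Identifier, T.Number, T.String, T.RParen, T.RBrack]);

function tokenize(src) {
  const rules = [
    [null, /^\s+/], // whitespace: consumed but never recorded
    [T.String, /^'(?:[^'\\\n]|\\.)*'|^"(?:[^"\\\n]|\\.)*"/],
    [T.Number, /^\d+(?:\.\d+)?/],
    [T.Identifier, /^[A-Za-z_$][\w$]*/],
    [T.RParen, /^\)/],
    [T.RBrack, /^\]/],
  ];
  const regexLiteral = /^\/(?:[^*\/\\\n]|\\.)(?:[^\/\\\n]|\\.)*\/[A-Za-z]*/;
  const tokens = [];
  let last = null; // last significant token, playing the role of this._Last
  let pos = 0;
  while (pos < src.length) {
    const rest = src.slice(pos);
    let type = null;
    let text = null;
    // The predicate: a regex is allowed only when the previous significant
    // token cannot be the left operand of a division.
    const regexEnabled = last === null || !DIVISION_OPERANDS.has(last.type);
    const re = regexEnabled ? regexLiteral.exec(rest) : null;
    if (re) {
      type = T.Regex;
      text = re[0];
    } else {
      for (const [kind, pattern] of rules) {
        const m = pattern.exec(rest);
        if (m) { type = kind; text = m[0]; break; }
      }
      // Anything unmatched becomes a one-character punctuation token.
      if (text === null) { type = T.Punct; text = rest[0]; }
    }
    pos += text.length;
    if (type !== null) {
      last = { type, text };
      tokens.push(last);
    }
  }
  return tokens;
}

const demo = tokenize("log(Math.round(v * 100)/100 + ' msec/sample');");
// The slash after ')' is plain punctuation, so the string literal survives.
console.log(demo.filter(t => t.type === T.String).map(t => t.text));
// After '=' a regex literal is allowed.
console.log(tokenize("x = /ab+c/g;").filter(t => t.type === T.Regex).map(t => t.text));
```

Keeping the state in a single `last` variable mirrors the wrapped nextToken above: only tokens that matter to the decision update it, so whitespace between an operand and the slash does not change the outcome.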

Thanks to Lucas Trzesniewski for his comment, which pointed me in the right direction, and to Patrick Hulsmeijer for his original work on v3.

Source

2016-08-23 07:02:40
