如何獲取正則表達式字符串的AST？

2

boost :: regex似乎在basic_regex_parser.hpp中有一個手寫的遞歸下降解析器。儘管感覺非常像重新發明輪子，但在自己編寫boost :: spirit語法時，您可能會更快，尤其是在大量正則表達式格式的情況下。

2011-10-05 08:56:03 thiton

+0

我剛剛[做到了這一點（http://stackoverflow.com/a/21419351/85371）。然而，公平的警告：我正在用語法變化來表達我的想法。（雖然我已經測試了支持的語法，但是） – sehe

9

我認爲Boost Xpressive必須能夠'幾乎'開箱即用。

xpressive is an advanced, object-oriented regular expression template library for C++. Regular expressions can be written as strings that are parsed at run-time, or as expression templates that are parsed at compile-time. Regular expressions can refer to each other and to themselves recursively, allowing you to build arbitrarily complicated grammars out of them.

我要看看我是否能確認（用小樣本）。

其他想法包括使用Boost Spirit與通用utree工具「存儲」AST。你必須重現一個語法（這對於常見的Regex語法子集來說相對簡單），所以這可能意味着更多的工作。

進展報告1

看着xpressive中，我取得了一些進展。我使用DDD偉大的圖形數據顯示器獲得了一些漂亮的照片。但不夠漂亮。

然後我更多地探討了'代碼'的一面：Xpressive建立在Boost Proto上。它使用Proto至定義 DSEL，它直接在C++代碼中建模正則表達式。 Proto從C++代碼（通過重載所有可能的操作符）完全一般地生成表達式樹（通用AST，如果您願意的話）。該庫（在這種情況下是Xpressive）然後需要通過來定義語義，例如走樹並且例如

建設領域的具體表達式樹
標註/語義信息裝飾也
可能直接採取語義動作（例如升壓靈怎麼做齊和噶語義動作）

正如你所看到的，天空真的是極限，而且看起來令人不安的是類似Boo，Nemerle，Lisp等編譯器宏。

可視化表達Trres

現在，升壓原表達式樹可以一般可視化：

從例如從Expressive C++: Playing with Syntax工作我略有延長xpressive中的「Hello World」的例子來顯示錶達樹：

#include <iostream> 
#include <boost/xpressive/xpressive.hpp> 
#include <boost/proto/proto.hpp> 

using namespace boost::xpressive; 

int main() 
{ 
    std::string hello("hello world!"); 

    sregex rex = sregex::compile("(\\w+) (\\w+)!"); 

    // equivalent proto based expression 
    rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!'; 
    boost::proto::display_expr((s1= +_w) >> ' ' >> (s2= +_w) >> '!'); 

    smatch what; 

    if(regex_match(hello, what, rex)) 
    { 
     std::cout << what[0] << '\n'; // whole match 
     std::cout << what[1] << '\n'; // first capture 
     std::cout << what[2] << '\n'; // second capture 
    } 

    return 0; 
}

其輸出接近（注意compiler ABI具體typeid名）：

shift_right(
    shift_right(
     shift_right(
      assign(
       terminal(N5boost9xpressive6detail16mark_placeholderE) 
       , unary_plus(
        terminal(N5boost9xpressive6detail25posix_charset_placeholderE) 
       ) 
      ) 
      , terminal() 
     ) 
     , assign(
      terminal(N5boost9xpressive6detail16mark_placeholderE) 
      , unary_plus(
       terminal(N5boost9xpressive6detail25posix_charset_placeholderE) 
      ) 
     ) 
    ) 
    , terminal(!) 
) 
hello world! 
hello 
world

免責聲明你應該認識到，這是不實際顯示的正則表達式AST，而是一般表達式樹來自Proto的，所以它沒有特定於域（Regex）的信息。我之所以提到它，是因爲它的差異可能會導致更多的工作（除非我找到Xpressive的編譯結構的鉤子），因爲它對於原始問題變得真正有用。

這就是它現在

我會離開這一點，因爲它是吃午飯的時候，我拿起孩子，但是這肯定會吸引我的興趣，所以我打算更晚發佈！

結論/進展報告1.0000001

壞消息馬上：它不會工作。

這是爲什麼。該免責聲明是對的錢。當週末到達時，我已經思考了一些事情，並且「預測」整個事情會在我離開它的地方分解：AST基於原始表達樹（而不是正則表達式matchable_ex）。

這個事實在一些代碼檢查後很快得到確認：在編譯之後，原型表達式樹不再可用來顯示。更不用說，當basic_regex首先被指定爲動態模式時（從來沒有一個原始表達式）。

我一直在一半希望匹配是直接在原型表達樹上實現（使用原型評估/評估上下文），但很快證實情況並非如此。

所以，主要的外賣是：

這一切不會顯示任何正則表達式AST
你可以用上面做的最好的是可視化原表達工作，你必須直接在代碼中創建。這是一種奇特的方式，只需在相同的代碼中手動編寫AST ...

略少於嚴格的觀察包括

升壓原和Boost表現是非常有趣的庫（我並不介意在那裏去釣魚）。我明顯瞭解了一些關於模板元編程庫的重要教訓，特別是這些庫。
很難設計一個構建靜態類型表達式樹的正則表達式解析器。實際上，在一般情況下是不可能的 - 它需要編譯器將所有可能的表達式樹組合實例化到一定的深度。這顯然不會縮放。你可以通過引入多態組合並使用多態調用來解決這個問題，但是這會消除模板元編程（靜態實例化類型/專業化的編譯時優化）的好處。
兩個升壓正則表達式和Boost表現將有可能支持某種形式的正則表達式AST的內部（支持匹配的評估），但
- 尚未暴露/記錄
- 沒有該

_{甚至精神萊克斯支持他們，對這個問題（而不是默認情況下）}

來源

2011-10-05 09:32:34 sehe

+0

添加了一個小的免責聲明以防止任何誤解。這是正在進行中的工作。但它顯示出潛力的真實跡象，IMO – sehe

+0

+1爲努力o.O – ildjarn

+0

gah - 等待週末繼續這項任務;當然，這很快就是這條路的終點......後面看來，我應該看到了。我更新了結論的答案。乾杯! – sehe

25

我再次偶然發現了這個問題。我決定看看用Boost Spirit爲正則表達式語法的重要子集編寫解析器有多困難。

所以，像往常一樣，我開始用筆和紙，過了一段時間後有一些草案的規則。時間繪製類似的AST起來：

namespace ast 
{ 
    struct multiplicity 
    { 
     unsigned minoccurs; 
     boost::optional<unsigned> maxoccurs; 
     bool greedy; 

     multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1) 
      : minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true) 
     { } 

     bool unbounded() const { return !maxoccurs; } 
     bool repeating() const { return !maxoccurs || *maxoccurs > 1; } 
    }; 

    struct charset 
    { 
     bool negated; 

     using range = boost::tuple<char, char>; // from, till 
     using element = boost::variant<char, range>; 

     std::set<element> elements; 
     // TODO: single set for loose elements, simplify() method 
    }; 

    struct start_of_match {}; 
    struct end_of_match {}; 
    struct any_char {}; 
    struct group; 

    typedef boost::variant< // unquantified expression 
     start_of_match, 
     end_of_match, 
     any_char, 
     charset, 
     std::string,   // literal 
     boost::recursive_wrapper<group> // sub expression 
    > simple; 

    struct atom    // quantified simple expression 
    { 
     simple  expr; 
     multiplicity mult; 
    }; 

    using sequence = std::vector<atom>; 
    using alternative = std::vector<sequence>; 
    using regex  = boost::variant<atom, sequence, alternative>; 

    struct group { 
     alternative root; 

     group() = default; 
     group(alternative root) : root(std::move(root)) { } 
    }; 
}

這是你與靈效果很好（由於通過variant和optional與升壓整合，以及具有戰略性的選擇構造函數）典型的AST（58 LOC）。

語法最後我們只略長：

template <typename It> 
    struct parser : qi::grammar<It, ast::alternative()> 
{ 
    parser() : parser::base_type(alternative) 
    { 
     using namespace qi; 
     using phx::construct; 
     using ast::multiplicity; 

     alternative = sequence % '|'; 
     sequence = *atom; 

     simple  = 
         (group) 
        | (charset) 
        | ('.' >> qi::attr(ast::any_char())) 
        | ('^' >> qi::attr(ast::start_of_match())) 
        | ('$' >> qi::attr(ast::end_of_match())) 
        // optimize literal tree nodes by grouping unquantified literal chars 
        | (as_string [ +(literal >> !char_("{?+*")) ]) 
        | (as_string [ literal ]) // lone char/escape + explicit_quantifier 
        ; 

     atom  = (simple >> quantifier); // quantifier may be implicit 

     explicit_quantifier = 
        // bounded ranges: 
         lit('?')         [ _val = construct<multiplicity>(0, 1) ] 
        | ('{' >> uint_ >> '}')     [ _val = construct<multiplicity>(_1, _1) ] 
        // repeating ranges can be marked non-greedy: 
        | (          
          lit('+')        [ _val = construct<multiplicity>(1, boost::none) ] 
         | lit('*')        [ _val = construct<multiplicity>(0, boost::none) ] 
         | ('{' >> uint_ >> ",}")    [ _val = construct<multiplicity>(_1, boost::none) ] 
         | ('{' >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2) ] 
         | ("{," >> uint_ >> '}')    [ _val = construct<multiplicity>(0, _1) ] 
        ) >> -lit('?')  [ phx::bind(&multiplicity::greedy, _val) = false ] 
        ; 

     quantifier = explicit_quantifier | attr(ast::multiplicity()); 

     charset  = '[' 
        >> (lit('^') >> attr(true) | attr(false)) // negated 
        >> *(range | charset_el) 
        > ']' 
        ; 

     range  = charset_el >> '-' >> charset_el; 

     group  = '(' >> alternative >> ')'; 

     literal  = unescape | ~char_("\\+*?.^$|{()") ; 

     unescape = ('\\' > char_); 

     // helper to optionally unescape waiting for raw ']' 
     charset_el = !lit(']') >> (unescape|char_); 
    } 

    private: 
    qi::rule<It, ast::alternative()> alternative; 
    qi::rule<It, ast::sequence()>  sequence; 
    qi::rule<It, ast::atom()>   atom; 
    qi::rule<It, ast::simple()>   simple; 
    qi::rule<It, ast::multiplicity()> explicit_quantifier, quantifier; 
    qi::rule<It, ast::charset()>  charset; 
    qi::rule<It, ast::charset::range()> range; 
    qi::rule<It, ast::group()>   group; 
    qi::rule<It, char()>    literal, unescape, charset_el; 
};

現在，真正的樂趣是做與AST的東西。既然你想要可視化樹，我想從AST生成DOT圖。所以我做：

int main() 
{ 
    std::cout << "digraph common {\n"; 

    for (std::string pattern: { 
      "abc?", 
      "ab+c", 
      "(ab)+c", 
      "[^-a\\-f-z\"\\]aaaa-]?", 
      "abc|d", 
      "a?", 
      ".*?(a|b){,9}?", 
      "(XYZ)|(123)", 
     }) 
    { 
     std::cout << "// ================= " << pattern << " ========\n"; 
     ast::regex tree; 
     if (doParse(pattern, tree)) 
     { 
      check_roundtrip(tree, pattern); 

      regex_todigraph printer(std::cout, pattern); 
      boost::apply_visitor(printer, tree); 
     } 
    } 

    std::cout << "}\n"; 
}

此程序將導致如下圖：

enter image description here

自邊緣描繪重複和顏色指示是否匹配是貪婪（紅色）或非貪婪（藍色）。正如你可以看到我已經優化了AST位爲清楚起見，但（UN）評論相關的行會賺差價：

enter image description here

我想會不會太難調。希望它會成爲某人的靈感。在這個主旨

全碼：https://gist.github.com/sehe/8678988

來源

2014-01-28 23:48:42 sehe

+11

+1被虐狂。我甚至沒有膽量使用正則表達式，更不用說了*爲他們寫一個解析器* – 2014-02-01 22:20:51

+1

這件事很漂亮，不能夠提高足夠的... –

+1

想要感謝你的代碼 - 這是我對Boost.Spirit的核心介紹，現在我正在用我的自己的項目嬉戲...... –

如何獲取正則表達式字符串的AST？

回答

進展報告1

可視化表達Trres

這就是它現在

結論/進展報告1.0000001

相關問題