2016-11-11 46 views
0

背景信息:
我有一種編程語言的願望,知道這樣做的工具,我沒有任何關於如何使用它們的好例子。我真的不想使用Flex或Bison,因爲它沒有教導我覺得創建編譯器所需的抽象。我有創建字符串,將它們標記化,將它們提供給作爲語法的文件並解析最終創建實際程序來運行該語言的概念。問題是,我不知道如何編寫標記器或解析器。我有一般的想法,但是當我看到例子時我會更好地理解。如果有人可以發佈一個/幾個例子,那會很棒!在C中編寫語法標記器/解析器的最佳方式是什麼?

我的問題如下:
能怎麼辦有人張貼的例子寫在C語法標記生成器/分析器?

+1

有關於「編譯器構造」的書籍。來自N.Wirth的同名名字很容易閱讀。它不會太深,但是一個好的入口。 – Olaf

+0

說:你的問題是離題(拓寬拓寬並要求外部資源) – Olaf

+0

我遇到的問題是,我已經找到更多關於他們如何工作和部分如何構造,並沒有在實際的代碼在他們的經營背後。我是一名視覺學習者,只是看到一個例子,我就能夠向前邁進。 – Sora

回答

0

如果你想在C中編寫一個非常複雜的語法分析器,而不使用任何現有的模式匹配代碼,通常最好是實現一個狀態機,然後通過char處理源代碼char。 Flex + Bison的輸出也只是一個狀態機器。 Flex使用正則表達式將字符串標記爲令牌,然後將令牌傳遞給Bison狀態機,根據機器的當前狀態逐個處理一個令牌。但是您不需要正則表達式標記器,您可以將輸入標記爲狀態機處理的一部分。正則表達式匹配器本身也可以作爲一個狀態機來實現,因此令牌生成可以直接作爲狀態機的一部分。

這裏有一個有趣的鏈接給你;它不是特別多的一般概述狀態機是如何工作的C,但一旦你得到了這個概念,很容易轉置到C代碼:

Parsing command line arguments using a finite state machine and backtracking

這裏有一個超級原始CSV解析器的一些示例代碼:

#include <stdlib.h> 
#include <stdio.h> 

static char currentToken[4096]; 
static size_t currentTokenLength; 

static 
void addCharToCurrentToken (char c) { 
    if (currentTokenLength < sizeof(currentToken)) { 
     currentToken[currentTokenLength++] = c; 
    } 
} 

static 
void printCurrentToken () { 
    printf("Token: >>>%.*s<<<\n", (int)currentTokenLength, currentToken); 
    currentTokenLength = 0; 
} 


typedef enum { 
    STATE_FindStartOfData, 
    STATE_FindStartOfToken, 
    STATE_ParseNumber, 
    STATE_ParseString, 
    STATE_CheckEndOfString, 
    STATE_FindDelimiter, 
    STATE_ParseError, 
    STATE_EndOfData 
} ParserState; 


ParserState parserState = STATE_FindStartOfData; 


static 
void runTheStateMachine () { 
    while (parserState != STATE_ParseError 
      && parserState != STATE_EndOfData 
    ) { 
     int c = fgetc(stdin); 
     // End of data? 
     if (c == -1) { 
      switch (parserState) { 
       case STATE_ParseNumber: 
       case STATE_CheckEndOfString: 
        printCurrentToken(); 
        parserState = STATE_EndOfData; 
        break; 

       case STATE_ParseString: 
        // Data ends in the middle of token parsing? No way! 
        fprintf(stderr, "Data ended abruptly!\n"); 
        parserState = STATE_ParseError; 
        break; 

       case STATE_FindStartOfData: 
       case STATE_FindStartOfToken: 
       case STATE_FindDelimiter: 
        // This is okay, data stream may end while in these states 
        parserState = STATE_EndOfData; 
        break; 

       case STATE_ParseError: 
       case STATE_EndOfData: 
        break; 
      } 
     } 

     switch (parserState) { 
       case STATE_FindStartOfData: 
        // Skip blank lines 
        if (c == '\n' || c == '\r') break; 
        // !!!FALLTHROUGH!!! 

       case STATE_FindStartOfToken: 
        // Skip overe all whitespace 
        if (c == ' ' || c == '\t') break; 
        // Start of string? 
        if (c == '"') { 
         parserState = STATE_ParseString; 
         break; 
        } 
        // Blank field? 
        if (c == ',') { 
         printCurrentToken(); 
         break; 
        } 
        // End of dataset? 
        if (c == '\n' || c == '\r') { 
         printf("------------------------------------------\n"); 
         parserState = STATE_FindStartOfData; 
         break; 
        } 
        // Everything else can only be a number 
        parserState = STATE_ParseNumber; 
        addCharToCurrentToken(c); 
        break; 

       case STATE_ParseNumber: 
        if (c == ' ' || c == '\t') { 
         // Numbers cannot contain spaces in the middle, 
         // so this must be the end of the number. 
         printCurrentToken(); 
         // We still need to find the real delimiter, though. 
         parserState = STATE_FindDelimiter; 
         break; 
        } 
        if (c == ',') { 
         // This time the number ends directly with a delimiter 
         printCurrentToken(); 
         parserState = STATE_FindStartOfToken; 
         break; 
        } 
        // End of dataset? 
        if (c == '\n' || c == '\r') { 
         printCurrentToken(); 
         printf("------------------------------------------\n"); 
         parserState = STATE_FindStartOfData; 
         break; 
        } 
        // Otherwise keep reading the number 
        addCharToCurrentToken(c); 
        break; 

       case STATE_ParseString: 
        if (c == '"') { 
         // Either this is the regular end of the string or it is just an 
         // escaped quotation mark which is doubled ("") in CVS. 
         parserState = STATE_CheckEndOfString; 
         break; 
        } 
        // All other chars are just treated as ordinary chars 
        addCharToCurrentToken(c); 
        break; 

       case STATE_CheckEndOfString: 
        if (c == '"') { 
         // Next char is also a quotation mark, 
         // so this was not the end of the string. 
         addCharToCurrentToken(c); 
         parserState = STATE_ParseString; 
         break; 
        } 
        if (c == ' ' || c == '\t') { 
         // It was the end of the string 
         printCurrentToken(); 
         // We still need to find the real delimiter, though. 
         parserState = STATE_FindDelimiter; 
         break; 
        } 
        if (c == ',') { 
         // It was the end of the string 
         printCurrentToken(); 
         // And we even found the delimiter 
         parserState = STATE_FindStartOfToken; 
         break; 
        } 
        if (c == '\n' || c == '\r') { 
         // It was the end of the string 
         printCurrentToken(); 
         // And we even found the end of this dataset 
         printf("------------------------------------------\n"); 
         parserState = STATE_FindStartOfData; 
         break; 
        } 
        // Everything else is a parse error I guess 
        fprintf(stderr, "Unexpected char 0x%02X after end of string!\n", c); 
        parserState = STATE_ParseError; 
        break; 

       case STATE_FindDelimiter: 
        // Delemiter found? 
        if (c == ',') { 
         parserState = STATE_FindStartOfToken; 
         break; 
        } 
        // Just skip overe all whitespace 
        if (c == ' ' || c == '\t') break; 
        // End of dataset? 
        if (c == '\n' || c == '\r') { 
         // And we even found the end of this dataset 
         printf("------------------------------------------\n"); 
         parserState = STATE_FindStartOfData; 
         break; 
        } 
        // Anything else a pare error I guess 
        fprintf(stderr, "Unexpected char 0x%02X after end of token!\n", c); 
        parserState = STATE_ParseError; 
        break; 

       case STATE_ParseError: 
        // Nothing to do 
        break; 

       case STATE_EndOfData: 
        // Nothing to do 
        break; 
     } 
    } 
} 

int main () { 
    runTheStateMachine(); 
    return (parserState == STATE_EndOfData ? 0 : 1); 
} 

的代碼作以下假設:

  • 令牌是從來不大於4096個字符。
  • 的分隔符是一個逗號
    (這就是CVS暗示,但不是所有的CVS文件使用逗號用於這一目的)
  • 字符串是總是引用
    (通常這是可選的,除非它們包含空格或引號)
  • 沒有斷行內引用的字符串
    (這通常是允許的)
  • 的代碼假定未引述一切都是一個號碼,但不會驗證該號碼的格式是正確的。

此代碼是絕對不能解析任何CSV數據,你給它,但是當你給它的是文件:

"Year","Make","Model" ,"Description", "Price" 
    1997,"Ford", "E350","ac, abs, moon", 3000.00 
1999,"Chevy","Venture ""Extended Edition""",,4900.00 
1999,"Chevy",  "Venture ""Extended Edition, Very Large""" , , 5000.00 
1996,"Jeep", "Grand Cherokee","MUST SELL!" 

這將產生輸出:

Token: >>>Year<<< 
Token: >>>Make<<< 
Token: >>>Model<<< 
Token: >>>Description<<< 
Token: >>>Price<<< 
------------------------------------------ 
Token: >>>1997<<< 
Token: >>>Ford<<< 
Token: >>>E350<<< 
Token: >>>ac, abs, moon<<< 
Token: >>>3000.00<<< 
------------------------------------------ 
Token: >>>1999<<< 
Token: >>>Chevy<<< 
Token: >>>Venture "Extended Edition"<<< 
Token: >>><<< 
Token: >>>4900.00<<< 
------------------------------------------ 
Token: >>>1999<<< 
Token: >>>Chevy<<< 
Token: >>>Venture "Extended Edition, Very Large"<<< 
Token: >>><<< 
Token: >>>5000.00<<< 
------------------------------------------ 
Token: >>>1996<<< 
Token: >>>Jeep<<< 
Token: >>>Grand Cherokee<<< 
Token: >>>MUST SELL!<<< 
------------------------------------------ 

而且它的唯一應該給你一個想法,你如何解析狀態機的複雜語法。這段代碼遠沒有生產質量,你可以看到,這樣的switch很快就會變得非常龐大,所以我至少要將狀態碼放入函數中,或者甚至將每個狀態轉換成類似於數據封裝的結構或對象的東西,否則這整件事很快變得難以管理。

+0

這必須是最好的例子,更不用提我最有用的信息了從這裏和其他地方的任何人收到,非常感謝你 – Sora

+0

@Sora原本我打算寫一個JSON解析器 - 不管你信不信,但這也不是太複雜,但你需要一個狀態堆棧。當您切換到新狀態時,您必須將最後一個狀態推送到該堆棧,以便稍後可以拉動並返回到該狀態。例如。在解析一個字符串之後,你需要知道你是否正在解析一個對象或一個數組,如果一個對象,你只是解析一個鍵或值?而實現這樣的堆棧只會使代碼複雜化,所以我去了CVS,因爲它不需要這樣的堆棧:-) – Mecki

相關問題