如果你想在C中編寫一個非常複雜的語法分析器,而不使用任何現有的模式匹配代碼,通常最好是實現一個狀態機,然後通過char處理源代碼char。 Flex + Bison的輸出也只是一個狀態機器。 Flex使用正則表達式將字符串標記爲令牌,然後將令牌傳遞給Bison狀態機,根據機器的當前狀態逐個處理一個令牌。但是您不需要正則表達式標記器,您可以將輸入標記爲狀態機處理的一部分。正則表達式匹配器本身也可以作爲一個狀態機來實現,因此令牌生成可以直接作爲狀態機的一部分。
這裏有一個有趣的鏈接給你;它不是特別多的一般概述狀態機是如何工作的C,但一旦你得到了這個概念,很容易轉置到C代碼:
Parsing command line arguments using a finite state machine and backtracking
這裏有一個超級原始CSV解析器的一些示例代碼:
#include <stdlib.h>
#include <stdio.h>
static char currentToken[4096];
static size_t currentTokenLength;
static
void addCharToCurrentToken (char c) {
if (currentTokenLength < sizeof(currentToken)) {
currentToken[currentTokenLength++] = c;
}
}
static
void printCurrentToken () {
printf("Token: >>>%.*s<<<\n", (int)currentTokenLength, currentToken);
currentTokenLength = 0;
}
typedef enum {
STATE_FindStartOfData,
STATE_FindStartOfToken,
STATE_ParseNumber,
STATE_ParseString,
STATE_CheckEndOfString,
STATE_FindDelimiter,
STATE_ParseError,
STATE_EndOfData
} ParserState;
ParserState parserState = STATE_FindStartOfData;
static
void runTheStateMachine () {
while (parserState != STATE_ParseError
&& parserState != STATE_EndOfData
) {
int c = fgetc(stdin);
// End of data?
if (c == -1) {
switch (parserState) {
case STATE_ParseNumber:
case STATE_CheckEndOfString:
printCurrentToken();
parserState = STATE_EndOfData;
break;
case STATE_ParseString:
// Data ends in the middle of token parsing? No way!
fprintf(stderr, "Data ended abruptly!\n");
parserState = STATE_ParseError;
break;
case STATE_FindStartOfData:
case STATE_FindStartOfToken:
case STATE_FindDelimiter:
// This is okay, data stream may end while in these states
parserState = STATE_EndOfData;
break;
case STATE_ParseError:
case STATE_EndOfData:
break;
}
}
switch (parserState) {
case STATE_FindStartOfData:
// Skip blank lines
if (c == '\n' || c == '\r') break;
// !!!FALLTHROUGH!!!
case STATE_FindStartOfToken:
// Skip overe all whitespace
if (c == ' ' || c == '\t') break;
// Start of string?
if (c == '"') {
parserState = STATE_ParseString;
break;
}
// Blank field?
if (c == ',') {
printCurrentToken();
break;
}
// End of dataset?
if (c == '\n' || c == '\r') {
printf("------------------------------------------\n");
parserState = STATE_FindStartOfData;
break;
}
// Everything else can only be a number
parserState = STATE_ParseNumber;
addCharToCurrentToken(c);
break;
case STATE_ParseNumber:
if (c == ' ' || c == '\t') {
// Numbers cannot contain spaces in the middle,
// so this must be the end of the number.
printCurrentToken();
// We still need to find the real delimiter, though.
parserState = STATE_FindDelimiter;
break;
}
if (c == ',') {
// This time the number ends directly with a delimiter
printCurrentToken();
parserState = STATE_FindStartOfToken;
break;
}
// End of dataset?
if (c == '\n' || c == '\r') {
printCurrentToken();
printf("------------------------------------------\n");
parserState = STATE_FindStartOfData;
break;
}
// Otherwise keep reading the number
addCharToCurrentToken(c);
break;
case STATE_ParseString:
if (c == '"') {
// Either this is the regular end of the string or it is just an
// escaped quotation mark which is doubled ("") in CVS.
parserState = STATE_CheckEndOfString;
break;
}
// All other chars are just treated as ordinary chars
addCharToCurrentToken(c);
break;
case STATE_CheckEndOfString:
if (c == '"') {
// Next char is also a quotation mark,
// so this was not the end of the string.
addCharToCurrentToken(c);
parserState = STATE_ParseString;
break;
}
if (c == ' ' || c == '\t') {
// It was the end of the string
printCurrentToken();
// We still need to find the real delimiter, though.
parserState = STATE_FindDelimiter;
break;
}
if (c == ',') {
// It was the end of the string
printCurrentToken();
// And we even found the delimiter
parserState = STATE_FindStartOfToken;
break;
}
if (c == '\n' || c == '\r') {
// It was the end of the string
printCurrentToken();
// And we even found the end of this dataset
printf("------------------------------------------\n");
parserState = STATE_FindStartOfData;
break;
}
// Everything else is a parse error I guess
fprintf(stderr, "Unexpected char 0x%02X after end of string!\n", c);
parserState = STATE_ParseError;
break;
case STATE_FindDelimiter:
// Delemiter found?
if (c == ',') {
parserState = STATE_FindStartOfToken;
break;
}
// Just skip overe all whitespace
if (c == ' ' || c == '\t') break;
// End of dataset?
if (c == '\n' || c == '\r') {
// And we even found the end of this dataset
printf("------------------------------------------\n");
parserState = STATE_FindStartOfData;
break;
}
// Anything else a pare error I guess
fprintf(stderr, "Unexpected char 0x%02X after end of token!\n", c);
parserState = STATE_ParseError;
break;
case STATE_ParseError:
// Nothing to do
break;
case STATE_EndOfData:
// Nothing to do
break;
}
}
}
int main () {
runTheStateMachine();
return (parserState == STATE_EndOfData ? 0 : 1);
}
的代碼作以下假設:
- 令牌是從來不大於4096個字符。
- 的分隔符是一個逗號
(這就是CVS暗示,但不是所有的CVS文件使用逗號用於這一目的)
- 字符串是總是引用
(通常這是可選的,除非它們包含空格或引號)
- 沒有斷行內引用的字符串
(這通常是允許的)
- 的代碼假定未引述一切都是一個號碼,但不會驗證該號碼的格式是正確的。
此代碼是絕對不能解析任何CSV數據,你給它,但是當你給它的是文件:
"Year","Make","Model" ,"Description", "Price"
1997,"Ford", "E350","ac, abs, moon", 3000.00
1999,"Chevy","Venture ""Extended Edition""",,4900.00
1999,"Chevy", "Venture ""Extended Edition, Very Large""" , , 5000.00
1996,"Jeep", "Grand Cherokee","MUST SELL!"
這將產生輸出:
Token: >>>Year<<<
Token: >>>Make<<<
Token: >>>Model<<<
Token: >>>Description<<<
Token: >>>Price<<<
------------------------------------------
Token: >>>1997<<<
Token: >>>Ford<<<
Token: >>>E350<<<
Token: >>>ac, abs, moon<<<
Token: >>>3000.00<<<
------------------------------------------
Token: >>>1999<<<
Token: >>>Chevy<<<
Token: >>>Venture "Extended Edition"<<<
Token: >>><<<
Token: >>>4900.00<<<
------------------------------------------
Token: >>>1999<<<
Token: >>>Chevy<<<
Token: >>>Venture "Extended Edition, Very Large"<<<
Token: >>><<<
Token: >>>5000.00<<<
------------------------------------------
Token: >>>1996<<<
Token: >>>Jeep<<<
Token: >>>Grand Cherokee<<<
Token: >>>MUST SELL!<<<
------------------------------------------
而且它的唯一應該給你一個想法,你如何解析狀態機的複雜語法。這段代碼遠沒有生產質量,你可以看到,這樣的switch
很快就會變得非常龐大,所以我至少要將狀態碼放入函數中,或者甚至將每個狀態轉換成類似於數據封裝的結構或對象的東西,否則這整件事很快變得難以管理。
有關於「編譯器構造」的書籍。來自N.Wirth的同名名字很容易閱讀。它不會太深,但是一個好的入口。 – Olaf
說:你的問題是離題(拓寬拓寬並要求外部資源) – Olaf
我遇到的問題是,我已經找到更多關於他們如何工作和部分如何構造,並沒有在實際的代碼在他們的經營背後。我是一名視覺學習者,只是看到一個例子,我就能夠向前邁進。 – Sora