2013-04-18 22 views
4

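Removing comments with a sliding window without nested while loops

I am trying to remove comments and strings from C code. I will stick to comments for the examples. I have a sliding window, so at any given moment I only have characters n and n-1. If possible, I am trying to work out an algorithm that does not use nested whiles, although I do need one while to getchar through the input. My first thought was to loop until n=* and (n-1)=/, then loop until n=/ and (n-1)=*, but since that nests whiles I feel it is inefficient. I can do it this way if I have to, but I was wondering whether anyone has a better solution.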

+4

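Try formulating a state machine: whenever you encounter a '*', '/', '\', '"' or a single quote, you update your state based on your previous state. (A harder case is a comment delimiter '*/' split across lines with a backslash-newline: '*\', newline, '/'.) –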

+0

A state machine is probably the best way to conceptualize this. For '/* foo */' style C comments you would probably have four states: 'normal', 'normal-saw-slash', 'comment' and 'comment-seen-star'. – Will

+7

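Do you need to handle trigraphs? Do you have to handle backslash-newline between the '/' and '*' that start a comment (or between the '/' and '/' of a C++ style comment, or between the '*' and '/' that end a C style comment)? Do you have to handle backslash-newline at the end of a C++ style comment? Do you handle multi-character character constants such as '/*', which do not start a comment? Clearly, "/* this is not a comment */" is not a comment; it is a string that says it is not a comment. (Much like Magritte and his "Ceci n'est pas une pipe" picture; Google it.) –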
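One way to keep the backslash-newline cases out of the comment logic is to remove line splices in a separate pass first, which mirrors the order of C's translation phases (splicing happens before comments are recognized). The following is only a minimal sketch under stated assumptions: the input is a NUL-terminated string already in memory, the name `splice_lines` is hypothetical, and trigraphs (including `??/`, which stands for a backslash) are not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy src to dst, dropping every backslash that is
 * immediately followed by a newline, together with that newline.
 * dst must have room for at least strlen(src) + 1 bytes.
 * Returns the number of characters written, excluding the terminator. */
size_t splice_lines(const char *src, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    while (*src != '\0')
    {
        if (src[0] == '\\' && src[1] == '\n')
        {
            src += 2;       /* skip the splice and emit nothing */
            continue;
        }
        dst[n++] = *src++;  /* ordinary character: copy it through */
    }
    dst[n] = '\0';
    return n;
}
```

After this pass, a delimiter written as `*\`, newline, `/` reaches the comment stripper as a plain `*/`, so the state machine itself does not need extra states for it.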

Answers

2

Written with a single while loop, the algorithm might look like this:

while ((c = getchar()) != EOF) 
{ 
    ... // looking at the byte that was just read 

    if (...) // the symbol is not inside a comment 
    { 
     putchar(c); 
    } 
} 

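To determine whether an input char belongs to a comment, you can use a state machine. In the example below it has 4 states, together with the rules for moving to the next state.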

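int c;          // int rather than char, so that EOF stays distinguishable 
int state = 0;  // 0: code, 1: saw '/', 2: in comment, 3: in comment, saw '*' 
int next_state; 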
while ((c = getchar()) != EOF) 
{ 
    switch (state) 
    { 
     case 0: next_state = (c == '/' ? 1 : 0); break; 
     case 1: next_state = (c == '*' ? 2 : c == '/' ? 1 : 0); break; 
     case 2: next_state = (c == '*' ? 3 : 2); break; 
     case 3: next_state = (c == '/' ? 0 : c == '*' ? 3 : 2); break; 
     default: next_state = state; // will never happen 
    } 

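    if (state == 1 && next_state != 2) 
    { 
     putchar('/'); // the pending slash did not start a comment, so emit it 
    } 
    if (next_state == 0 && state != 3) 
    { 
     putchar(c); // ordinary character (but not the '/' that closes a comment) 
    } 
    state = next_state; 
} 
if (state == 1) 
{ 
    putchar('/'); // input ended right after a slash that was being held back 
} 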

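The example above is very simple: it does not handle /* correctly in non-comment contexts such as inside C strings, it does not support // comments, and so on.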
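As a rough illustration of how the same idea extends to those cases, here is a sketch with extra states for `//` comments, string literals and character constants. It is not the answer author's code: the function name `strip_comments`, the state names, and the choice to work on an in-memory string instead of getchar() are all assumptions made for the example. Strings are copied through unchanged, each block comment is replaced by one space so that `a/**/b` does not fuse into `ab`, and backslash-newline and trigraphs are ignored.

```c
#include <assert.h>
#include <stddef.h>

enum state { CODE, SLASH, BLOCK, BLOCK_STAR, LINE, STR, STR_ESC, CHR, CHR_ESC };

/* Copy src to dst without comments; dst needs strlen(src) + 1 bytes.
 * Returns the number of characters written, excluding the terminator. */
size_t strip_comments(const char *src, char *dst)
{
    enum state s = CODE;
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++)
    {
        char c = *src;

        switch (s)
        {
        case CODE:
            if (c == '/') { s = SLASH; break; }     /* hold the slash back */
            if (c == '"') s = STR;
            else if (c == '\'') s = CHR;
            dst[n++] = c;
            break;
        case SLASH:
            if (c == '*') { s = BLOCK; break; }
            if (c == '/') { s = LINE; break; }
            dst[n++] = '/';                         /* false start: emit held slash */
            s = (c == '"') ? STR : (c == '\'') ? CHR : CODE;
            dst[n++] = c;
            break;
        case BLOCK:
            if (c == '*') s = BLOCK_STAR;
            break;
        case BLOCK_STAR:
            if (c == '/') { s = CODE; dst[n++] = ' '; } /* comment becomes one space */
            else if (c != '*') s = BLOCK;
            break;
        case LINE:
            if (c == '\n') { s = CODE; dst[n++] = c; }  /* keep the newline */
            break;
        case STR:
            if (c == '\\') s = STR_ESC;
            else if (c == '"') s = CODE;
            dst[n++] = c;
            break;
        case STR_ESC:                               /* escaped char cannot end the string */
            s = STR;
            dst[n++] = c;
            break;
        case CHR:
            if (c == '\\') s = CHR_ESC;
            else if (c == '\'') s = CODE;
            dst[n++] = c;
            break;
        case CHR_ESC:
            s = CHR;
            dst[n++] = c;
            break;
        }
    }
    if (s == SLASH)
        dst[n++] = '/';                             /* input ended on a held slash */
    dst[n] = '\0';
    return n;
}
```

Because each step looks only at the current character and the current state, the body of the switch could be driven one character at a time from getchar() in a single while loop.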

+0

I will eventually extend this to handle strings, chars and // comments. –

2

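Doing this correctly is much more complicated than it first appears, as the other comments point out. I strongly recommend writing a table-driven FSM, using a state transition diagram to get the transitions right. Trying to handle more than a few states with case statements is extremely error-prone, IMO.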

Below is a diagram in dot/graphviz format from which you can code the state table directly. Note that I have not tested this, so YMMV.

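The semantics of the diagram are that <ch> is a fall-through: it is taken only if no other input matches in that state. End of file is an error in any state other than S0, and so is any character that is neither listed explicitly nor covered by <ch>. Every scanned character is printed, except inside a comment (S4 and S5) and while detecting the start of a comment (S1). While detecting the start of a comment you have to buffer the characters: print them if it turns out to be a false start, otherwise throw them away once it is certain that this is a real comment.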

In the dot diagram, sq is a single quote ' and dq is a double quote "

digraph state_machine { 
    rankdir=LR; 
    size="8,5"; 

    node [shape=doublecircle]; S0 /* init */; 
    node [shape=circle]; 

    S0 /* init */  -> S1 /* begin_cmt */ [label = "'/'"]; 
    S0 /* init */  -> S2 /* in_str */ [label = dq]; 
    S0 /* init */  -> S3 /* in_ch */  [label = sq]; 
    S0 /* init */  -> S0 /* init */  [label = "<ch>"]; 
    S1 /* begin_cmt */ -> S4 /* in_slc */ [label = "'/'"]; 
    S1 /* begin_cmt */ -> S5 /* in_mlc */ [label = "'*'"]; 
    S1 /* begin_cmt */ -> S0 /* init */  [label = "<ch>"]; 
    S1 /* begin_cmt */ -> S1 /* begin_cmt */ [label = "'\\n'"]; // handle "/\n/" and "/\n*" 
    S2 /* in_str */ -> S0 /* init */  [label = dq]; 
    S2 /* in_str */ -> S6 /* str_esc */ [label = "'\\'"]; 
    S2 /* in_str */ -> S2 /* in_str */ [label = "<ch>"]; 
    S3 /* in_ch */  -> S0 /* init */  [label = sq]; 
    S4 /* in_slc */ -> S4 /* in_slc */ [label = "<ch>"]; 
    S4 /* in_slc */ -> S0 /* init */  [label = "'\\n'"]; 
    S5 /* in_mlc */ -> S7 /* end_mlc */ [label = "'*'"]; 
    S5 /* in_mlc */ -> S5 /* in_mlc */ [label = "<ch>"]; 
    S7 /* end_mlc */ -> S7 /* end_mlc */ [label = "'*'|'\\n'"]; 
    S7 /* end_mlc */ -> S0 /* init */  [label = "'/'"]; 
    S7 /* end_mlc */ -> S5 /* in_mlc */ [label = "<ch>"]; 
    S6 /* str_esc */ -> S8 /* oct */  [label = "[0-3]"]; 
    S6 /* str_esc */ -> S9 /* hex */  [label = "'x'"]; 
    S6 /* str_esc */ -> S2 /* in_str */ [label = "<ch>"]; 
    S8 /* oct */  -> S10 /* o1 */  [label = "[0-7]"]; 
    S10 /* o1 */  -> S2 /* in_str */ [label = "[0-7]"]; 
    S9 /* hex */  -> S11 /* h1 */  [label = hex]; 
    S11 /* h1 */  -> S2 /* in_str */ [label = hex]; 
    S3 /* in_ch */  -> S12 /* ch_esc */ [label = "'\\'"]; 
    S3 /* in_ch */  -> S13 /* out_ch */ [label = "<ch>"]; 
    S13 /* out_ch */ -> S0 /* init */  [label = sq]; 
    S12 /* ch_esc */ -> S3 /* in_ch */  [label = sq]; 
    S12 /* ch_esc */ -> S12 /* ch_esc */ [label = "<ch>"]; 
} 
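
To show what "table-driven" can look like in C, here is a minimal sketch that covers only the four block-comment states, not the full diagram above. The names (`table`, `classify`, `strip_block_comments`) and the split into character classes and actions are assumptions made for this illustration; strings, character constants and `//` comments are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

enum cls   { C_SLASH, C_STAR, C_OTHER, N_CLS };        /* input character classes */
enum state { S_CODE, S_SLASH, S_CMT, S_STAR, N_STATE };
enum act   { A_SKIP, A_EMIT, A_SLASH, A_BOTH };        /* what to write on a transition */

struct cell { enum state next; enum act act; };

/* One row per state, one column per character class. */
static const struct cell table[N_STATE][N_CLS] = {
    /*              '/'                   '*'                 other          */
    /* S_CODE  */ { {S_SLASH, A_SKIP },  {S_CODE, A_EMIT},  {S_CODE, A_EMIT} },
    /* S_SLASH */ { {S_SLASH, A_SLASH},  {S_CMT,  A_SKIP},  {S_CODE, A_BOTH} },
    /* S_CMT   */ { {S_CMT,   A_SKIP },  {S_STAR, A_SKIP},  {S_CMT,  A_SKIP} },
    /* S_STAR  */ { {S_CODE,  A_SKIP },  {S_STAR, A_SKIP},  {S_CMT,  A_SKIP} },
};

static enum cls classify(char c)
{
    return c == '/' ? C_SLASH : c == '*' ? C_STAR : C_OTHER;
}

/* Copy src to dst without block comments; dst needs strlen(src) + 1 bytes.
 * Returns the number of characters written, excluding the terminator. */
size_t strip_block_comments(const char *src, char *dst)
{
    enum state s = S_CODE;
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++)
    {
        const struct cell *t = &table[s][classify(*src)];

        if (t->act == A_SLASH || t->act == A_BOTH)
            dst[n++] = '/';     /* release the slash that was held back */
        if (t->act == A_EMIT || t->act == A_BOTH)
            dst[n++] = *src;    /* write the current character */
        s = t->next;
    }
    if (s == S_SLASH)
        dst[n++] = '/';         /* input ended while a slash was held */
    dst[n] = '\0';
    return n;
}
```

The loop body never changes when states are added; supporting more of the diagram means adding rows, columns and actions to the table, which is easier to check against a drawing than a growing switch.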
1

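Since you only want to use a two-character buffer and a single while loop, I suggest a third char to track your state (whether or not you are skipping text). I have put together a test program for you, with inline comments explaining the logic: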

// Program to strip comments and strings from a C file 
// 
// Build: 
//  gcc -o strip-comments strip-comments.c 
// 
// Test: 
//  ./strip-comments strip-comments.c 

#include <stdio.h> 
#include <sys/types.h> 
#include <sys/uio.h> 
#include <fcntl.h> 
#include <unistd.h> 
#include <stdlib.h> 

/* The following is a block of strings, and comments for testing 
* the code. 
*/ 
/* test if three comments *//* chained together */// will be removed. 
static int value = 128 /* test comment within valid code *// 2; 
const char * test1 = "This is a test of \" processing"; /* testing inline comment */ 
const char * test2 = "this is a test of \n within strings."; // testing inline comment 
// this is a the last test 


int strip_c_code(FILE * in, FILE * out) 
{ 
    char  buff[2]; 
    char  skipping; 
    int   ch;  // fgetc() returns an int; a char cannot reliably hold EOF 

    skipping = '\0'; 
    buff[0] = '\0'; 
    buff[1] = '\0'; 

    // loop through the file 
    while((ch = fgetc(in)) != EOF) 
    { 
     buff[0] = (char) ch; 

     // checking for start of comment or string block 
     if (!(skipping)) 
     { 
     // start skipping in "//" comments 
     if ((buff[1] == '/') && (buff[0] == '/')) 
      skipping = '/'; 

     // start skipping in "/*" comments 
     else if ((buff[1] == '/') && (buff[0] == '*')) 
      skipping = '*'; 

     // start skipping at start of strings, but not character assignments 
     else if (((buff[1] != '\'') && (buff[0] == '"')) && 
        ((buff[1] != '\\') && (buff[0] == '"'))) 
     { 
      fputc(buff[1], out); 
      skipping = '"'; 
     }; 

     // clear buffer so that processed characters are not interpreted as 
     // end of skip characters. 
     if ((skipping)) 
     { 
      buff[0] = '\0'; 
      buff[1] = '\0'; 
     }; 
     }; 

     // check for characters which terminate skip block 
     switch(skipping) 
     { 
     // if skipping "//" comments, look for new line 
     case '/': 
     if (buff[1] == '\n') 
      skipping = '\0'; 
     break; 

     // if skipping "/*" comments, look for "*/" terminating string 
     case '*': 
     if ((buff[1] == '*') && (buff[0] == '/')) 
     { 
      buff[0] = '\0'; 
      buff[1] = '\0'; 
      skipping = '\0'; 
     }; 
     break; 

     // if skipping strings, look for terminating '"' character 
     case '"': 
     if ((buff[1] != '\\') && (buff[0] == '"')) 
     { 
      skipping = '\0'; 
      buff[0] = '\0'; 
      buff[1] = '\0'; 
      fprintf(out, "NULL"); // replace string with NULL 
     }; 
     break; 

     default: 
     break; 
     }; 

     // if not skipping, write character out 
     if ((!(skipping)) && ((buff[1]))) 
     fputc(buff[1], out); 

     // shift new character to old character position 
     buff[1] = buff[0]; 
    }; 

    // verify that the comment or string was terminated properly 
    if ((skipping)) 
    { 
     fprintf(stderr, "Unterminated comment or string\n"); 
     return(-1); 
    }; 

    // write last character, unless the buffer was cleared (avoids emitting a NUL byte) 
    if ((buff[1])) 
     fputc(buff[1], out); 

    return(0); 
} 


int main(int argc, char * argv[]) 
{ 
    FILE * fs; 

    if (argc != 2) 
    { 
     fprintf(stderr, "Usage: %s <filename>\n", argv[0]); 
     return(1); 
    }; 

    if ((fs = fopen(argv[1], "r")) == NULL) 
    { 
     perror("fopen()"); 
     return(1); 
    }; 

    strip_c_code(fs, stdout); 

    fclose(fs); 

    return(0); 
} 

/* end of source file */ 

I have also posted the code on GitHub to make it easier to download and compile:

https://gist.github.com/syzdek/5417109