Trying to tokenize a string using spaces and quotes

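So I have been working on this for a while, and I have run into some strange problems. The end goal is to split an input string on spaces and quotes (i.e., this is a "very very complex" example becomes this, is, a, very very complex, example). Right now it seems to split the string correctly, except for the first token.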

Here it is (buff is passed in holding the value read by getline):

char **tokens = (char **)malloc(sizeof(char)); 
char *temp; 
int count = 0; 
int prev = 0; 
// Get tokens 
for (int i = 0; i <= strlen(command) && running; i++) { 
    if (i > prev && strncmp((buff + i), " ", 1) == 0) { 
     temp = (char **)realloc(tokens, (sizeof(char)) * WORD_SIZE * (++count)); 
     if (temp == NULL) { 
      fprintf(stderr, "Error in parsing: ran out of memory\n"); 
      running = false; 
      free(tokens); 
     } 
     else { 
      tokens = temp; 
      *(temp) = (buff + i); 
      strncpy(*(temp), "\0", 1); 
      temp = tokens + WORD_SIZE * (count - 1); 
      *(temp) = buff+prev; 
      prev = i+1; 
     } 
    } 
    else if (strncmp((buff + i), "\"", 1) == 0) { 
     *(temp) = (buff + i); 
      strncpy(*(temp), "\0", 1); 
     i++; 
     prev = i; 
     for (; strncmp((buff + i), "\"", 1) != 0; i++) { } 
     temp = (char **)realloc(tokens, (sizeof(char)) * WORD_SIZE * (++count)); 
     if (temp == NULL) { 
      fprintf(stderr, "Error in parsing: ran out of memory\n"); 
      running = false; 
      free(tokens); 
     } 
     else { 
      tokens = temp; 
      *(temp) = (buff + i); 
      strncpy(*(temp), "\0", 1); 
      temp = tokens + WORD_SIZE * (count - 1); 
      *(temp) = buff+prev; 
      prev = i+1; 
     } 
    } 
    else if (strncmp((buff + i), "\0", 1) == 0) { 
     temp = (char **)realloc(tokens, (sizeof(char)) * WORD_SIZE * (++count)); 
     if (temp == NULL) { 
      fprintf(stderr, "Error in parsing: ran out of memory\n"); 
      running = false; 
      free(tokens); 
     } 
     else { 
      tokens = temp; 
      temp = tokens + WORD_SIZE * (count - 1); 
      *(temp) = buff+prev; 
      prev = i+1; 
     } 
    } 
} 
for (int i = 0; i < count; i++) 
    printf("\t%i: %s\n", i, *tokens + sizeof(char) * WORD_SIZE * i);

Now, if I enter this is a test (without quotes), I get:
0:
1:
2: a
3: test

Quotes mess things up a bit more. For this \"is a\" very \"very complex\" test I get:
0:
1: is a
2:
3: very complex
4: test

2014-05-10 Michael

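First observation (but possibly unrelated to your problem): in 'char **tokens = (char **)malloc(sizeof(char));' you need to allocate 'sizeof(char *)', not 'sizeof(char)' (which is always '1'). (Also, there is no need to cast the result of 'malloc' in C.) – usr2564301

@Jongware Yes, I went ahead and changed that, but the pointer is reallocated before each token is added (including the first), so the malloc is really just a formality and its size does not matter. – Michael

Good catch @Jongware. Actually 'sizeof(char)' is _always_ 1. – Gene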
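The sizing point from these comments can be sketched on its own. The helper name alloc_token_slots is hypothetical and does not appear in the question; the sketch only assumes that the array holds char * elements.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for n token pointers. */
char **alloc_token_slots(size_t n)
{
    assert(n > 0);
    /* sizeof *slots is sizeof(char *), so the byte count follows the
       element type even if that type changes later. */
    char **slots = malloc(n * sizeof *slots);
    return slots;   /* no cast needed in C; NULL on allocation failure */
}
```

Writing the size as n * sizeof *slots avoids the sizeof(char) mistake, because the element type is taken from the pointer itself.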

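Here is a complete rewrite from scratch, because that was easier than reworking your own code (apologies if that was not what you intended). Some notes:

There is no need for an initial malloc. You can safely call realloc on a NULL pointer.
if (strncmp((buff + i), "\"", 1) == 0) - you can test buff[i] directly.
Why all that prev shuffling? :) It is enough to loop over your string.
I kept the temp test for a successful realloc because you have it as well. It is not really necessary in my code, since it just exits main.
Added: the character " also starts a new "word", even when no space precedes it.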
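The first two notes can be sketched in isolation. The helper names push_token and count_quotes are hypothetical, introduced only for illustration; the only library behaviour relied on is that realloc on a NULL pointer acts like malloc.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: append one pointer to a growing array.
   *tokens may start out as NULL, because realloc(NULL, n) behaves
   like malloc(n). Returns 0 on success, -1 on allocation failure. */
int push_token(char ***tokens, int *count, char *token)
{
    assert(tokens != NULL && count != NULL);
    char **temp = realloc(*tokens, sizeof **tokens * (size_t)(*count + 1));
    if (temp == NULL)
        return -1;              /* *tokens is unchanged and still valid */
    temp[*count] = token;
    *tokens = temp;
    (*count)++;
    return 0;
}

/* buff[i] means the same as *(buff + i), so a single character can be
   compared directly instead of calling strncmp with a length of 1. */
int count_quotes(const char *buff)
{
    int n = 0;
    for (size_t i = 0; buff[i] != '\0'; i++)
        if (buff[i] == '"')
            n++;
    return n;
}
```

Assigning the realloc result to a temporary first, as the question already does, keeps the old array reachable if the call fails.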

Code:

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 

int main (void) 
{ 
    char **tokens = NULL; 
    int i, count = 0, strcount; 
    char **temp, *iterate; 

    char *input = "this \"is a\" very \"very complex\" test"; 

    iterate = input; 

    if (iterate) 
    { 
     while (*iterate) 
     { 
      while (*iterate == ' ') 
       iterate++; 

      if (!*iterate) 
       break; 

      temp = realloc(tokens, sizeof(char *) * (count+1)); 
      if (temp == NULL) 
      { 
       fprintf(stderr, "Error in parsing: ran out of memory\n"); 
       return -1; 
      } 
      tokens = temp; 

      if (*iterate == '\"') 
      { 
       iterate++; 
       strcount = 0; 
       while (iterate[strcount] && iterate[strcount] != '\"') 
        strcount++; 
       tokens[count] = malloc(strcount+1); 
       strncpy (tokens[count], iterate, strcount); 
       tokens[count][strcount] = 0; 
       count++; 
       iterate += strcount; 
       if (*iterate == '\"') 
        iterate++; 
      } else 
      { 
       strcount = 0; 
       while (iterate[strcount] && iterate[strcount] != ' ' && iterate[strcount] != '\"') 
        strcount++; 
       tokens[count] = malloc(strcount+1); 
       strncpy (tokens[count], iterate, strcount); 
       tokens[count][strcount] = 0; 
       count++; 
       iterate += strcount; 
      } 
     } 
    } 

    for (i = 0; i < count; i++) 
     printf("\t%i: %s\n", i, tokens[i]); 

    return 0; 
}

Output for this "is a" very "very complex" test:

0: this 
1: is a 
2: very 
3: very complex 
4: test

2014-05-11 00:38:26 usr2564301

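Thanks. Unfortunately I don't have the best professor, so 95% of what I know is self-taught. That means there is a lot I miss, such as the fact that you can index pointers. – Michael

@Michael: in that case, +1 for showing effort! At the very least it was a brave attempt. – usr2564301

Yeah, I really don't like asking for help with non-conceptual problems, but I felt I was just going about it the wrong way. I know enough to see how bad my code was getting as I kept modifying it. – Michael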

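The replacement code you were given will work fine. Simple string parsing algorithms are nearly always easier to write, and produce more maintainable code, if you think about them in terms of a deterministic finite automaton (DFA) model. There are plenty of free references on DFAs on the web.

Here is a DFA that solves your problem.

[Image: dfa]

The meaning of [any] is "everything else". In other words, if no other transition matches, take this one. It becomes the default case in a C switch. The meaning of [eos] is "end of string", i.e., the null character.

Note that the DFA lets you deal with all cases systematically, for example a quote appearing in the middle of a word. Here I treat it as the end of the current word and the start of a new quoted word. If the specification changes, the DFA is easy to change, and the changes translate into code without laborious thought.

All that remains is to add "action code" to capture the start of each token and to overwrite characters with null terminators in the obvious places. In C, we have: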

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 

char **tokenize(char *str, int *n_tokens_rtn) 
{ 
    // State of the DFA. 
    enum { Error = -1, Start, InQuoted, InWord } state = Start; 

    // String pointer and current character 
    int cp = 0; 

#define CURRENT_CHAR (str[cp]) 
#define ADVANCE_TO_NEXT_CHAR do { ++cp; } while (0) 
#define MARK_END_OF_TOKEN do { str[cp] = '\0'; } while (0) 

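    // Token pointer and buffer. Allocate biggest possible and shrink at end. 
    // Worst case is ceil(len/2) tokens plus one slot for the NULL terminator. 
    int tp = 0; 
    char **tokens = malloc((2 + strlen(str)/2) * sizeof *tokens); 
    if (tokens == NULL) return NULL; 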

#define SAVE_TOKEN do { tokens[tp++] = &str[cp]; } while (0) 

    // Each iteration is one DFA transition. 
    for (;;) { 
    switch (state) { 
    case Start: 
     switch (CURRENT_CHAR) { 
     case '\0': 
     goto done_scanning; 

     case ' ': case '\t': case '\n': 
     ADVANCE_TO_NEXT_CHAR; 
     break; 

     case '"': 
     state = InQuoted; 
     ADVANCE_TO_NEXT_CHAR; 
     SAVE_TOKEN; 
     break; 

     default: 
     state = InWord; 
     SAVE_TOKEN; 
     ADVANCE_TO_NEXT_CHAR; 
     break; 
     } 
     break; 

    case InQuoted: 
     switch (CURRENT_CHAR) { 
     case '\0': 
     state = Error; // Missing close quote. 
     break; 

     case '"': 
     state = Start; 
     MARK_END_OF_TOKEN; 
     ADVANCE_TO_NEXT_CHAR; 
     break; 

     default: 
     ADVANCE_TO_NEXT_CHAR; 
     break; 
     } 
     break; 

    case InWord: 
     switch (CURRENT_CHAR) { 

     case '\0': 
     goto done_scanning; 

     case ' ': case '\t': case '\n': 
     state = Start; 
     MARK_END_OF_TOKEN; 
     ADVANCE_TO_NEXT_CHAR; 
     break; 

     case '"': // Word ended in quote, not space. 
     state = InQuoted; 
     MARK_END_OF_TOKEN; 
     ADVANCE_TO_NEXT_CHAR; 
     SAVE_TOKEN; 
     break; 

     default: 
     ADVANCE_TO_NEXT_CHAR; 
     break; 
     } 
     break; 

    case Error: 
     fprintf(stderr, "Syntax error.\n"); 
     goto done_scanning; 
    } 
    } 

done_scanning: 
    // Return number of tokens if caller is interested. 
    if (n_tokens_rtn) *n_tokens_rtn = tp; 

    // Append a null terminator for good measure. 
    tokens[tp++] = NULL; 

    // Trim the returned value to the right size. 
    return realloc(tokens, tp * sizeof *tokens); 
} 

int main(void) 
{ 
    char str[] = "this \"is a\" very \"very complex\" example"; 
    char **tokens = tokenize(str, NULL); 
    for (int i = 0; tokens[i]; i++) 
    printf("%s\n", tokens[i]); 
    return 0; 
}

2014-05-11 02:41:40 Gene

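From state InQuoted, a '"' should probably go to InWord rather than Start (at least if you want to handle mixed quoted/unquoted text the way a shell does, where something like 'ab"c d"ef' is the single token 'abc def'). –

@ChrisDodd Thanks. I treated an unquoted-to-quoted boundary as a token separator; he didn't specify. The nice thing about a DFA is that it more or less forces you to decide explicitly about special cases that ad hoc code often leaves unhandled. – Gene

+1 Besides appreciating both the problem-solving and the implementation side of this, I am also taking a CS theory class, so it is great to see a practical application. – Michael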
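The shell-like behaviour suggested in the first comment can be sketched as a small change to the transitions: a quote switches between InWord and InQuoted without ending the token. This is only an illustration under stated assumptions, not the answer's code. The function name tokenize_shell_like and the read/write indices rd and wr are hypothetical; quote characters are dropped by compacting the string in place, and the returned pointers point into str.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of the DFA tokenizer: quotes do not end a token,
   so ab"c d"ef becomes the single token "abc def". str is modified. */
char **tokenize_shell_like(char *str, int *n_tokens_rtn)
{
    assert(str != NULL);
    enum { Start, InWord, InQuoted } state = Start;
    size_t rd = 0, wr = 0;      /* read and write positions, wr <= rd */
    int tp = 0;
    /* At most ceil(len/2) tokens, plus one slot for the NULL terminator. */
    char **tokens = malloc((2 + strlen(str) / 2) * sizeof *tokens);
    if (tokens == NULL)
        return NULL;

    for (;;) {
        char c = str[rd];
        switch (state) {
        case Start:
            if (c == '\0')
                goto done;
            if (c == ' ' || c == '\t' || c == '\n') {
                rd++;
                break;
            }
            tokens[tp++] = &str[wr];    /* a token starts here */
            if (c == '"') {
                state = InQuoted;
            } else {
                str[wr++] = c;
                state = InWord;
            }
            rd++;
            break;
        case InWord:
            if (c == '\0') {
                str[wr] = '\0';
                goto done;
            }
            if (c == ' ' || c == '\t' || c == '\n') {
                str[wr++] = '\0';       /* end of token */
                state = Start;
            } else if (c == '"') {
                state = InQuoted;       /* same token continues, quoted */
            } else {
                str[wr++] = c;
            }
            rd++;
            break;
        case InQuoted:
            if (c == '\0') {
                fprintf(stderr, "Syntax error: missing close quote.\n");
                str[wr] = '\0';
                goto done;
            }
            if (c == '"')
                state = InWord;         /* same token continues, unquoted */
            else
                str[wr++] = c;
            rd++;
            break;
        }
    }

done:
    if (n_tokens_rtn) *n_tokens_rtn = tp;
    tokens[tp] = NULL;
    return tokens;
}
```

Called on a writable buffer holding ab"c d"ef gh, this sketch yields the two tokens abc def and gh. Only the returned array needs to be freed, since the tokens themselves live inside the input buffer.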

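This looked like a relatively simple problem, so rather than write a full parser, I wrote a solution that lets the standard C library do the heavy lifting. Judge for yourself whether this solution is appealing. There are probably ways to improve on what I have done and make the code a bit clearer, but I will leave that as an exercise for anyone so inclined.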

#include <stdlib.h> 
#include <stdio.h> 
#include <string.h> 

int main() 
{ 
    char input_string[] = "this \"is a\" very \"very complex\" test"; 
    char **tokens = NULL; 
    int token_count = 0; 
    char *ptr = input_string; 
    int i; 
    char *next_ptr = ptr; 

    while (*ptr && next_ptr) 
    { 
     while (*ptr == ' ') ptr++; 
     tokens = realloc(tokens, ++token_count * sizeof(char *)); 
     if (tokens == NULL) 
      return -1; 
     if (*ptr == '"') 
      next_ptr = strchr(ptr+1, '"'); 
     else 
      next_ptr = strpbrk(ptr, " \""); 
     if (next_ptr) 
     { 
      tokens[token_count-1] = malloc(sizeof(char) * (next_ptr - (ptr+(*ptr=='"'))) + 1); 
      if (tokens[token_count-1] == NULL) 
       return -1; 
      strncpy(tokens[token_count-1], (ptr+(*ptr=='"')), next_ptr - (ptr+(*ptr=='"'))); 
      tokens[token_count-1][next_ptr - (ptr+(*ptr=='"'))] = 0; 
      ptr = next_ptr + (*ptr=='"'); 
     } 
     else 
      tokens[token_count-1] = strdup(ptr+(*ptr=='"')); 
    } 

    for (i = 0; i < token_count; ++i) 
     printf("[%d]: %s\n", i, tokens[i]); 

    return 0; 
}

Output:

[0]: this 
[1]: is a 
[2]: very 
[3]: very complex 
[4]: test
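One possible take on the exercise mentioned above, offered as a sketch rather than as the answer's method: strspn and strcspn can replace the repeated (ptr+(*ptr=='"')) expressions. The names split_tokens and copy_span are hypothetical. Unlike the program above, this version does not emit an empty token for trailing spaces, and it returns a NULL-terminated array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy n bytes of s into a new NUL-terminated string. */
static char *copy_span(const char *s, size_t n)
{
    char *out = malloc(n + 1);
    if (out != NULL) {
        memcpy(out, s, n);
        out[n] = '\0';
    }
    return out;
}

/* Split input on spaces, treating "..." as one token. Returns a
   NULL-terminated array (the caller frees each token and the array),
   or NULL on allocation failure. count_out may be NULL. */
char **split_tokens(const char *input, size_t *count_out)
{
    assert(input != NULL);
    size_t count = 0;
    const char *p = input;
    char **tokens = malloc(sizeof *tokens);
    if (tokens == NULL)
        return NULL;
    tokens[0] = NULL;

    for (;;) {
        p += strspn(p, " ");                /* skip leading spaces */
        if (*p == '\0')
            break;
        const char *start = p;
        size_t len;
        if (*p == '"') {
            start = p + 1;
            len = strcspn(start, "\"");     /* up to closing quote or end */
            p = start + len;
            if (*p == '"')
                p++;                        /* step over the closing quote */
        } else {
            len = strcspn(start, " \"");    /* up to a space or a quote */
            p = start + len;
        }
        char **grown = realloc(tokens, (count + 2) * sizeof *tokens);
        if (grown != NULL)
            tokens = grown;
        char *tok = (grown != NULL) ? copy_span(start, len) : NULL;
        if (tok == NULL) {                  /* clean up everything so far */
            for (size_t i = 0; i < count; i++)
                free(tokens[i]);
            free(tokens);
            return NULL;
        }
        tokens[count++] = tok;
        tokens[count] = NULL;
    }
    if (count_out != NULL)
        *count_out = count;
    return tokens;
}
```

For the input this "is a" very "very complex" test, the sketch produces the same five tokens as the output shown above.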

2014-05-11 05:00:19
