Parsing noun phrases from tagged text using C/C++

I want to extract the noun phrases (NN, NNP, NNS, NNPS) from part-of-speech-tagged text. For example:

Input sentence - 
John/NNP 
works/VBZ 
in/IN 
oil/NN 
industry/NN 
./. 
Output: John Oil Industry

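I am confused about the logic, because I need to search for the strings /NN, /NNP, /NNS and /NNPS and print the word that precedes each of them. What is the logic for parsing noun phrases in C or C++?

My own attempt is the following: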

char* SplitString(char* str, char sep) 
{ 
    return str; 
} 
main() 
{ 
    char* input = "John/NNP works/VBZ in/IN oil/NN industry/NN ./."; 
    char *output, *temp; 
    char * field; 
    char sep = '/NNP'; 
    int cnt = 1; 
    output = SplitString(input, sep); 

    field = output; 
    for(temp = field; *temp; ++temp){ 
     if (*temp == sep){ 
      printf(" %.*s\n", temp-field, field); 
      field = temp+1; 
     } 
    } 
    printf("%.*s\n", temp-field, field); 
}

My revised version is as follows:

#include <regex> 
#include <iostream> 

int main() 
{ 
    const std::string s = "John/NNP works/VBZ in/IN oil/NNS industry/NNPS ./."; 
    std::regex rgx("(\\w+)\/NN[P-S]{0,2}"); 
    std::smatch match; 

    if (std::regex_search(s.begin(), s.end(), match, rgx)) 
     std::cout << " " << match[1] << '\n'; 
}

The only output I get is "John". The words with the other tags, such as /NNS, do not appear.

My second approach:

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <assert.h> 

char** str_split(char* a_str, const char a_delim) 
{ 
    char** result = 0; 
    size_t count = 0; 
    char* tmp = a_str; 
    char* last_comma = 0; 
    char delim[2]; 
    delim[0] = a_delim; 
    delim[1] = 0; 

    /* Count how many elements will be extracted. */ 
    while (*tmp) 
    { 
     if (a_delim == *tmp) 
     { 
      count++; 
      last_comma = tmp; 
     } 
     tmp++; 
    } 

    /* Add space for trailing token. */ 
    count += last_comma < (a_str + strlen(a_str) - 1); 

    /* Add space for terminating null string so caller 
     knows where the list of returned strings ends. */ 
    count++; 

    result = malloc(sizeof(char*) * count); 

    if (result) 
    { 
     size_t idx = 0; 
     char* token = strtok(a_str, delim); 

     while (token) 
     { 
      assert(idx < count); 
      *(result + idx++) = strdup(token); 
      token = strtok(0, delim); 
     } 
     assert(idx == count - 1); 
     *(result + idx) = 0; 
    } 

    return result; 
} 

int main() 
{ 
    char text[] = "John/NNP works/VBZ in/IN oil/NN industry/NN ./."; 
    char** tokens; 

    //printf("INPUT SENTENCE=[%s]\n\n", text); 

    tokens = str_split(text, ''); 

    if (tokens) 
    { 
     int i; 
     for (i = 0; *(tokens + i); i++) 
     { 
      printf("[%s]\n", *(tokens + i)); 
      free(*(tokens + i)); 
     } 
     printf("\n"); 
     free(tokens); 
    } 

    return 0; 
}

with the output:

[John/NNP] 
[works/VBZ] 
[in/IN] 
[oil/NN] 
[industry/NN] 
[./.]

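I only want the words tagged /NNP and /NN, i.e. John, oil and industry. How do I get this? Would regular expressions help? How can I use regular expressions in C the same way as in C++?

Source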

2015-11-06 New_Programmer

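I am confused about the logic. I am trying to search for strings such as /NN, /NNP, /NNS and /NNPS, and then print all characters before the "/" back to the preceding space. –

@New_Programmer That should just about work. – Magisch

@Haris No, this is not called natural language processing. It is a simple parsing problem. – Identity1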

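Your line "Add space for trailing token" is unnecessary, because strtok stops automatically at the terminating zero.

Also, tokens = str_split(text, ''); cannot be right, because your str_split expects a single character for a_delim and you pass '', for which my compiler (clang) reports

error: empty character constant

Presumably you meant to feed it a space ' ', but I did not test whether that works by itself. (Even though you got some form of output anyway.)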

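Your code returns [John/NNP] (etc.) because you do nothing further to split off the tag names, nor do you test them against your list of wanted tags. A C program only does what you tell it to do; that is what programming is about.

I suggest a straightforward solution in plain C, using only the string tokenization function strtok, the single-character lookup strchr, and the string comparison strcmp.

My routine tokenizes the input string on spaces, splitting off one word at a time (note: for this to work, strtok needs to be able to modify the input string), locates the slash in that token, compares the text after the slash against the list of wanted tags, and outputs the word before the slash if its tag is in the list.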

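1. After each call to strtok, the pointer token points to the start of the next word, which is already zero-terminated. So the first token will be John/NNP.
2. Then strchr tries to find the slash and, if one is found, stores its position in slash.
3. If that succeeds, slash points to the slash itself, so the tag to test starts at slash+1.
4. A simple loop compares it against each tag name in the wanted list. If it is found, *slash is set to 0, overwriting the slash, so the current token string ends just before it. The current token is then output.
5. Whether a tag was found or not, strtok is called again in the loop until it fails. If it finds a next token, the loop goes back to #2; otherwise it exits.

This program: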

#include <stdio.h> 
#include <string.h> 

int main() 
{ 
    /* input */ 
    char text[] = "John/NNP works/VBZ in/IN oil/NN industry/NN ./."; 
    char *wanted[] = { "NN", "NNP", "NNS", "NNPS" }; 

    /* helper variables */ 
    size_t i; 
    char *token, *slash; 

    token = strtok(text, " "); 
    while (token) 
    { 
     slash = strchr (token, '/'); 
     if (slash && slash[1]) 
     { 
      for (i=0; i<sizeof(wanted)/sizeof(wanted[0]); i++) 
      { 
       if (!strcmp (slash+1, wanted[i])) 
       { 
        *slash = 0; 
        printf ("%s\n", token); 
        break; 
       } 
      } 
     } 
     token = strtok(NULL, " "); 
    } 

    return 0; 
}

Output:

John 
oil 
industry

I did not bother to capitalize the words as in your desired output. That is a trivial addendum, and you should be able to work it out yourself.
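
The capitalization step mentioned above could look roughly like this. It is only a sketch, written in C++; capitalize is a hypothetical helper name and not part of the answer. In the C program the same effect would come from a toupper call on token[0] just before the printf.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the answer above): returns a copy of `word`
// with its first character upper-cased; the rest is left unchanged.
std::string capitalize(std::string word)
{
    if (!word.empty())
    {
        // Cast to unsigned char first: std::toupper has undefined behaviour
        // when given a negative char value.
        word[0] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(word[0])));
    }
    return word;
}
```

With this, capitalize("oil") gives "Oil", while "John" comes back unchanged because it already starts with a capital letter.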

Source

2015-11-07 00:26:02 usr2564301

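If it is all about printing, then try this approach. It uses a regular expression in the search function to check for the pattern \/NN[A-Z]{0,3}, i.e. /NN followed by 0 to 3 capital letters, and captures, with (), the \\w+ word in front of it.

This is untested, but: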

#include <regex> 
#include <iostream> 

int main() 
{ 
    const std::string s = "John/NNP works/VBZ in/IN oil/NN industry/NN ./."; 
    std::regex rgx("(\\w+)\/NN[A-Z]{0,3}"); 
    std::smatch match; 

    while (std::regex_search(s, match, rgx)) 
     std::cout << "match: " << match[1] << '\n'; 
}

Source

2015-11-06 09:12:04 Identity1

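1: It only shows "John" as output. I am trying to parse all four noun tag types: /NNP, /NN, /NNS and /NNPS. Thanks for the code snippet anyway. I get the idea. Let me keep trying. Thanks. –

The while loop keeps printing "John" over and over. –

Yes, I have not done regular expressions in C++, so I am not sure how to make the search global. – Identity1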
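
One way to make the search "global" is std::sregex_iterator, which steps through every non-overlapping match instead of re-running regex_search on the same unchanged string. What follows is a minimal sketch rather than a tested fix of the answer: extract_nouns is a hypothetical name, and the pattern /NNP?S?\b is a tightened variant of the one above, assuming only the four tags NN, NNP, NNS and NNPS are wanted.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every word tagged NN, NNP, NNS or NNPS.
std::vector<std::string> extract_nouns(const std::string& text)
{
    // (\w+)    captures the word in front of the slash
    // /NNP?S?  matches exactly NN, NNP, NNS or NNPS
    // \b       requires the tag to end there (rejects e.g. /NNX)
    static const std::regex rgx(R"((\w+)/NNP?S?\b)");

    std::vector<std::string> nouns;
    // Each increment resumes the search after the previous match,
    // so the loop ends once the input is exhausted.
    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end;
         it != end; ++it)
    {
        nouns.push_back((*it)[1].str());
    }
    return nouns;
}
```

Called on the sentence from the question, extract_nouns returns John, oil and industry, in that order; printing them is then a plain loop over the vector.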

regex_token_iterator might help:

#include <algorithm> 
#include <iostream> 
#include <iterator> 
#include <regex> 
#include <string> 

int main() 
{ 
    std::string input = "John/NNP works/VBZ in/IN oil/NN industry/NN ABC/NNPS ./."; 

    // This regex has a capture group () that looks for a sequence of word characters 
    // followed by /NN, which is matched but not captured 
    std::regex nouns_re("(\\w+)\\/NN"); 

    // We pass 1 as the final argument to the token iterator 
    // because we just want to print the captured word and not the /NN part 
    std::copy(std::sregex_token_iterator(input.begin(), input.end(), nouns_re, 1), 
       std::sregex_token_iterator(), 
       std::ostream_iterator<std::string>(std::cout, "\n")); 
}

Source

2015-11-06 23:52:45
