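Identifying the end of a sentence

I am trying to read a text file and load it, string by string, into a vector of strings. I need to stop at the end of each sentence and then pick out the keywords in that sentence. I know how to find the keywords, but not how to make it stop reading strings at the end of a sentence. I am using a while loop to check each line, and I was considering a series of if statements such as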

if(std::vector<string>::iterator i == ".") i == "\0"

The code I have so far to fill the vector is:

std::string c; 
ifstream infile; 
infile.open("example.txt"); 
while(infile >> c){ 
    a.push_back(c); 
}

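Alright, so I have come up with a way to load the text file and split it into one token per word, treating " " as a delimiter and keeping a list of special-case words: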

const int MAX_PER_LINE = 512;
const int MAX_TOK = 20;
const char* const DELIMETER = " -";
const char* const SPECIAL = "!?.";
const char* const ignore[] = {"Mr.", "Ms.", "Mrs.", "sr.", "Ave.", "Rd."};

Then:

if(!file.good()){
    return 1;
}
//parsing algorithm paraphrased from cs.dvc.edu/HowTo_Parse.html
char line[MAX_PER_LINE];
//loop on getline itself so that a failed read is never processed
while(file.getline(line, MAX_PER_LINE)){
    int n = 0;
    const char* token[MAX_TOK] = {};
    token[0] = strtok(line, DELIMETER);
    if(token[0]){
        for(n = 1; n < MAX_TOK; ++n){
            token[n] = strtok(0, DELIMETER);
            if(!token[n]) break;
        }
    }
    for(int i = 0; i < n; ++i){
        cout << "Token[" << i << "] =" << token[i] << endl;
        cout << endl;
    }
}

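Now I am looking for what to put in an if statement so that it checks each token for a special case, or, if a token follows one that has a special case, loads them into a new set of tokens. I mostly know the pseudocode, but I don't know what syntax to use for: if (token[i] contains a special case, or token[i] has nothing before it (it is the first token), or it is capitalized and follows a special-case token), load it into a new token.

Any help would be greatly appreciated.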
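One possible shape for the check described above, as a sketch only. It assumes the SPECIAL characters and the abbreviation list from the question; the helper names (ends_with_special, is_ignored, is_capitalized, ends_sentence) and the array name ignore_list are hypothetical and not part of the original code.

```cpp
#include <cctype>
#include <cstring>

// Characters that may end a sentence (SPECIAL in the question).
const char* const SPECIAL = "!?.";
// Abbreviations that end in a period but do not end a sentence
// (hypothetical name; the question calls this array `ignore`).
const char* const ignore_list[] = {"Mr.", "Ms.", "Mrs.", "sr.", "Ave.", "Rd."};

// True if the token's last character is one of the SPECIAL characters.
bool ends_with_special(const char* token) {
    std::size_t len = std::strlen(token);
    return len > 0 && std::strchr(SPECIAL, token[len - 1]) != 0;
}

// True if the token exactly matches a listed abbreviation (case-sensitive).
bool is_ignored(const char* token) {
    const std::size_t count = sizeof(ignore_list) / sizeof(ignore_list[0]);
    for (std::size_t i = 0; i < count; ++i) {
        if (std::strcmp(token, ignore_list[i]) == 0) return true;
    }
    return false;
}

// True if the token starts with an uppercase letter.
bool is_capitalized(const char* token) {
    return token[0] != '\0'
        && std::isupper(static_cast<unsigned char>(token[0])) != 0;
}

// True if the token ends in a SPECIAL character and is not a listed abbreviation.
bool ends_sentence(const char* token) {
    return ends_with_special(token) && !is_ignored(token);
}
```

Inside the token loop this could be used as `if (ends_sentence(token[i])) { ... }`, with `is_capitalized(token[i + 1])` as an extra check on the following token when one exists.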

Source

2013-10-27 user2325795

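The end of a sentence is usually attached to a word. It won't show up in the vector as its own string. –

Good for you for doing this; it is a non-trivial task. @DavidSchwartz has already given a hint that is easy to follow, but it will still go wrong some of the time, for example with sentences that contain abbreviations. Recognizing "Mr. Huang went to 119 S. Broad St." as a single sentence rather than three is not so easy. –

Hmm. I see what you mean. Then I really don't know how to go about it. – user2325795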
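To illustrate the point made in the comments, here is a small sketch of whitespace splitting with the stream extraction operator, the same mechanism as the `infile >> c` loop in the question. The function name split_words is hypothetical. Punctuation stays attached to the preceding word, so "." never appears as a separate element.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits text on whitespace the same way `infile >> c` does.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);  // e.g. "test." keeps its trailing period
    }
    return words;
}
```

For the input "This is a test. And more" the fourth element is "test.", period included, which is why the check has to look at the last character of each word rather than compare a whole element to ".".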

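Finding words that end in a period is very easy: just check whether word.back() == '.'. Calling back() on an empty string is undefined behavior, so you also need to check word.empty() first. If your compiler doesn't support C++11, you can use word[word.size() - 1] == '.' instead.

Here is a basic example that naively splits sentences at words ending in ".":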

#include <iostream> 
#include <string> 
#include <vector> 

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* use word.back() == '.' for C++11 */
        if (!word.empty() && word[word.size() - 1] == '.') {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }
    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}

Run it like this:

$ g++ test.cpp 
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond. 
This is a test. 
And a second sentence. 
So we meet again Mr. 
Bond.

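Notice how it thinks "Mr." is the end of a sentence.

I don't know of a clever way to handle this, but one (fragile) option is to build a list of words that do not end a sentence, and then check whether the word is in that list, like so: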

#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>

const std::string tmp[] = { 
    "dr.", 
    "mr.", 
    "mrs.", 
    "ms.", 
    "rd.", 
    "st." 
}; 
const std::set<std::string> ABBREVIATIONS(tmp, tmp + sizeof(tmp)/sizeof(tmp[0])); 

bool has_period(const std::string& word) { 
    return !word.empty() && word[word.size() - 1] == '.'; 
} 

bool is_abbreviation(std::string word) { 
    /* Convert to lowercase, so we don't need to check every possible 
    * variation of each word. Remove this (and update the set initialization) 
    * if you don't care about handling poor grammar. */ 
    std::transform(word.begin(), word.end(), word.begin(), ::tolower); 

    /* Check if the word is an abbreviation. */ 
    return ABBREVIATIONS.find(word) != ABBREVIATIONS.end(); 
} 

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        if (has_period(word) && !is_abbreviation(word)) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }
    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}

In C++11 you could use unordered_set for better efficiency, the simpler std::string::back, and an easier initialization (std::set<std::string> PERIOD_WORDS = { "dr.", "mr.", "mrs." /*etc.*/ }).

Running this version:
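As a rough sketch of that C++11 variant, assuming the same abbreviation list as the example above: the helper names are carried over from that example, and the lambda around std::tolower is an addition here to avoid passing a negative char value.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Brace initialization replaces the temporary array used in the C++03 version.
const std::unordered_set<std::string> PERIOD_WORDS = {
    "dr.", "mr.", "mrs.", "ms.", "rd.", "st."
};

bool has_period(const std::string& word) {
    // back() on an empty string is undefined behavior, so check empty() first.
    return !word.empty() && word.back() == '.';
}

bool is_abbreviation(std::string word) {
    // Lowercase a copy of the word so that "Mr." and "mr." both match.
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return PERIOD_WORDS.count(word) != 0;
}
```

The main function from the example above should work unchanged with these two helpers when compiled with -std=c++11 or later.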

$ g++ test.cpp 
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond. 
This is a test. 
And a second sentence. 
So we meet again Mr. Bond.

Of course, it still fails to catch any case we haven't explicitly programmed in:

$ ./a.out Example Ave. is just north of here. 
Example Ave. 
is just north of here.

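Even if we add that one, cases like "I live on Example Ave.", where the sentence ends with an abbreviation, are very hard to detect. Still, I hope this is helpful.

Edit: I just read the sentence breaking Wikipedia article linked in the comments, and this rule would be fairly easy to incorporate:

(c) If the next token is capitalized, then it ends a sentence.

Something like: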

#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>

const std::string tmp[] = { 
    "ave.", 
    "dr.", 
    "mr.", 
    "mrs.", 
    "ms.", 
    "rd.", 
    "st." 
}; 
const std::set<std::string> PERIOD_WORDS(tmp, tmp + sizeof(tmp)/sizeof(tmp[0])); 

bool has_period(const std::string& word) { 
    return !word.empty() && word[word.size() - 1] == '.'; 
} 

bool is_abbreviation(std::string word) { 
    /* Convert to lowercase, so we don't need to check every possible 
    * variation of each word. Remove this (and update the set initialization) 
    * if you don't care about handling poor grammar. */ 
    std::transform(word.begin(), word.end(), word.begin(), ::tolower); 

    /* Check if the word is a word that ends with a period. */ 
    return PERIOD_WORDS.find(word) != PERIOD_WORDS.end(); 
} 

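bool is_capitalized(const std::string& word) {
    /* Cast first: passing a negative char to isupper is undefined behavior. */
    return !word.empty() && std::isupper(static_cast<unsigned char>(word[0]));
}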

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        std::string next_word(i + 1 < argc ? argv[i + 1] : "");
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* Parentheses make the existing && / || precedence explicit. */
        if (next_word.empty()
            || (has_period(word)
                && (!is_abbreviation(word) || is_capitalized(next_word)))) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}

Now even cases like this work:

$ ./a.out Example Ave. is just north of here. I live on Example Ave. Test test test. 
Example Ave. is just north of here. 
I live on Example Ave. 
Test test test.

But it still can't handle some cases:

$ ./a.out Mr. Adams lives on Example Ave. Example Ave. is just north of here. I live on Example Ave. Test test test. 
Mr. 
Adams lives on Example Ave. 
Example Ave. is just north of here. 
I live on Example Ave. 
Test test test.

Source

2013-10-27 01:48:31

This is it, thank you. – user2325795

How do you use argv[i] like that? – user2325795

0.o How do you even load that text into the program? – user2325795

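Writing your own sentence splitter is fine for small projects or for projects without any internationalization. For an advanced solution based on text boundaries, I recommend ICU's BreakIterator. Based on the unicode.org standards, it provides character, word, line-break, and sentence boundaries. There are open-source libraries for C++ (and, I believe, Java). See this page, which links to the library's download page.

This avoids reinventing the wheel and avoids potential problems later. Most leading publishing software products, such as QuarkXPress, use this library.

Edit: I tried to find a quick tutorial on using ICU's BreakIterator for sentence boundaries, but I only found an example for word boundaries. (Computing sentence boundaries is very similar; it probably just needs createSentenceInstance in place of createWordInstance below.)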

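#include <stdio.h>
#include <unicode/brkiter.h>
#include <unicode/unistr.h>

using namespace icu;

void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
    if (U_FAILURE(status)) {
        /* creation failed, so the iterator must not be used */
        delete bi;
        return;
    }

    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}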

Source

2013-10-27 02:26:16 Ashok
