2013-10-27 106 views
0

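I'm trying to read a text file and load it, string by string, into a vector of strings. I need to stop at the end of each sentence and then pick out keywords within that sentence. I know how to find the keywords, but I don't know how to make the input stop at the end of a sentence. I'm using a while loop to check each line, and I've considered a series of if statements to recognize the end of a sentence, such as: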

if(std::vector<string>::iterator i == ".") i == "\0" 

The code I have so far to fill the vector is:

std::string c; 
ifstream infile; 
infile.open("example.txt"); 
while(infile >> c){ 
    a.push_back(c); 
} 




OK, so I've come up with a way to load the text file into one token per word, treating " " as the delimiter, together with a list of special-case words:

const int MAX_PER_LINE = 512;
const int MAX_TOK = 20;
const char* const DELIMETER = " -";
const char* const SPECIAL = "!?.";
const char* const ignore[] = {"Mr.", "Ms.", "Mrs.", "sr.", "Ave.", "Rd."};

Then:

if (!file.good()) {
    return 1;
}
// parsing algorithm paraphrased from cs.dvc.edu/HowTo_Parse.html
char line[MAX_PER_LINE];
// Loop on getline itself rather than on !file.eof(), so that a failed
// read ends the loop instead of processing a stale or empty line.
while (file.getline(line, MAX_PER_LINE)) {
    int n = 0;
    const char* token[MAX_TOK] = {};
    // strtok modifies `line` in place and returns one token per call.
    token[0] = strtok(line, DELIMETER);
    if (token[0]) {
        for (n = 1; n < MAX_TOK; ++n) {
            token[n] = strtok(0, DELIMETER);
            if (!token[n]) break;
        }
    }
    for (int i = 0; i < n; ++i) {
        cout << "Token[" << i << "] =" << token[i] << endl;
        cout << endl;
    }
}

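Now I'm looking for what to put in an if statement so that it checks each token for a special case, or, if a token follows a token with a special case, loads it into a new set of tokens. I mostly know the pseudocode, but I don't know what syntax to use for: if (token[i] contains a special case, or token[i] has nothing before it (the first token), or it is capitalized and follows a token with a special case), load it into a new token.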

Any help would be greatly appreciated.

+0

The end of a sentence is usually attached to a word. It won't show up as its own string in the vector. –

+0

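Good luck with this; it is a non-trivial task. @DavidSchwartz has already given a hint that is easy to implement, but it will still go wrong some of the time, for example on sentences containing abbreviations. Recognizing "Mr. Huang went to 119 S. Broad St." as a single sentence rather than three is not so easy. –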

+0

Hmm. I see what you mean. Then I really don't know how to go about it. – user2325795

Answers

0

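Finding words that end in a period is fairly easy: just check whether word.back() == '.'. Calling back() on an empty string is undefined behavior, so you also need to check word.empty() first. If your compiler doesn't support C++11, you can use word[word.size() - 1] == '.' instead.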

Here is a basic example that naively splits sentences at words ending with '.':

#include <iostream> 
#include <string> 
#include <vector> 

int main(int argc, char** argv) { 
    if (argc == 1) { 
     std::cerr << "Usage: " << argv[0] << " [text to split]\n" 
      << "Splits the input text into one sentence per line." << std::endl; 
     return 1; 
    } 

    std::vector<std::string> sentences; 
    std::string current_sentence; 
    for (int i = 1; i < argc; ++i) { 
     std::string word(argv[i]); 
     current_sentence.append(word); 
     current_sentence.push_back(' '); 
     /* use word.back() == '.' for C++11 */ 
     if (!word.empty() && word[word.size() - 1] == '.') { 
      sentences.push_back(current_sentence); 
      current_sentence.clear(); 
     } 
    } 
    if (!current_sentence.empty()) { 
     sentences.push_back(current_sentence); 
    } 

    for (size_t i = 0; i < sentences.size(); ++i) { 
     std::cout << sentences[i] << std::endl; 
    } 
    return 0; 
} 

Run it like this:

$ g++ test.cpp 
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond. 
This is a test. 
And a second sentence. 
So we meet again Mr. 
Bond. 

Note how it thinks "Mr." is the end of a sentence.

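I don't know of a clever way to handle this, but one (fragile) option is to create a list of words that do not end a sentence, then check whether the word is in that list, like this: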

#include <algorithm>
#include <cctype>
#include <iostream> 
#include <set> 
#include <string> 
#include <vector> 

const std::string tmp[] = { 
    "dr.", 
    "mr.", 
    "mrs.", 
    "ms.", 
    "rd.", 
    "st." 
}; 
const std::set<std::string> ABBREVIATIONS(tmp, tmp + sizeof(tmp)/sizeof(tmp[0])); 

bool has_period(const std::string& word) { 
    return !word.empty() && word[word.size() - 1] == '.'; 
} 

bool is_abbreviation(std::string word) { 
    /* Convert to lowercase, so we don't need to check every possible 
    * variation of each word. Remove this (and update the set initialization) 
    * if you don't care about handling poor grammar. */ 
    std::transform(word.begin(), word.end(), word.begin(), ::tolower); 

    /* Check if the word is an abbreviation. */ 
    return ABBREVIATIONS.find(word) != ABBREVIATIONS.end(); 
} 

int main(int argc, char** argv) { 
    if (argc == 1) { 
     std::cerr << "Usage: " << argv[0] << " [text to split]\n" 
      << "Splits the input text into one sentence per line." << std::endl; 
     return 1; 
    } 

    std::vector<std::string> sentences; 
    std::string current_sentence; 
    for (int i = 1; i < argc; ++i) { 
     std::string word(argv[i]); 
     current_sentence.append(word); 
     current_sentence.push_back(' '); 
     if (has_period(word) && !is_abbreviation(word)) { 
      sentences.push_back(current_sentence); 
      current_sentence.clear(); 
     } 
    } 
    if (!current_sentence.empty()) { 
     sentences.push_back(current_sentence); 
    } 

    for (size_t i = 0; i < sentences.size(); ++i) { 
     std::cout << sentences[i] << std::endl; 
    } 
    return 0; 
} 

In C++11, you could use unordered_set for better efficiency, the simpler std::string::back, and easier initialization (std::set<std::string> PERIOD_WORDS = { "dr.", "mr.", "mrs." /*etc.*/ }).
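As a rough illustration of that C++11 variant, the abbreviation check could look something like the sketch below. The helper split_sentences is a hypothetical name introduced here so the logic can be called on any list of words; it is not part of the programs above, and the lambda around std::tolower is my own substitution for ::tolower.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Same abbreviation list as above, stored in a hash set and
// initialized with a C++11 initializer list.
const std::unordered_set<std::string> PERIOD_WORDS = {
    "dr.", "mr.", "mrs.", "ms.", "rd.", "st."
};

bool has_period(const std::string& word) {
    // back() is undefined on an empty string, so check empty() first.
    return !word.empty() && word.back() == '.';
}

bool is_abbreviation(std::string word) {
    // Lowercase a copy; the unsigned char parameter keeps std::tolower well defined.
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return PERIOD_WORDS.count(word) != 0;
}

// Hypothetical helper (assumption, not in the answer): group words into
// sentences, keeping the trailing space the programs above also produce.
std::vector<std::string> split_sentences(const std::vector<std::string>& words) {
    std::vector<std::string> sentences;
    std::string current;
    for (const std::string& word : words) {
        current += word;
        current += ' ';
        if (has_period(word) && !is_abbreviation(word)) {
            sentences.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) sentences.push_back(current);
    return sentences;
}
```

From main, the words could be passed as std::vector<std::string>(argv + 1, argv + argc). Compile with -std=c++11 or later.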

Running this version:

$ g++ test.cpp 
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond. 
This is a test. 
And a second sentence. 
So we meet again Mr. Bond. 

Of course, it still can't catch any case we haven't explicitly programmed in:

$ ./a.out Example Ave. is just north of here. 
Example Ave. 
is just north of here. 

Even if we add that one, detecting a case like "I live on Example Ave.", where the sentence ends with an abbreviation, is very hard. Still, I hope this helps.


Edit: I just read the sentence breaking Wikipedia article linked in the comments, and it would be fairly easy to incorporate this rule:

(c) If the next token is capitalized, then it ends a sentence.

Something like:

#include <algorithm>
#include <cctype>
#include <iostream> 
#include <set> 
#include <string> 
#include <vector> 

const std::string tmp[] = { 
    "ave.", 
    "dr.", 
    "mr.", 
    "mrs.", 
    "ms.", 
    "rd.", 
    "st." 
}; 
const std::set<std::string> PERIOD_WORDS(tmp, tmp + sizeof(tmp)/sizeof(tmp[0])); 

bool has_period(const std::string& word) { 
    return !word.empty() && word[word.size() - 1] == '.'; 
} 

bool is_abbreviation(std::string word) { 
    /* Convert to lowercase, so we don't need to check every possible 
    * variation of each word. Remove this (and update the set initialization) 
    * if you don't care about handling poor grammar. */ 
    std::transform(word.begin(), word.end(), word.begin(), ::tolower); 

    /* Check if the word is a word that ends with a period. */ 
    return PERIOD_WORDS.find(word) != PERIOD_WORDS.end(); 
} 

bool is_capitalized(const std::string& word) { 
    return !word.empty() && std::isupper(word[0]); 
} 

int main(int argc, char** argv) { 
    if (argc == 1) { 
     std::cerr << "Usage: " << argv[0] << " [text to split]\n" 
      << "Splits the input text into one sentence per line." << std::endl; 
     return 1; 
    } 

    std::vector<std::string> sentences; 
    std::string current_sentence; 
    for (int i = 1; i < argc; ++i) { 
     std::string word(argv[i]); 
     std::string next_word(i + 1 < argc ? argv[i + 1] : ""); 
     current_sentence.append(word); 
     current_sentence.push_back(' '); 
     if (next_word.empty()
      || (has_period(word)
       && (!is_abbreviation(word) || is_capitalized(next_word)))) {
      sentences.push_back(current_sentence); 
      current_sentence.clear(); 
     } 
    } 

    for (size_t i = 0; i < sentences.size(); ++i) { 
     std::cout << sentences[i] << std::endl; 
    } 
    return 0; 
} 

Now even cases like this work:

$ ./a.out Example Ave. is just north of here. I live on Example Ave. Test test test. 
Example Ave. is just north of here. 
I live on Example Ave. 
Test test test. 

But it still can't handle some cases:

$ ./a.out Mr. Adams lives on Example Ave. Example Ave. is just north of here. I live on Example Ave. Test test test. 
Mr. 
Adams lives on Example Ave. 
Example Ave. is just north of here. 
I live on Example Ave. 
Test test test. 
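The "Mr. Adams" split happens because "Mr." is an abbreviation followed by a capitalized word, which the new rule treats as a sentence end. One possible refinement, which is my own assumption and not something the rules above prescribe, is to keep a separate list of titles that are treated as never ending a sentence. The names is_title and ends_sentence below are hypothetical, and the split between the two lists is a guess ("Dr." could also mean "Drive", for instance), so this is still a heuristic.

```cpp
#include <cctype>
#include <set>
#include <string>

std::string to_lower(std::string s) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    }
    return s;
}

// ASSUMPTION: titles are treated as never ending a sentence.
bool is_title(const std::string& word) {
    static const char* const list[] = {"dr.", "mr.", "mrs.", "ms."};
    static const std::set<std::string> titles(
        list, list + sizeof(list) / sizeof(list[0]));
    return titles.count(to_lower(word)) != 0;
}

// Other abbreviations, which may or may not end a sentence.
bool is_abbreviation(const std::string& word) {
    static const char* const list[] = {"ave.", "rd.", "st."};
    static const std::set<std::string> abbreviations(
        list, list + sizeof(list) / sizeof(list[0]));
    return abbreviations.count(to_lower(word)) != 0;
}

bool has_period(const std::string& word) {
    return !word.empty() && word[word.size() - 1] == '.';
}

bool is_capitalized(const std::string& word) {
    return !word.empty() && std::isupper(static_cast<unsigned char>(word[0]));
}

// Decide whether `word` closes a sentence, given the word that follows it
// (pass "" for next_word when `word` is the last word of the input).
bool ends_sentence(const std::string& word, const std::string& next_word) {
    if (next_word.empty()) return true;   // last word of the input
    if (!has_period(word)) return false;  // no period, no sentence end
    if (is_title(word)) return false;     // "Mr. Adams" stays together
    return !is_abbreviation(word) || is_capitalized(next_word);
}
```

The condition in the main loop would then become if (ends_sentence(word, next_word)). A street abbreviation followed by a capitalized word that is not a new sentence would still be split incorrectly.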
+0

That's it, thank you. – user2325795

+0

How do you use argv[i] like that? – user2325795

+0

0.o How do you even load that text into the program? – user2325795

2

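Writing your own sentence splitter works for small projects or projects without any internationalization. For a more advanced solution based on text boundaries, I recommend ICU's BreakIterator. Following the unicode.org standards, it provides character, word, line-break, and sentence boundaries. ICU has an open-source library for C++ (and, I think, Java). See this page, which links to the library's download page.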

This avoids reinventing the wheel and avoids potential problems later on. Most leading publishing software products, such as QuarkXPress, use this library.

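Edit: I tried to find a quick tutorial on using ICU's BreakIterator for sentence boundaries, but I only found a word-boundary example. (Computing sentence boundaries is very similar and probably just needs createSentenceInstance in place of createWordInstance below.)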

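#include <cstdio>
#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>

using namespace icu;

void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
    if (U_FAILURE(status)) {
        // Creation failed (for example, missing ICU data); nothing to iterate.
        return;
    }

    bi->setText(s);
    // Walk the boundaries from the start of the text until DONE.
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}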