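Finding out whether a word ends in a period is easy: just check whether word.back() == '.'. Calling back() on an empty string is undefined behavior, so you also need to check word.empty() first. If your compiler doesn't support C++11, you can use word[word.size() - 1] == '.' instead.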
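As a minimal sketch of that check in its C++11 form (the helper name ends_with_period is only for illustration), with the empty-string guard placed first:

```cpp
#include <string>

// Returns true if `word` is non-empty and its last character is '.'.
// The empty() check must come first: back() on an empty string is
// undefined behavior. std::string::back() requires C++11.
bool ends_with_period(const std::string& word) {
    return !word.empty() && word.back() == '.';
}
```

Because && short-circuits, back() is never evaluated for an empty string.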
Here's a basic example that naively splits sentences at words ending in '.':
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;

    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* use word.back() == '.' for C++11 */
        if (!word.empty() && word[word.size() - 1] == '.') {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }

    /* Keep any trailing words that did not end with a period. */
    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}
Running it:
$ g++ test.cpp
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond.
This is a test.
And a second sentence.
So we meet again Mr.
Bond.
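Note how it thinks "Mr." ends a sentence.
I don't know of a clever way to handle this, but one (fragile) option is to create a list of words that are not sentence endings and then check whether the word is in that list, like this: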
#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>

const std::string tmp[] = {
    "dr.",
    "mr.",
    "mrs.",
    "ms.",
    "rd.",
    "st."
};
const std::set<std::string> ABBREVIATIONS(tmp, tmp + sizeof(tmp)/sizeof(tmp[0]));

bool has_period(const std::string& word) {
    return !word.empty() && word[word.size() - 1] == '.';
}

bool is_abbreviation(std::string word) {
    /* Convert to lowercase, so we don't need to check every possible
     * variation of each word. Remove this (and update the set initialization)
     * if you don't care about handling poor grammar. */
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);
    /* Check if the word is an abbreviation. */
    return ABBREVIATIONS.find(word) != ABBREVIATIONS.end();
}

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;

    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* Only break on a period if the word is not a known abbreviation. */
        if (has_period(word) && !is_abbreviation(word)) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }

    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}
In C++11, you can make this more efficient with unordered_set, simpler with std::string::back, and easier to initialize (std::set<std::string> PERIOD_WORDS = { "dr.", "mr.", "mrs." /*etc.*/ }).
Running this version:
$ g++ test.cpp
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond.
This is a test.
And a second sentence.
So we meet again Mr. Bond.
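For reference, the C++11 changes mentioned above might look roughly like this. It is only a sketch of the two helper functions, assuming the same abbreviation list; the behavior is meant to be identical to the C++03 version:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Brace initialization replaces the temporary array used in the C++03 version.
const std::unordered_set<std::string> ABBREVIATIONS = {
    "dr.", "mr.", "mrs.", "ms.", "rd.", "st."
};

bool has_period(const std::string& word) {
    // back() is safe here because empty() is checked first.
    return !word.empty() && word.back() == '.';
}

bool is_abbreviation(std::string word) {
    // Lowercase a copy so that "Mr." and "MR." both match "mr.".
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ABBREVIATIONS.count(word) != 0;
}
```

The lambda takes unsigned char because passing a negative char value to std::tolower is undefined behavior.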
Of course, it still can't catch any case that we haven't explicitly programmed in:
$ ./a.out Example Ave. is just north of here.
Example Ave.
is just north of here.
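Even if we add that one, it is very hard to detect cases like "I live on Example Ave.", where the sentence ends with an abbreviation. Still, I hope this helps.
EDIT: I just read the comment linking to the sentence breaking Wikipedia article, and this rule would be fairly easy to incorporate:
(c) If the next token is capitalized, then it ends a sentence.
Something like: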
#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>

const std::string tmp[] = {
    "ave.",
    "dr.",
    "mr.",
    "mrs.",
    "ms.",
    "rd.",
    "st."
};
const std::set<std::string> PERIOD_WORDS(tmp, tmp + sizeof(tmp)/sizeof(tmp[0]));

bool has_period(const std::string& word) {
    return !word.empty() && word[word.size() - 1] == '.';
}

bool is_abbreviation(std::string word) {
    /* Convert to lowercase, so we don't need to check every possible
     * variation of each word. Remove this (and update the set initialization)
     * if you don't care about handling poor grammar. */
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);
    /* Check if the word is a word that ends with a period. */
    return PERIOD_WORDS.find(word) != PERIOD_WORDS.end();
}

bool is_capitalized(const std::string& word) {
    /* Cast to unsigned char: isupper is undefined for negative values. */
    return !word.empty() && std::isupper(static_cast<unsigned char>(word[0]));
}

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;

    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        std::string next_word(i + 1 < argc ? argv[i + 1] : "");
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* End the sentence at the last word, or at a word with a period that
         * is either not an abbreviation or is followed by a capitalized word. */
        if (next_word.empty()
            || (has_period(word)
                && (!is_abbreviation(word) || is_capitalized(next_word)))) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}
Now even cases like this work:
$ ./a.out Example Ave. is just north of here. I live on Example Ave. Test test test.
Example Ave. is just north of here.
I live on Example Ave.
Test test test.
But it still can't handle some cases:
$ ./a.out Mr. Adams lives on Example Ave. Example Ave. is just north of here. I live on Example Ave. Test test test.
Mr.
Adams lives on Example Ave.
Example Ave. is just north of here.
I live on Example Ave.
Test test test.
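One possible, and equally fragile, refinement for the "Mr. Adams" case is to treat titles differently from other abbreviations, since a title is normally followed by a capitalized name. The sketch below is an assumption layered on the code above, not a tested solution: the TITLES / OTHER_ABBREVIATIONS split and the ends_sentence and split_sentences helpers are hypothetical names, and it uses C++11 initializer lists.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Assumed split: titles never end a sentence; other abbreviations end one
// only when the next word is capitalized.
const std::set<std::string> TITLES = {"dr.", "mr.", "mrs.", "ms."};
const std::set<std::string> OTHER_ABBREVIATIONS = {"ave.", "rd.", "st."};

std::string to_lower(std::string word) {
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

bool has_period(const std::string& word) {
    return !word.empty() && word.back() == '.';
}

bool is_capitalized(const std::string& word) {
    return !word.empty() && std::isupper(static_cast<unsigned char>(word[0]));
}

// Decides whether `word` ends a sentence, given the word that follows it.
// An empty `next_word` means `word` is the last word of the input.
bool ends_sentence(const std::string& word, const std::string& next_word) {
    if (next_word.empty()) return true;
    if (!has_period(word)) return false;
    const std::string lower = to_lower(word);
    if (TITLES.count(lower) != 0) return false;
    if (OTHER_ABBREVIATIONS.count(lower) != 0) return is_capitalized(next_word);
    return true;
}

// Joins words with single spaces and cuts wherever ends_sentence() says so.
std::vector<std::string> split_sentences(const std::vector<std::string>& words) {
    std::vector<std::string> sentences;
    std::string current;
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::string next;
        if (i + 1 < words.size()) next = words[i + 1];
        if (!current.empty()) current.push_back(' ');
        current.append(words[i]);
        if (ends_sentence(words[i], next)) {
            sentences.push_back(current);
            current.clear();
        }
    }
    return sentences;
}
```

With the input from the failing run above, this keeps "Mr. Adams lives on Example Ave." together as one sentence. It trades one failure for others, though: "dr." is listed as a title here, so a sentence ending in "Example Dr." (Drive) would no longer be split, and "St." followed by a name (as in a saint's name) would still be split incorrectly.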
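The end of a sentence is usually attached to a word. It won't show up as its own string in the vector. –
This is worth doing, but it is a non-trivial task. @DavidSchwartz has already given a pointer that is easy to follow, but it will still get things wrong some of the time, for example with sentences that contain abbreviations. Recognizing "Mr. Huang went to 119 S. Broad St." as a single sentence rather than three is not so easy. –
Hmm. I see what you mean. Then I really don't know how to go about it. – user2325795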