boost :: tokenizer vs boost :: split

我想解析每個'^'字符的C++字符串到一個向量標記。我一直使用boost :: split方法，但我現在正在編寫性能關鍵代碼，並想知道哪一個可以提供更好的性能。boost :: tokenizer vs boost :: split

例如：

string message = "A^B^C^D"; 
vector<string> tokens; 
boost::split(tokens, message, boost::is_any_of("^"));

與

boost::char_separator<char> sep("^"); 
boost::tokenizer<boost::char_separator<char> > tokens(text, sep);

哪一個會提供更好的性能，爲什麼？

謝謝。

來源

2011-10-28 czchlong

+32

簡介它，你告訴我們。 –

第二個看起來好像沒有做任何內存分配，所以我猜想這會更快。但只有一種方法可以確定。 –

[Boost.Spirit]（http://www.boost.org/libs/spirit/）。[齊]（http://www.boost.org/libs/spirit/doc/html/spirit/qi/tutorials /quick_start.html）將大幅超越兩者。 – ildjarn

最好的選擇取決於幾個因素。如果您只需掃描一次令牌，那麼boost :: tokenizer在運行時和空間性能方面都是不錯的選擇（這些令牌向量可能會佔用大量空間，具體取決於輸入數據）。

如果你要經常掃描令牌，或者需要一個高效的隨機訪問向量，那麼boost :: split可能是更好的選擇。

例如，在你的「A^B^C^...^Z」輸入字符串，其中令牌是長度爲1個字節，該方法boost::split/vector<string>會消耗至少 2 * N-1個字節。使用字符串存儲在大多數STL實現中的方式，您可以計算出其計數超過8倍。將這些字符串存儲在向量中的內存和時間是昂貴的。

我跑我的機器，並擁有1000萬個令牌類似的模式快速測試是這樣的：

的boost ::分裂= 2.5秒和〜620MB
的boost ::標記生成器= 0.9S和0MB

如果你只是在做一時間s可以的令牌，那麼顯然這個令牌化器是更好的。但是，如果您在應用程序的整個生命週期中粉碎成要重複使用的結構，則可能首選擁有一個令牌向量。

如果你想去的矢量路線，那麼我建議不要使用vector<string>，而是一個字符串::迭代器的向量。只需將其分成一對迭代器，並保留大量的令牌以供參考。例如：

using namespace std; 
vector<pair<string::const_iterator,string::const_iterator> > tokens; 
boost::split(tokens, s, boost::is_any_of("^")); 
for(auto beg=tokens.begin(); beg!=tokens.end();++beg){ 
    cout << string(beg->first,beg->second) << endl; 
}

這個改進版採用相同的服務器和測試上1.6秒和390MB。並且，該向量的所有內存開銷與令牌的數量成線性關係 - 不以任何方式依賴於令牌的長度，而是存儲每個令牌。

來源

2011-12-31 08:46:46 brandx

我發現使用clang++ -O3 -std=c++11 -stdlib=libc++的結果相當不同。

首先我提取用逗號沒有換行符分離成巨大的字符串〜470K字的文本文件中，像這樣：

path const inputPath("input.txt"); 

filebuf buf; 
buf.open(inputPath.string(),ios::in); 
if (!buf.is_open()) 
    return cerr << "can't open" << endl, 1; 

string str(filesystem::file_size(inputPath),'\0'); 
buf.sgetn(&str[0], str.size()); 
buf.close();

然後我跑的各種定時測試結果存儲到清除預尺寸矢量運行之間，例如，

void vectorStorage(string const& str) 
{ 
    static size_t const expectedSize = 471785; 

    vector<string> contents; 
    contents.reserve(expectedSize+1); 

    ... 

    { 
     timed _("split is_any_of"); 
     split(contents, str, is_any_of(",")); 
    } 
    if (expectedSize != contents.size()) throw runtime_error("bad size"); 
    contents.clear(); 

    ... 
}

作爲參考，定時器只是這樣的：

struct timed 
{ 
    ~timed() 
    { 
     auto duration = chrono::duration_cast<chrono::duration<double, ratio<1,1000>>>(chrono::high_resolution_clock::now() - start_); 

     cout << setw(40) << right << name_ << ": " << duration.count() << " ms" << endl; 
    } 

    timed(std::string name="") : 
     name_(name) 
    {} 


    chrono::high_resolution_clock::time_point const start_ = chrono::high_resolution_clock::now(); 
    string const name_; 
};

我也計時一次迭代（無向量）。下面是結果：

Vector: 
           hand-coded: 54.8777 ms 
         split is_any_of: 67.7232 ms 
        split is_from_range: 49.0215 ms 
           tokenizer: 119.37 ms 
One iteration: 
           tokenizer: 97.2867 ms 
          split iterator: 26.5444 ms 
      split iterator back_inserter: 57.7194 ms 
       split iterator char copy: 34.8381 ms

記號賦予這麼多慢比split，一個迭代數字甚至還不包括字符串拷貝：

{ 
    string word; 
    word.reserve(128); 

    timed _("tokenizer"); 
    boost::char_separator<char> sep(","); 
    boost::tokenizer<boost::char_separator<char> > tokens(str, sep); 

    for (auto range : tokens) 
    {} 
} 

{ 
    string word; 

    timed _("split iterator"); 
    for (auto it = make_split_iterator(str, token_finder(is_from_range(',', ','))); 
     it != decltype(it)(); ++it) 
    { 
     word = move(copy_range<string>(*it)); 
    } 
}

毫不含糊的結論：使用split。

來源

2012-12-22 16:35:48 frobenius

這可能取決於你的提升版本，以及如何你的功能。

我們有一些邏輯性能問題這是使用boost ::分裂1.41.0處理幾千或幾十萬小串（預計小於10代幣）。當我通過性能分析器運行代碼時，我們發現在boost :: split中花費了驚人的39％的時間。

我們嘗試了一些簡單的「修復」這並不會影響性能實質性像「我們知道我們不會有在每次通過10餘項，所以預設的矢量10個項目。」

由於我們實際上並不需要向量，只能迭代令牌並完成相同的工作，我們將代碼更改爲boost :: tokenize，代碼的相同部分降至運行時的1％。

來源

2015-09-21 20:42:38

boost :: tokenizer vs boost :: split

回答

相關問題