How to use boost::spirit to parse UTF-8?

#include <algorithm> 
#include <iostream> 
#include <string> 
#include <vector> 

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout 

#include <boost/spirit/include/qi.hpp> 
#include <boost/spirit/include/qi_parse.hpp> 
#include <boost/spirit/include/support_standard_wide.hpp> 

void parse_simple_string() 
{ 
    namespace qi = boost::spirit::qi;  
    namespace encoding = boost::spirit::unicode; 
    //namespace stw = boost::spirit::standard_wide; 

    typedef std::wstring::const_iterator iterator_type; 

    std::vector<std::wstring> result; 
    std::wstring const input = LR"(12,3","ab,cd","G,G\"GG","kkk","10,\"0","99987","PPP","你好)"; 

    qi::rule<iterator_type, std::wstring()> key = +(qi::unicode::char_ - qi::lit(L"\",\"")); 
    qi::phrase_parse(input.begin(), input.end(), 
        key % qi::lit(L"\",\""), 
        encoding::space, 
        result); 

    //std::copy(result.rbegin(), result.rend(), std::ostream_iterator<std::wstring, wchar_t> (std::wcout, L"\n")); 
    for(auto const &data : result) std::wcout<<data<<std::endl; 
}

I studied the post "How to use Boost Spirit to parse Chinese(unicode utf-16)?" and followed its guidance, but the code fails to parse the word "你好".

The expected result is

12,3 ab,cd G,G\"GG kkk 10,\"0 PPP 你好

but the actual result is

12,3 ab,cd G,G\"GG kkk 10,\"0 PPP

The Chinese word "你好" fails to parse.

The OS is 64-bit Windows 7, and my editor saves the source file as UTF-8.

2012-12-03 StereoMatching

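I'm confused. Are you... using UTF8? Why the wstring, then? (UTF8 is an encoding made of single/double/triple byte sequences, right?) I don't feel qualified to explain it any better, but that's my impression – sehe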

It's 1-4 bytes. But yes, this is a fairly glaring mismatch. Until `char8_t` is introduced, `char` is the UTF-8 type of choice for most people. – Puppy
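To make the "1-4 bytes" point concrete, here is a minimal sketch that reads the length of a UTF-8 sequence off its first byte. The helper name `utf8_sequence_length` is made up for illustration and is not part of Boost or the standard library. It classifies by bit pattern only, so it does not reject lead bytes that are invalid for other reasons (for example 0xC0 or 0xF5).

```cpp
#include <cstddef>

// Hypothetical helper (not from Boost or the standard library):
// returns how many bytes the UTF-8 sequence starting with `lead` occupies,
// or 0 if `lead` is a continuation byte or matches no lead-byte pattern.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx (most CJK characters)
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 11111xxx
}
```

A `wchar_t` on Windows is a 16-bit unit, so a wide-character parser fed UTF-8 source bytes does not see units of the width it expects.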

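What everybody said. `wstring` is wrong when working with UTF-8. If you want to properly encode a UTF-8 literal, *especially* on Windows, the safest way is to use the C++11 `u8"blah"` literals (which aren't in Visual Studio yet) or to encode the text directly with the right byte escapes, i.e. "\xE4\xBD\xA0\xE5\xA5\xBD" instead of "你好". –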
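As a small illustration of the byte-escape suggestion, the sketch below stores "你好" as explicit UTF-8 bytes in a narrow `std::string`, so the result does not depend on how the editor or compiler interprets the source file. The function name `ni_hao_utf8` is invented for this example.

```cpp
#include <string>

// Hypothetical helper: "你好" written as explicit UTF-8 bytes.
// Each character takes three bytes: E4 BD A0 (你) and E5 A5 BD (好).
std::string ni_hao_utf8()
{
    return "\xE4\xBD\xA0\xE5\xA5\xBD";
}
```

The string holds six `char` code units for two characters, which is why a narrow `char` type, not `wchar_t`, is the natural container for UTF-8.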

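If your input is UTF-8, you can try the Unicode Iterators from Boost.Regex.

For example, use boost::u8_to_u32_iterator:

A bidirectional iterator adapter that makes an underlying sequence of UTF8 characters look like a (read-only) sequence of UTF32 characters.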

live demo

#include <boost/regex/pending/unicode_iterator.hpp> 
#include <boost/spirit/include/qi.hpp> 
#include <boost/range.hpp> 
#include <iterator> 
#include <iostream> 
#include <ostream> 
#include <cstdint> 
#include <vector> 

int main() 
{ 
    using namespace boost; 
    using namespace spirit::qi; 
    using namespace std; 

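    // utf8_text binds to the whole char array, so begin/end also cover the
    // terminating '\0'; that is why the output ends with &#0;.
    // (Since C++20, u8 literals are char8_t arrays and no longer fit const char*.)
    auto &&utf8_text=u8"你好，世界！";
    u8_to_u32_iterator<const char*>
     tbegin(begin(utf8_text)), tend(end(utf8_text));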

    vector<uint32_t> result; 
    parse(tbegin, tend, *standard_wide::char_, result); 
    for(auto &&code_point : result) 
     cout << "&#" << code_point << ";"; 
    cout << endl; 
}

Output is:

&#20320;&#22909;&#65292;&#19990;&#30028;&#65281;&#0;
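For readers who want to see roughly what such an adapter does, here is a standard-library-only sketch that decodes a UTF-8 byte string into UTF-32 code points. It is not Boost's implementation, and `decode_utf8` is a hypothetical name. Validation is deliberately minimal: bad lead bytes, truncated sequences and malformed continuation bytes are rejected, but overlong encodings and surrogates are not checked.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost: decode UTF-8 bytes into code points.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;

        // The lead byte gives the sequence length and the top payload bits.
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Calling `decode_utf8("\xE4\xBD\xA0\xE5\xA5\xBD")` yields the code points 20320 and 22909, the same first two values printed in the output above.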

2013-04-04 19:53:32

+1 Nice work, sir – sehe
