Decoding UTF-8 entities to UTF-8 in C++

I have a string with UTF-8 entities (I'm not sure I named that correctly):

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441";

How can I convert it into something more readable? I spent several hours digging through std::codecvt with g++ and C++11 support, but got nowhere:

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441"; 

wstring_convert<codecvt_utf8_utf16<char16_t>,char16_t> convert; 
string dest = convert.to_bytes(std);

This produces a nightmarish error trace that begins with:

error: no matching function for call to ‘std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>::to_bytes(std::string&)

I hope there is another way.

Source

2016-11-24 James May

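What you are seeing are not entities but code points. You are defining the characters through Unicode escape sequences, and the compiler automatically converts them to UTF-8. The typical way to convert such a string to UTF-16 and back looks like this: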

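#include <codecvt>
#include <locale>
#include <string>

// Shared converter between UTF-8 (std::string) and UTF-16 held in wchar_t.
static std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;

// Wide string -> UTF-8 encoded narrow string.
std::string ws2s(const std::wstring &wstr) {
    return converter.to_bytes(wstr);
}

// UTF-8 encoded narrow string -> wide string.
std::wstring s2ws(const std::string &str) {
    return converter.from_bytes(str);
}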

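Of course, you cannot convert the original string into another string of the same type (std::string), because a single char cannot hold such characters. That is why the compiler converts the UTF-16 codes to UTF-8 in the first place.

Source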
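One way to check that the compiler has already turned the escapes into UTF-8 is to dump the bytes of the literal. The sketch below is only an illustration: `to_hex_bytes` is a hypothetical helper name, and the result assumes the execution character set is UTF-8 (the default for GCC and Clang).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the bytes of `s` as space-separated
// uppercase hex pairs, for example "D0 98".
std::string to_hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';       // separator between bytes
        out += digits[byte >> 4];     // high nibble
        out += digits[byte & 0x0F];   // low nibble
    }
    return out;
}
```

Under that assumption, each Cyrillic letter in the question's string occupies two bytes, so the six-letter literal has a size() of 12 rather than 6.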

2016-11-24 19:35:03

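I'm pretty sure these functions have no clue about the '\u' notation. – tadman

They don't need to. The compiler handles that, because it processes Unicode escape sequences in string literals. If the OP wanted to keep the raw Unicode escape sequences in the string, he would write '\\u0418' and so on (and my answer would be different). –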

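First of all, you are using std::wstring_convert backwards. You have a UTF-8 encoded std::string that you want to convert to a wide Unicode string. You get the compiler error because to_bytes() does not accept a std::string as input. It takes a std::wstring_convert::wide_string (which in your case is std::u16string, since you use char16_t in the specialization), so you need to call from_bytes() instead of to_bytes():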

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441"; 

std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
std::u16string dest = convert.from_bytes(std);
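To make the direction of each call explicit, here is a minimal sketch of both conversions wrapped in functions. The names `utf8_to_utf16` and `utf16_to_utf8` are made up for illustration. Note that std::wstring_convert and the <codecvt> facets are deprecated since C++17, so newer compilers may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes -> UTF-16 code units (from_bytes takes the narrow string).
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);
}

// UTF-16 code units -> UTF-8 bytes (to_bytes takes the wide string).
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);
}
```

Both functions throw std::range_error on malformed input, because the converter is constructed without an error string.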

Now, that being said, section 9 of the JSON specification states:

9 String

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.

\" represents the quotation mark character (U+0022).

\\ represents the reverse solidus character (U+005C).

\/ represents the solidus character (U+002F).

\b represents the backspace character (U+0008).

\f represents the form feed character (U+000C).

\n represents the line feed character (U+000A).

\r represents the carriage return character (U+000D).

\t represents the character tabulation character (U+0009).

So, for example, a string containing only a single reverse solidus character may be represented as " \\ ".

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u , followed by four hexadecimal digits that encode the code point. Hexadecimal digits can be digits (U+0030 through U+0039) or the hexadecimal letters A through F in uppercase (U+0041 through U+0046) or lowercase (U+0061 through U+0066). So, for example, a string containing only a single reverse solidus character may be represented as " \u005C ".

The following four cases all produce the same result:

" \u002F "

" \u002f "

" \/ "

" / "

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as " \uD834\uDD1E ".

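The raw JSON data itself may be encoded in UTF-8 (the most common encoding), UTF-16, and so on, but regardless of the encoding used, the character sequence "\u0418\u043d\u0434\u0435\u043a\u0441" represents the UTF-16 code unit sequence U+0418 U+043d U+0434 U+0435 U+043a U+0441, which is the Unicode string "Индекс".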

If you use an actual JSON parser (such as JSON for Modern C++, jsoncpp, RapidJSON, etc.), it will parse the UTF-16 code unit values for you and return a readable Unicode string.

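However, if you are processing the JSON data manually, you will have to decode any \x and \uXXXX escape sequences yourself. std::wstring_convert cannot do that for you. It can only convert the JSON from a std::string to a std::wstring/std::u16string, if that makes the data easier for you to parse. You still have to parse the content of the JSON separately.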
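As a rough idea of what manual decoding involves, the sketch below converts the escapes listed in the quoted section 9 straight to UTF-8, including surrogate pairs. It is not taken from any of the libraries mentioned. The function names are hypothetical, and it assumes the input is the body of a JSON string without the surrounding quotation marks.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Appends the UTF-8 encoding of code point `cp` to `out`.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Reads exactly four hex digits starting at s[pos].
std::uint32_t parse_hex4(const std::string& s, std::size_t pos) {
    if (pos + 4 > s.size()) throw std::invalid_argument("truncated escape");
    std::uint32_t value = 0;
    for (std::size_t i = pos; i < pos + 4; ++i) {
        const char c = s[i];
        value <<= 4;
        if (c >= '0' && c <= '9') value |= static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') value |= static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') value |= static_cast<std::uint32_t>(c - 'A' + 10);
        else throw std::invalid_argument("bad hex digit");
    }
    return value;
}

// Decodes JSON string escapes in `in` and returns the result as UTF-8.
std::string decode_json_escapes(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\') { out += in[i]; continue; }
        if (++i >= in.size()) throw std::invalid_argument("dangling backslash");
        switch (in[i]) {
            case '"':  out += '"';  break;
            case '\\': out += '\\'; break;
            case '/':  out += '/';  break;
            case 'b':  out += '\b'; break;
            case 'f':  out += '\f'; break;
            case 'n':  out += '\n'; break;
            case 'r':  out += '\r'; break;
            case 't':  out += '\t'; break;
            case 'u': {
                std::uint32_t cp = parse_hex4(in, i + 1);
                i += 4;  // i now sits on the last hex digit
                if (cp >= 0xD800 && cp <= 0xDBFF) {
                    // High surrogate: a low surrogate escape must follow.
                    if (i + 2 >= in.size() || in[i + 1] != '\\' || in[i + 2] != 'u')
                        throw std::invalid_argument("missing low surrogate");
                    const std::uint32_t low = parse_hex4(in, i + 3);
                    if (low < 0xDC00 || low > 0xDFFF)
                        throw std::invalid_argument("invalid low surrogate");
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                    i += 6;  // skip the second escape
                } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
                    throw std::invalid_argument("unpaired low surrogate");
                }
                append_utf8(out, cp);
                break;
            }
            default: throw std::invalid_argument("unknown escape");
        }
    }
    return out;
}
```

With this sketch, passing the question's text with literal backslashes (written in C++ source as "\\u0418\\u043d...") yields the UTF-8 bytes for "Индекс". A real parser also has to handle quoting, nesting, and control-character checks, which this does not attempt.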

Afterwards, if needed, you can use std::wstring_convert to convert any extracted std::wstring/std::u16string values back to UTF-8 to save memory.

Source

2016-11-25 20:26:58

I would be happy to use JSON for Modern C++, but when I try to parse JSON with it, I get an error: what(): parse error - unexpected '�'. The code is just: auto j3 = json::parse(json_string); –

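'�' is the Unicode code point U+FFFD REPLACEMENT CHARACTER (UTF-8 byte sequence 0xEF 0xBF 0xBD). Your JSON data was probably run through a charset conversion with the wrong charset before it was passed to the JSON parser. The "JSON for Modern C++" parser only accepts valid UTF-8 input. –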
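To find out whether the bytes are already damaged before they reach the parser, a standalone well-formedness check can help. The following is a minimal sketch with a hypothetical name, `is_valid_utf8`. It is not how any particular JSON library validates its input.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8: no stray continuation bytes,
// no truncated or overlong sequences, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // continuation bytes expected after the lead
        unsigned long cp;      // code point accumulated so far
        unsigned long min_cp;  // smallest code point allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // continuation byte or invalid lead byte
        if (extra > s.size() - i - 1) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += extra + 1;
    }
    return true;
}
```

Keep in mind that U+FFFD itself is valid UTF-8, so this check passes on data in which an earlier conversion has already substituted replacement characters. It only catches raw bytes that were never UTF-8, such as text still in a legacy single-byte charset.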
