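First, your use of std::wstring_convert is backwards. You have a UTF-8 encoded std::string that you want to convert to a wide Unicode string. You are getting a compiler error because to_bytes() does not take a std::string as input. It takes a std::wstring_convert::wide_string as input (which is std::u16string in your case, because you specialized on char16_t), so you need to use from_bytes() instead of to_bytes():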
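#include <codecvt>
#include <locale>
#include <string>
std::string str = "\u0418\u043d\u0434\u0435\u043a\u0441"; // UTF-8 bytes, assuming a UTF-8 execution character set
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
std::u16string dest = convert.from_bytes(str); // UTF-8 -> UTF-16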
Now, that being said, section 9 of the JSON specification states:
9 String
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.
\" represents the quotation mark character (U+0022).
\\ represents the reverse solidus character (U+005C).
\/ represents the solidus character (U+002F).
\b represents the backspace character (U+0008).
\f represents the form feed character (U+000C).
\n represents the line feed character (U+000A).
\r represents the carriage return character (U+000D).
\t represents the character tabulation character (U+0009).
So, for example, a string containing only a single reverse solidus character may be represented as "\\".
Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. Hexadecimal digits can be digits (U+0030 through U+0039) or the hexadecimal letters A through F in uppercase (U+0041 through U+0046) or lowercase (U+0061 through U+0066). So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
The following four cases all produce the same result:
"\u002F"
"\u002f"
"\/"
"/"
To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
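The raw JSON data itself may be encoded in UTF-8 (the most common encoding), UTF-16, etc., but regardless of the encoding used, the character sequence "\u0418\u043d\u0434\u0435\u043a\u0441" represents the UTF-16 code unit sequence U+0418 U+043d U+0434 U+0435 U+043a U+0441, which is the Unicode string "Индекс".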
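To make the surrogate-pair rule from the quoted section concrete, here is a minimal sketch of the arithmetic. The helper name to_surrogates is made up for illustration and is not part of any library; it assumes the input is already known to lie above U+FFFF.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper (not from any library): split a code point outside the
// Basic Multilingual Plane into its UTF-16 surrogate pair.
// Precondition (not checked here): 0x10000 <= cp <= 0x10FFFF.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    // Subtracting 0x10000 leaves a 20-bit value.
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
    // The top 10 bits go into the high surrogate (0xD800..0xDBFF).
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));
    // The bottom 10 bits go into the low surrogate (0xDC00..0xDFFF).
    const char16_t low = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    return {high, low};
}
```

For the G clef example, to_surrogates(0x1D11E) yields 0xD834 and 0xDD1E, which matches the "\uD834\uDD1E" escape in the quoted text.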
If you use an actual JSON parser (such as JSON for Modern C++, jsoncpp, RapidJSON, etc.), it will parse the UTF-16 code unit values for you and return a readable Unicode string.
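However, if you are processing the JSON data manually, you will have to decode any two-character \x and six-character \uXXXX escape sequences yourself. std::wstring_convert cannot do that for you. It can only convert the JSON std::string to a std::wstring/std::u16string, if that makes it easier for you to parse the data. But you still have to parse the content of the JSON separately.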
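If you do go the manual route, the decoding could look roughly like the sketch below. All three function names (append_utf8, read_hex4, decode_json_string) are invented for this example. It assumes the input is the text between the quotes of a single JSON string, and it produces UTF-8 directly rather than going through std::u16string. It does not reject unpaired surrogates, so treat it as an illustration rather than a validating parser.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Append one Unicode code point to `out`, encoded as UTF-8.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Read the four hexadecimal digits of a \uXXXX escape, starting at s[pos].
char32_t read_hex4(const std::string& s, std::size_t pos) {
    if (pos + 4 > s.size()) throw std::runtime_error("truncated \\u escape");
    char32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        value <<= 4;
        if (c >= '0' && c <= '9') value |= static_cast<char32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') value |= static_cast<char32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') value |= static_cast<char32_t>(c - 'A' + 10);
        else throw std::runtime_error("bad hex digit in \\u escape");
    }
    return value;
}

// Decode the body of a JSON string (without the surrounding quotes) to UTF-8.
std::string decode_json_string(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\') { out += in[i]; continue; }  // ordinary byte, copy as-is
        if (++i >= in.size()) throw std::runtime_error("dangling backslash");
        switch (in[i]) {
            case '"':  out += '"';  break;
            case '\\': out += '\\'; break;
            case '/':  out += '/';  break;
            case 'b':  out += '\b'; break;
            case 'f':  out += '\f'; break;
            case 'n':  out += '\n'; break;
            case 'r':  out += '\r'; break;
            case 't':  out += '\t'; break;
            case 'u': {
                char32_t cp = read_hex4(in, i + 1);
                i += 4;  // i now points at the last hex digit
                // A high surrogate followed by another \uXXXX low surrogate
                // is combined into one code point above U+FFFF.
                if (cp >= 0xD800 && cp <= 0xDBFF && i + 2 < in.size() &&
                    in[i + 1] == '\\' && in[i + 2] == 'u') {
                    const char32_t low = read_hex4(in, i + 3);
                    if (low >= 0xDC00 && low <= 0xDFFF) {
                        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                        i += 6;  // skip the second escape as well
                    }
                }
                append_utf8(out, cp);  // lone surrogates are NOT rejected here
                break;
            }
            default: throw std::runtime_error("unknown escape sequence");
        }
    }
    return out;
}
```

Note that the input here must contain the literal backslash characters, so in C++ source you would write it as "\\u0418\\u043d\\u0434\\u0435\\u043a\\u0441"; passing that to decode_json_string returns the UTF-8 bytes for "Индекс".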
Afterwards, if needed, you can use std::wstring_convert to convert any extracted std::wstring/std::u16string strings back to UTF-8 to save memory.
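As a rough sketch of both directions, the two wrappers below use the same converter type as the snippet at the top. The names utf8_to_utf16 and utf16_to_utf8 are made up for this example. Keep in mind that std::wstring_convert and <codecvt> are deprecated as of C++17, so a newer compiler may emit deprecation warnings for this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper: UTF-8 bytes -> UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);  // throws std::range_error on invalid input
}

// Hypothetical wrapper: UTF-16 code units -> UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);  // throws std::range_error on invalid input
}
```

Calling utf16_to_utf8(utf8_to_utf16(s)) on a valid UTF-8 string s should give back the same bytes.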
I'm pretty sure those functions have no clue about '\u' notation. – tadman
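They don't need to. The compiler takes care of it, since it handles Unicode escape sequences in string literals. If the OP wanted to keep the raw Unicode escape sequences in the string, he would use '\\u0418' etc. (and my answer would be different). –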