2012-10-24 123 views
2

C++ string to valid UTF-8 string using utf8proc

I have a std::string output. Using utf8proc, I want to convert it to a valid UTF-8 string. http://www.public-software-group.org/utf8proc-documentation

typedef int int32_t; 
#define ssize_t int 
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options) 
Reencodes the sequence of unicode characters given by the pointer buffer and length as UTF-8. The result is stored in the same memory area where the data is read. Following flags in the options field are regarded: (Documentation missing here) In case of success the length of the resulting UTF-8 string is returned, otherwise a negative error code is returned. 
WARNING: The amount of free space being pointed to by buffer, has to exceed the amount of the input data by one byte, and the entries of the array pointed to by str have to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash! 

So, first, how do I add an extra byte at the end? And then, how do I convert a std::string to an int32_t *buffer?

This doesn't work:

std::string g = output(); 
fprintf(stdout,"str: %s\n",g.c_str()); 
g += " "; //add an extra byte?? 
g = utf8proc_reencode((int*)g.c_str(), g.size()-1, 0); 
fprintf(stdout,"strutf8: %s\n",g.c_str()); 
+0

A 'std::string' is just a sequence of bytes. What encoding is your source 'std::string' in? –

+0

I cringe whenever I see 'printf' in a C++ program, especially for outputting strings. –

+0

@Charles Bailey: The output is not always in the same encoding. Usually it is UTF-8, but sometimes it is some encoding I don't know. –

Answers

0

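You most likely don't actually want utf8proc_reencode(). That function takes a valid UTF-32 buffer (one code point per int32_t) and turns it into a valid UTF-8 buffer. Since you say you don't know what encoding your data is in, you can't use that function: casting the bytes of a std::string to int32_t* does not make them code points.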
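To make the "UTF-32 in, UTF-8 out" idea concrete, here is a minimal standard-library sketch of what such a re-encoding step does conceptually. The helper name encode_utf8 is hypothetical and is not part of utf8proc; it writes to a separate std::string instead of re-encoding in place as utf8proc_reencode() does.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a utf8proc function): encode Unicode code points,
// one per 32-bit element, as a UTF-8 byte string.
// Returns false if any element is not a valid Unicode scalar value.
bool encode_utf8(const std::vector<std::uint32_t>& code_points, std::string& out) {
    out.clear();
    for (std::uint32_t cp : code_points) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;  // beyond Unicode range, or a UTF-16 surrogate
        }
        if (cp < 0x80) {            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return true;
}
```

The input here is a sequence of code points, not raw bytes. That is why the bytes of a std::string in an unknown encoding cannot simply be handed to a function of this kind.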

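So, first you need to determine how your data is actually encoded. You can use http://utfcpp.sourceforge.net/ to test whether you already have valid UTF-8, with utf8::is_valid(g.begin(), g.end()). If that returns true, you're done!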
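If pulling in a library is not an option, a check of this kind can be sketched with the standard library alone. The function is_valid_utf8 below is a hypothetical stand-in written for illustration, not the utfcpp implementation. It rejects truncated sequences, stray continuation bytes, overlong encodings, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustrative only, not utfcpp's code):
// returns true if s is a well-formed UTF-8 byte sequence.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;     // number of continuation bytes expected
        unsigned long cp = 0;      // decoded code point
        unsigned long min_cp = 0;  // smallest value allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;         // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += extra + 1;
    }
    return true;
}
```

Note that a positive result only means the bytes are well-formed UTF-8. Pure ASCII text, for instance, passes this check regardless of which ASCII-compatible encoding produced it.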

If it returns false, things get more complicated... but ICU (http://icu-project.org/) can help you; see http://userguide.icu-project.org/conversion/detection

Once you reliably know what encoding your data is in, ICU can help you again to get it to UTF-8. For example, assuming your source data g is in ISO-8859-1:

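#include <unicode/ucnv.h> // ICU converter API (ucnv_open, ucnv_toUChars, ...)
UErrorCode err = U_ZERO_ERROR; // check U_FAILURE(err) after every call... 
// CONVERT FROM ISO-8859-1 TO UChar (ICU's UTF-16 code unit type) 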
UConverter *conv_from = ucnv_open("ISO-8859-1", &err); 
std::vector<UChar> converted(g.size()*2); // *2 is usually more than enough 
int32_t conv_len = ucnv_toUChars(conv_from, &converted[0], converted.size(), g.c_str(), g.size(), &err); 
converted.resize(conv_len); 
ucnv_close(conv_from); 
// CONVERT FROM UChar TO UTF-8 
g.resize(converted.size()*4); 
UConverter *conv_u8 = ucnv_open("UTF-8", &err); 
int32_t u8_len = ucnv_fromUChars(conv_u8, &g[0], g.size(), &converted[0], converted.size(), &err); 
g.resize(u8_len); 
ucnv_close(conv_u8); 
After this, g holds UTF-8 data.
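For the specific ISO-8859-1 case, the conversion can also be sketched without ICU, because each ISO-8859-1 byte value equals the Unicode code point it represents (U+0000 to U+00FF). The helper latin1_to_utf8 below is a hypothetical name chosen for illustration. It only applies to ISO-8859-1 input, so ICU or a similar library is still needed for other source encodings.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
// Assumes the input really is ISO-8859-1; every byte maps to U+0000..U+00FF.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);  // each input byte yields at most 2 bytes
    for (std::string::size_type i = 0; i < latin1.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(latin1[i]);
        if (c < 0x80) {
            utf8 += static_cast<char>(c);                  // ASCII is unchanged
        } else {
            utf8 += static_cast<char>(0xC0 | (c >> 6));    // lead byte 110000xx
            utf8 += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10xxxxxx
        }
    }
    return utf8;
}
```

With this sketch, a call such as g = latin1_to_utf8(g); would replace the two ICU conversion steps for this one encoding.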