libxml2 xmlChar * to std::wstring

libxml2 appears to store all of its strings in UTF-8, as xmlChar *.

/** 
* xmlChar: 
* 
* This is a basic byte in an UTF-8 encoded string. 
* It's unsigned allowing to pinpoint case where char * are assigned 
* to xmlChar * (possibly making serialization back impossible). 
*/ 
typedef unsigned char xmlChar;

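Since libxml2 is a C library, it provides no routine for getting a std::wstring out of an xmlChar *. I am wondering whether the prudent way to convert an xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):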

std::wstring xmlCharToWideString(const xmlChar *xmlString) { 
    if(!xmlString){abort();} //provided string was null 
    int charLength = xmlStrlen(xmlString); //excludes null terminator 
    wchar_t *wideBuffer = new wchar_t[charLength]; 
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength); 
    if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed 
    std::wstring wideString(wideBuffer, wcharLength); 
    delete[] wideBuffer; 
    return wideString; 
}

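Edit: Just as an FYI, I am well aware of what xmlStrlen returns: the number of xmlChar units used to store the string. I know this is the number of unsigned chars, not the number of characters. It would have been less confusing had I named it byteLength, but I thought charLength was clearer since I have both charLength and wcharLength. As for the correctness of the code, wideBuffer will always be larger than or equal to the size required to hold the result (I believe), since characters that need more space than a wchar_t will be truncated (I think).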


2013-01-01 Mr. Smith

If you want to talk about the most prudent way to do this, avoid 'wchar_t' and 'wstring'. When working with Unicode, they are more trouble than they are worth.

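xmlStrlen() returns the number of UTF-8 code units in an xmlChar* string. That is not the number of wchar_t code units needed once the data is converted, so do not use xmlStrlen() to size your wchar_t string. Call std::mbstowcs() once to get the correct length, allocate the memory, then call mbstowcs() again to fill it. You also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially when multiple threads are involved). For example:

#include <clocale>
#include <cstdlib>
#include <string>
#include <libxml/xmlstring.h>

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null

    std::wstring wideString;

    if (xmlStrlen(xmlString) > 0)
    {
        // Copy the locale name: the pointer returned by setlocale()
        // may be invalidated by the next setlocale() call.
        const char *current = setlocale(LC_CTYPE, NULL);
        std::string origLocale = current ? current : "C";
        if (!setlocale(LC_CTYPE, "en_US.UTF-8")) { abort(); } //UTF-8 locale not available

        const char *src = (const char*) xmlString;

        // With a null destination, mbstowcs() returns the required number of
        // wide characters (a POSIX extension), excluding the null terminator.
        size_t wcharLength = mbstowcs(NULL, src, 0);
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            mbstowcs(&wideString[0], src, wcharLength);
        }

        setlocale(LC_CTYPE, origLocale.c_str());
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }

    return wideString;
}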

A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert, so that you do not have to deal with locales at all:

std::wstring xmlCharToWideString(const xmlChar *xmlString) 
{  
    if (!xmlString) { abort(); } //provided string was null 
    try 
    { 
     std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv; 
     return conv.from_bytes((const char*)xmlString); 
    } 
    catch(const std::range_error& e) 
    { 
     abort(); //wstring_convert failed 
    } 
}

Another option is to use an actual Unicode library, such as ICU or iconv, to handle the Unicode conversion.


2013-01-01 02:13:02

'std::codecvt_utf8' with 'std::wstring_convert' looks great. Thanks!

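There are a few problems in this code, besides the fact that you are using wchar_t and std::wstring, which is probably a bad idea unless you are making calls to the Windows API.

xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (a.k.a. bytes) in a string. It does not count the number of characters. This is all in the documentation.
Counting characters will not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted is not the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, which makes it 100% useless for portable code.
The mbstowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!
This code will leak memory if the std::wstring constructor throws an exception.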
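To make the sizing problem concrete, here is a minimal sketch that counts the same UTF-8 string three ways. The helpers countCodePoints and countUtf16Units are hypothetical names introduced only for this illustration, and both assume the input is valid, NUL-terminated UTF-8.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts Unicode code points in valid UTF-8 by
// skipping continuation bytes (those of the form 10xxxxxx).
std::size_t countCodePoints(const char *utf8)
{
    std::size_t count = 0;
    for (const unsigned char *p = reinterpret_cast<const unsigned char *>(utf8); *p; ++p) {
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Hypothetical helper: counts the UTF-16 code units the same text would
// need, i.e. the wchar_t count on a platform with 16-bit wchar_t.
std::size_t countUtf16Units(const char *utf8)
{
    std::size_t count = 0;
    for (const unsigned char *p = reinterpret_cast<const unsigned char *>(utf8); *p; ++p) {
        if ((*p & 0xC0) == 0x80)
            continue;                   // continuation byte, already accounted for
        count += (*p >= 0xF0) ? 2 : 1;  // 4-byte sequences need a surrogate pair
    }
    return count;
}
```

For the string "h\xC3\xA9 \xF0\x9F\x98\x80" ('h', U+00E9, a space, U+1F600), std::strlen() reports 8 bytes, countCodePoints() reports 4, and countUtf16Units() reports 5. None of the three numbers is the right wchar_t count on every platform, which is why the length has to come from the conversion routine itself.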

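My recommendations:

Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (other than the ability to make Windows API calls).
If you need UTF-32, use std::u32string. Remember that wstring has a platform-dependent encoding: it can be a variable-length encoding (Windows) or a fixed-length one (Linux, OS X).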
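If you go the std::u32string route, the conversion can be written by hand without touching locales. The following is a rough sketch, not a hardened decoder: utf8ToUtf32 is a hypothetical name, and for brevity it rejects bad lead bytes, truncated sequences and bad continuation bytes but does not reject overlong forms or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
std::u32string utf8ToUtf32(const std::string &utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the low six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With libxml2, the input would be passed as reinterpret_cast<const char *>(xmlString), since xmlChar is an unsigned char. The result has exactly one element per code point on every platform, which is the property wstring cannot promise.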

If you absolutely must have wchar_t, then chances are good that you are on Windows. Here is how you do it on Windows:

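#include <windows.h>
#include <cstring>
#include <exception>
#include <string>

std::wstring utf8_to_wstring(const char *utf8)
{
    // MultiByteToWideChar() takes the byte count as an int
    int utf8len = static_cast<int>(std::strlen(utf8));
    // First call: ask how many wchar_t units the result needs
    int wclen = MultiByteToWideChar(
        CP_UTF8, 0, utf8, utf8len, NULL, 0);
    wchar_t *wc = NULL;
    try {
        wc = new wchar_t[wclen];
        // Second call: perform the conversion into the buffer
        MultiByteToWideChar(
            CP_UTF8, 0, utf8, utf8len, wc, wclen);
        std::wstring wstr(wc, wclen);
        delete[] wc;
        wc = NULL;
        return wstr;
    } catch (std::exception &) {
        if (wc)
            delete[] wc;
        throw; // rethrow so no path falls off the end without returning
    }
}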

If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv, man 3 iconv_open, and man 3 iconv_close for the manual). You can specify "WCHAR_T" as one of the encodings for iconv().
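A minimal sketch of the iconv() route might look like the following. It assumes a glibc-style iconv() (input declared as char **, no extra library to link) and that the implementation accepts the "WCHAR_T" encoding name; utf8_to_wstring_iconv is a hypothetical name chosen for this example.

```cpp
#include <iconv.h>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts NUL-terminated UTF-8 to std::wstring via iconv().
std::wstring utf8_to_wstring_iconv(const char *utf8)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");  // (tocode, fromcode)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::size_t inLeft = std::strlen(utf8);
    // One wchar_t per input byte is always enough: a UTF-8 sequence of n
    // bytes yields at most n wide units in either UTF-16 or UTF-32.
    std::vector<wchar_t> buffer(inLeft + 1);
    std::size_t outBytes = buffer.size() * sizeof(wchar_t);
    std::size_t outLeft = outBytes;

    // Assumption: glibc declares the input as char **; the bytes are only read.
    char *in = const_cast<char *>(utf8);
    char *out = reinterpret_cast<char *>(buffer.data());

    std::size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed (invalid or incomplete UTF-8?)");

    std::size_t written = (outBytes - outLeft) / sizeof(wchar_t);
    return std::wstring(buffer.data(), written);
}
```

An xmlChar * can be passed after a cast to const char *. Using std::vector for the buffer means nothing leaks if the std::wstring constructor throws.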

Remember: you probably do not want wchar_t or std::wstring. What wchar_t does portably is not useful, and making it useful is not portable. C'est la vie.


2013-01-01 02:04:24
