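C++: check whether a string contains only valid UTF-8 characters

I want to use the ICU library to test whether a string contains invalid UTF-8 characters. I created a UTF-8 converter, but no invalid data gives me a conversion error. Thanks for your help.

Thanks, Prashant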

#include <unicode/ucnv.h>
#include <cassert>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    string str("AP1120 CorNet-IP v5.0 v5.0.1.22 òÀ MIB 1.5.3.50 Profile EN-C5000");
    // string str("example string here");
    // string str(" ����������");

    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open("utf-8", &status);
    assert(U_SUCCESS(status));

    // UTF-8 to UTF-16 never yields more code units than input bytes,
    // so 2 * length leaves room for the output and the terminating NUL.
    int32_t sourceLength = static_cast<int32_t>(str.length());
    int32_t targetLimit = 2 * sourceLength;
    UChar *target = new UChar[targetLimit];

    ucnv_toUChars(cnv, target, targetLimit, str.c_str(), sourceLength, &status);
    cout << u_errorName(status) << endl;   // no error shows up here, even for invalid input
    assert(U_SUCCESS(status));

    delete[] target;
    ucnv_close(cnv);
}


2012-03-02 user1245457

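I'm not familiar with this library, but it seems to me that if you open your converter with "utf-8" and then call `ucnv_toUChars` to convert, aren't you more or less telling it to convert Unicode to Unicode? In that case it might short-circuit and succeed. I'd try opening it with an ISO encoding or something similar. – AJG85 2012-03-02 20:14:21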

I modified your program to print out the actual strings, before and after the conversion:

#include <unicode/ucnv.h>
#include <string>
#include <iostream>
#include <cassert>
#include <cstdio>

int main()
{
    std::string str("22 òÀ MIB 1");
    UErrorCode status = U_ZERO_ERROR;
    UConverter * const cnv = ucnv_open("utf-8", &status);
    assert(U_SUCCESS(status));

    const int32_t targetLimit = static_cast<int32_t>(2 * str.size());
    UChar *target = new UChar[targetLimit];

    // A source length of -1 tells ICU that the input is NUL-terminated.
    ucnv_toUChars(cnv, target, targetLimit, str.c_str(), -1, &status);

    // First line of output: the converted UTF-16 code units.
    for (int32_t i = 0; i != targetLimit && target[i] != 0; ++i)
        std::printf("0x%04X ", static_cast<unsigned>(target[i]));
    std::cout << std::endl;
    // Second line of output: the raw input bytes.
    for (char c : str)
        std::printf("0x%02X ", static_cast<unsigned char>(c));
    std::cout << std::endl << "Status: " << status << std::endl;

    delete[] target;
    ucnv_close(cnv);
}

Now, with the default compiler settings, I get:

0x0032 0x0032 0x0020 0x00F2 0x00C0 0x0020 0x004D 0x0049 0x0042 0x0020 0x0031 
0x32 0x32 0x20 0xC3 0xB2 0xC3 0x80 0x20 0x4D 0x49 0x42 0x20 0x31

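That is, the input is already UTF-8. This is a conspiracy between my editor, which saves the file as UTF-8 (verifiable in a hex editor), and GCC, which sets the execution character set to UTF-8.

You can force GCC to change these parameters. For example, forcing the execution character set to ISO-8859-1 (via -fexec-charset=iso-8859-1) produces this: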

0x0032 0x0032 0x0020 0xFFFD 0xFFFD 0x0020 0x004D 0x0049 0x0042 0x0020 0x0031 
0x32 0x32 0x20 0xF2 0xC0 0x20 0x4D 0x49 0x42 0x20 0x31

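As you can see, the input is now ISO-8859-1 encoded, and the conversion promptly fails and produces the "invalid character" code point U+FFFD.

However, the conversion operation still returns a "success" status. It appears that the library does not treat conversion errors in user data as errors of the function call. Instead, the error status seems to be reserved for conditions such as running out of space.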
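Since the status code alone does not flag malformed input, one option is to validate the bytes directly before converting. The following is a minimal sketch, not ICU code: `is_valid_utf8` is a hypothetical helper written for illustration, and it assumes the strict definition of well-formed UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). Within ICU itself, installing `UCNV_TO_U_CALLBACK_STOP` through `ucnv_setToUCallBack` should make the converter report an error instead of substituting U+FFFD; that variant is not shown here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ICU): returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n) {
        const unsigned char lead = p[i];
        std::size_t len = 0;     // total number of bytes in this sequence
        unsigned long cp = 0;    // code point decoded so far
        unsigned long min = 0;   // smallest code point allowed for this length

        if (lead < 0x80) { ++i; continue; }                          // ASCII
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;       // stray continuation byte or 0xF8..0xFF

        if (n - i < len) return false;              // sequence is truncated

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate

        i += len;
    }
    return true;
}
```

Applied to the two byte dumps above, the UTF-8 bytes `C3 B2 C3 80` pass, while the ISO-8859-1 bytes `F2 C0` are rejected because `F2` announces a four-byte sequence and `C0` is not a continuation byte.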


2012-03-02 20:40:04

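Interesting, my guess was kind of close. +1 for the experiment. I was about to come back and post that ucnv_getInvalidUChars might be a better fit for the OP, but your answer may be better if it applies. – AJG85 2012-03-02 21:05:11

Thanks for the answer; now it makes sense why the conversion did not fail. For testing purposes, if I want to keep using the default gcc character set, is it possible to save the input in a raw format rather than UTF-8? – user1245457 2012-03-05 21:51:18

@user1245457: There is no input in the example, only hard-coded data in the source code. Nothing happens to actual *input*; it is just an opaque byte stream that you can save however you like. – 2012-03-05 21:57:31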
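For hard-coded test data, one way to avoid depending on the editor's encoding or the compiler's execution character set is to spell out the bytes with hex escapes. This is a sketch for illustration only; `latin1_sample` and `utf8_sample` are hypothetical names, and the byte values are the ones from the two dumps in the answer.

```cpp
#include <string>

// Hypothetical helpers, for illustration only.

// "22 òÀ MIB 1" with òÀ as the ISO-8859-1 bytes F2 C0 (not valid UTF-8).
std::string latin1_sample()
{
    return std::string("22 \xF2\xC0 MIB 1");
}

// The same text with òÀ as UTF-8: U+00F2 is C3 B2, U+00C0 is C3 80.
std::string utf8_sample()
{
    return std::string("22 \xC3\xB2\xC3\x80 MIB 1");
}
```

Hex escapes give the byte values directly, so these strings should hold the same bytes however the source file is saved and whichever `-fexec-charset` is in effect.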
