UTF-8中的std :: string的子串？ C++ 11

我需要獲取std :: string中前N個字符的子字符串，假定它是utf8。我學會了很難的方式，.substr不能正常工作......因爲......預計。UTF-8中的std :: string的子串？ C++ 11

參考：我的琴絃很可能是這樣的：使命：\ n \ N1億2千萬匹

2015-06-23 Jonny

的問題是，UTF-8是一種可變長度編碼，每個字符可以是一到六個字節。雖然可以使用'std :: string'來存儲UTF-8字符串，但不能直接使用標準函數。你可以使用'substr'函數，但你必須使用一些特殊的代碼來查找子字符串的實際開始和結束。除非您擔心空間，否則您可能希望在內部以固定長度編碼存儲字符串，如UTF-32。 –

Like [this]（http://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11）link說：「標準庫不支持Unicode（對於任何支持的合理含義）。 std :: string並不比std :: vector更好：：它完全忽略了Unicode（或任何其他表示/編碼），並簡單地將其內容視爲一個字節塊。 – paulsm4

即使使用UTF-32，也可以無意中將字符（例如，重音符）切掉。如果你真的需要，我會考慮ICU（http://site.icu-project.org）或一些類似的庫，以處理Unicode的所有榮耀。 –

我found此代碼，我只是嘗試一下。

std::string utf8_substr(const std::string& str, unsigned int start, unsigned int leng) 
{ 
    if (leng==0) { return ""; } 
    unsigned int c, i, ix, q, min=std::string::npos, max=std::string::npos; 
    for (q=0, i=0, ix=str.length(); i < ix; i++, q++) 
    { 
     if (q==start){ min=i; } 
     if (q<=start+leng || leng==std::string::npos){ max=i; } 

     c = (unsigned char) str[i]; 
     if  (
       //c>=0 && 
       c<=127) i+=0; 
     else if ((c & 0xE0) == 0xC0) i+=1; 
     else if ((c & 0xF0) == 0xE0) i+=2; 
     else if ((c & 0xF8) == 0xF0) i+=3; 
     //else if (($c & 0xFC) == 0xF8) i+=4; // 111110bb //byte 5, unnecessary in 4 byte UTF-8 
     //else if (($c & 0xFE) == 0xFC) i+=5; // 1111110b //byte 6, unnecessary in 4 byte UTF-8 
     else return "";//invalid utf8 
    } 
    if (q<=start+leng || leng==std::string::npos){ max=i; } 
    if (min==std::string::npos || max==std::string::npos) { return ""; } 
    return str.substr(min,max); 
}

更新：這很適合於我目前的問題。我必須將它與get-length-of-utf8encoded-stdsstring函數混合使用。

這種解決方案有一些警告它吐口水在我的編譯器：

Some warnings spit out by my compiler.

來源

2015-06-23 06:59:23 Jonny

C++已經有一個2字節的字符串，'std：wstring'，這是算法支持。在讀取重寫每個算法以處理使用ASCII字符串類型（std :: string）但表現得像別的東西的「魔術」字符串時，最好將UTF8內容轉換爲Unicode和'wstring'。 –

@PanagiotisKanavos：'std :: wstring '不是2個字節。請閱讀http://utf8everywhere.org/ – DanielKO

@DanielKO C++引用是可取的，是的，wchar_t是2個或更多字節依賴於實現 - 所以char16_t或char32_t是可取的。我發現事情再次發生變化，並且我們有映射到'char16_t *'或'char32_t'的Unicode文字。還有映射到'char *'的UTF8編碼文字！還有u16string和u32string。不知道他們的STL支持 - 誰移動了我的奶酪！ –

你可以使用升壓/區域設置庫utf8的字符串轉換爲wstring的。然後用正常.substr（）方法：

#include <iostream> 
#include <boost/locale.hpp> 

std::string ucs4_to_utf8(std::u32string const& in) 
{ 
    return boost::locale::conv::utf_to_utf<char>(in); 
} 

std::u32string utf8_to_ucs4(std::string const& in) 
{ 
    return boost::locale::conv::utf_to_utf<char32_t>(in); 
} 

int main(){ 

    std::string utf8 = u8"1億2千萬匹"; 

    std::u32string part = utf8_to_ucs4(utf8).substr(0,3); 

    std::cout<<ucs4_to_utf8(part)<<std::endl; 
    // prints : 1億2 
    return 0; 
}

來源

2015-06-23 12:40:10

'wstring'在一般情況下不會在單個'wchar_t'中存儲單個字符。這僅適用於unicode的受限子集。你的函數名是錯誤的：ucs4不適合16位的'wchar_t'。 – Yakk

@Yakk你是對的。我把它與char32_t混合在一起，它總是32位 - 對應於ucs4編碼。（我相應地更改了代碼片段。） –

仍缺少組合字符支持。您可能正在以意想不到的方式處理從左到右和從右到左的標記https://en.wikipedia.org/wiki/Bi-directional_text#Unicode_bidi_support。 [結合字形連接]（https://en.wikipedia.org/wiki/Combining_Grapheme_Joiner），[結合字符]（https://en.wikipedia.org/wiki/Combining_character），[BOM]（https：// en.wikipedia。org/wiki/Byte_order_mark），（已刪除的語言標籤），[變體選擇器]（https://en.wikipedia.org/wiki/Variant_form_%28Unicode%29）等。 – Yakk

基於this答案我已經寫了我的UTF8字符串函數：

void utf8substr(std::string originalString, int SubStrLength, std::string& csSubstring) 
{ 
    int len = 0, byteIndex = 0; 
    const char* aStr = originalString.c_str(); 
    size_t origSize = originalString.size(); 

    for (byteIndex=0; byteIndex < origSize; byteIndex++) 
    { 
     if((aStr[byteIndex] & 0xc0) != 0x80) 
      len += 1; 

     if(len >= SubStrLength) 
      break; 
    } 

    csSubstring = originalString.substr(0, byteIndex); 
}

來源

2016-01-22 06:49:07 Atul

UTF-8中的std :: string的子串？ C++ 11

回答

相關問題