How can I detect the width of a Unicode string in a terminal?

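I'm working on a terminal-based program that has Unicode support. In some cases I need to determine how many terminal columns a string will consume before I print it. Unfortunately, some characters are two columns wide (Chinese, etc.), but I found an answer indicating that a good way to detect full-width characters is to call u_getIntPropertyValue() from the ICU library.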

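Now I'm trying to parse the characters of my UTF-8 string and pass them to this function. The problem I'm having is that u_getIntPropertyValue() expects a UTF-32 code point.

What is the best way to get this from a UTF-8 string? I'm currently trying to do it with boost::locale (which is used elsewhere in my program), but I can't get a clean conversion. The UTF-32 strings that come out of boost::locale are prepended with a zero-width character to indicate byte order. Obviously I can just skip the first four bytes of the string, but is there a cleaner way to do this?

Here is my current, ugly solution: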

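#include <cstring>            // std::memcpy
#include <locale>
#include <string>
#include <boost/locale.hpp>
#include <unicode/uchar.h>    // u_getIntPropertyValue, UCHAR_EAST_ASIAN_WIDTH

inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    namespace ba = boost::locale::boundary;
    ba::ssegment_index map(ba::character, str.begin(), str.end(), loc);
    size_t widthCount = 0;
    for (ba::ssegment_index::iterator it = map.begin(); it != map.end(); ++it)
    {
        // Every character takes at least one column.
        ++widthCount;
        std::string utf32Char = boost::locale::conv::from_utf(it->str(), std::string("utf-32"));

        // Skip the 4-byte byte-order mark that the conversion prepends.
        UChar32 utf32Codepoint = 0;
        std::memcpy(&utf32Codepoint, utf32Char.c_str() + 4, sizeof(UChar32));

        // Full-width and wide characters take a second column.
        int width = u_getIntPropertyValue(utf32Codepoint, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }
    }
    return widthCount;
}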

2016-05-23 KyleL

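If you are already using ICU, why not use its UTF-8 to UTF-32 conversion? –

I'm not familiar with ICU. I was trying to use boost::locale to insulate myself from most of the complexity. Is there a simple way to get this UTF-32 code point directly from ICU? – KyleL

I'm not familiar with it either, but I know it has everything anyone could want from a Unicode library. Spend some time with Google and you will find it. –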

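UTF-32 is the direct representation of the "code point" of an individual character. So all you have to do is extract the code points from the UTF-8 characters and feed them to u_getIntPropertyValue.

I took your code and modified it to use u8_to_u32_iterator, which seems to be made for exactly this: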

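#include <locale>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <unicode/uchar.h>    // u_getIntPropertyValue, UCHAR_EAST_ASIAN_WIDTH

// loc is kept so the signature matches the question; this version does not use it.
inline size_t utf8PrintableSize(const std::string &str, std::locale /*loc*/)
{
    typedef boost::u8_to_u32_iterator<std::string::const_iterator> Utf32Iterator;

    size_t widthCount = 0;
    for (Utf32Iterator it(str.begin()), end(str.end()); it != end; ++it)
    {
        // Every code point takes at least one column.
        ++widthCount;

        // Full-width and wide characters take a second column.
        int width = u_getIntPropertyValue(*it, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }
    }
    return widthCount;
}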
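
For readers who want to see what "extracting the code point" from UTF-8 involves, here is a minimal hand-written sketch that uses only the standard library. The name decodeUtf8 is hypothetical and is not part of Boost or ICU. It is deliberately simplified: it replaces malformed sequences with U+FFFD but does not reject overlong encodings or surrogates, so a library decoder is still the safer choice in real code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Boost or ICU function): decode UTF-8 into
// UTF-32 code points. Malformed input yields U+FFFD. Overlong encodings
// and surrogate code points are NOT rejected in this sketch.
std::vector<char32_t> decodeUtf8(const std::string &s)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

Each returned value can be passed to u_getIntPropertyValue in the same way as the dereferenced iterator above.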

2016-05-23 19:10:20

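Thanks for the Boost implementation. Interesting that this is part of the regex library rather than locale. – KyleL

@牛米 is right: there is a simple way to do this directly with ICU. The updated code is below. I suspect I could probably just use UnicodeString and bypass boost::locale entirely here.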

#include <locale>
#include <string>
#include <boost/locale.hpp>
#include <unicode/stringpiece.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>

inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    namespace ba = boost::locale::boundary;
    ba::ssegment_index map(ba::character, str.begin(), str.end(), loc);
    size_t widthCount = 0;
    for (ba::ssegment_index::iterator it = map.begin(); it != map.end(); ++it)
    {
        ++widthCount;

        // Note: Some Unicode characters are 'full width' and consume more than one
        // column on output. We will increment widthCount one extra time for
        // these characters to ensure that space is properly allocated.
        icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(it->str()));
        UChar32 codePoint = ucs.char32At(0);

        int width = u_getIntPropertyValue(codePoint, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }
    }
    return widthCount;
}

2016-05-23 18:51:58 KyleL

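Don't forget to handle zero-width characters! – o11c

@o11c Do you know how to check for that? My possibly misguided googling has turned up nothing. – KyleL

Something like General_Category in {"Mn", "Me"}, or Default_Ignorable_Code_Point; the latter includes format characters, the soft hyphen, and so on. But you also have to do something more complicated for Hangul composition, which depends on what the preceding character is. – o11c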
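
To illustrate the zero-width check described in the comment above, here is a rough standard-library-only sketch. Every name in it is hypothetical, and both tables are tiny illustrative samples rather than real Unicode data: a handful of well-known zero-width ranges and, as a stand-in for the East_Asian_Width lookup, only the CJK Unified Ideographs block. In a program that already links ICU, the equivalent checks would probably be u_charType() compared against U_NON_SPACING_MARK and U_ENCLOSING_MARK, plus u_hasBinaryProperty() with UCHAR_DEFAULT_IGNORABLE_CODE_POINT. Hangul composition is not handled here, and terminals differ in how they treat some of these characters, notably the soft hyphen.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names throughout; none of this is ICU or Boost API.
struct CodePointRange { char32_t first; char32_t last; };

// Illustrative subset of zero-width code points, NOT a complete table.
static const CodePointRange kZeroWidthSample[] = {
    {0x00AD, 0x00AD},  // SOFT HYPHEN (Default_Ignorable_Code_Point)
    {0x0300, 0x036F},  // Combining Diacritical Marks (General_Category Mn)
    {0x200B, 0x200F},  // zero-width space, joiners, directional marks
    {0xFEFF, 0xFEFF},  // ZERO WIDTH NO-BREAK SPACE (byte-order mark)
};

bool isZeroWidthSample(char32_t cp)
{
    for (const CodePointRange &r : kZeroWidthSample) {
        if (cp >= r.first && cp <= r.last) return true;
    }
    return false;
}

// Stand-in for an East_Asian_Width lookup: CJK Unified Ideographs only.
bool isWideSample(char32_t cp)
{
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Count terminal columns: zero-width code points add nothing,
// wide ones add two, everything else adds one.
std::size_t columnCount(const std::vector<char32_t> &codePoints)
{
    std::size_t columns = 0;
    for (char32_t cp : codePoints) {
        if (isZeroWidthSample(cp)) continue;
        columns += isWideSample(cp) ? 2 : 1;
    }
    return columns;
}
```

The point of the sketch is the ordering: the zero-width test runs first, so a combining mark following a base letter contributes no column of its own.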
