C++ string processing with a UTF-8 locale

I want to do some simple string processing on UTF-8 text files. This means taking substrings out of a line and rearranging them.

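Since my Linux machine has a UTF-8 locale and I don't intend to run the program anywhere else, setting the locale to UTF-8 seemed the way to go. Adapting an example, I ended up with the test program below. If you give it a Greek word, it echoes the word correctly, but printing the result of substr produces only garbage. Is there another function I could use, or is using the UTF-8 locale entirely the wrong path?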

#include <clocale>
#include <iostream>
#include <string>

int main()
{
    std::setlocale(LC_ALL, "");  // adopt the environment's locale
    std::cout << "Enter greek word ";
    std::string wordgr;
    std::getline(std::cin, wordgr);
    std::cout << "The word is " << wordgr << "." << std::endl;
    // intended to extract the third letter
    std::string newwd = wordgr.substr(2, 1);
    std::cout << "3rd letter is " << newwd << " <" << std::endl;
    return 0;
}

Source

2014-01-06 daivid

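UTF-8 is a variable-length encoding: a given character takes between 1 and 4 bytes (the original design allowed up to 6). The substr() method *operates on bytes, not characters*, which leads to unexpected results. Greek characters are not single-byte characters in UTF-8. If you enter a 4-character Greek string and call `std::string::length()` on it, you get a result larger than 4 (most likely 8). –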

@KenP You should post this as an answer. :) – 0x499602D2

A very simple solution is to switch to wstring, wiostream and wchar_t throughout. –

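If you use UTF-8 in your application, you should consider a suitable library: utf8-cpp. std::string or std::wstring is not an option, because UTF-8 characters have variable length; check the wiki for more information.

Here is sample code demonstrating the concept.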

#include <clocale>
#include <iostream>
#include <string>
#include "source/utf8.h" // path to the utf8-cpp library header

int main()
{
    std::setlocale(LC_ALL, "");
    std::cout << "Enter greek word ";
    std::string wordgr;
    std::getline(std::cin, wordgr);
    std::cout << "The word is " << wordgr << "." << std::endl;

    // reject input that is not valid UTF-8
    std::string::iterator end_it = utf8::find_invalid(wordgr.begin(), wordgr.end());
    if (end_it != wordgr.end()) {
        std::cout << "Invalid utf-8 encoding" << std::endl;
        return 1;
    }

    // utf-8 string length, counted in code points
    std::cout << "Length is " << utf8::distance(wordgr.begin(), end_it) << std::endl;

    // utf-8 string symbol traverse: utf8::next steps over one code point
    std::string::iterator curr_it = wordgr.begin();
    while (curr_it != wordgr.end()) {
        std::string::iterator next_it = curr_it;
        utf8::next(next_it, wordgr.end());
        std::cout << std::string(curr_it, next_it) << " - ";
        curr_it = next_it;
    }
    std::cout << std::endl;
    return 0;
}

The output is as follows:

./a.out 
Enter greek word Вова 
The word is Вова. 
Length is 4 
В - о - в - а -

Source

2014-01-06 17:11:14 vershov

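@n.m.: How is it wrong? Even a wider CharT type cannot change the fundamental fact that `std::basic_string` is not a container of Unicode characters, nor can it make it one. You need an abstraction on top. –

@LightnessRacesInOrbit For the implementation in question, `std::wstring` *is* a container of Unicode code points. Code points are not quite characters, but utf8-cpp doesn't provide characters either. –

@n.m.: OK, then the answer is wrong because utf8-cpp isn't the solution :) Thanks –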

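UTF-8 is a variable-length encoding: a given character takes between 1 and 4 bytes (the original design allowed up to 6). The substr() method operates on bytes, not characters, which leads to unexpected results. Greek characters are not single-byte characters in UTF-8. If you enter a 4-character Greek string and call `std::string::length()` on it, you get a result larger than 4 (most likely 8).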

Source

2014-01-06 18:17:11

This works as expected on my system and on IDEOne.

#include <clocale>
#include <iostream>
#include <string>

int main()
{
    // adopt the environment's locale; the wide streams rely on it for encoding conversion
    std::setlocale(LC_ALL, "");
    std::wcout << L"Enter greek word ";
    std::wstring wordgr;
    std::getline(std::wcin, wordgr);
    std::wcout << L"The word is " << wordgr << L"." << std::endl;
    // on Linux each wchar_t holds one code point, so substr indexes by letter
    std::wstring newwd = wordgr.substr(2, 1);
    std::wcout << L"3rd letter is " << newwd << L" <" << std::endl;
    return 0;
}

Source

2014-01-06 20:13:28

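Nice simple example. Many thanks. – daivid

It does not work correctly: it prints "3rd letter is **?**" instead of the expected letter. – vershov

@vershov Your default locale is probably not UTF-8. What is your input (hex dump)? –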
