Cleaning a string of non-alphabetic characters

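I want to clean up a string in C++. I want to remove all non-alphabetic characters while keeping every kind of letter, English and non-English alike. One of my test programs looks like this: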

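#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string test = "Danish letters: Æ Ø Å !!!!!!??||~";
    cout << "Test = " << test << endl;

    // Replace every character that is neither a letter nor a space.
    for (int l = 0; l < test.size(); l++)
    {
        if (!isalpha(test.at(l)) && test.at(l) != ' ')
        {
            test.replace(l, 1, " nope");
        }
    }

    cout << "Test = " << test << endl;

    return 0;
}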

This gives me the output:

Test = Danish letters: Æ Ø Å !!!!!!??||~ 
Test = Danish letters nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope"

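So my question is: how do I remove "!!!!!!??||~" but not "ÆØÅ"?

I also tried tests such as

test.at(l)!='Å'

but it does not compile if I declare 'Å' as a char.

I have read about Unicode and UTF-8, but I don't really understand it.

Please help me :)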

2016-10-01 user2994461

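Well, you need to keep reading about Unicode and UTF-8 until you understand it, and then everything should be crystal clear. –

You may want to look at the SO question titled [How to strip all non-alphanumeric characters from a string](http://stackoverflow.com/questions/6319872/how-to-strip-all-non-alphanumeric-characters-from-a-string-in-c). I would also be interested to see whether [std::isalnum](http://en.cppreference.com/w/cpp/string/byte/isalnum) works in your case. – 2016-10-01 20:49:29

@RawN: Both of those links apply only to ASCII, and this question is (implicitly) about non-ASCII. –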

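char is meant for single-byte character sets such as ASCII, while you are trying to operate on a string that contains non-ASCII characters.

You are working with Unicode characters, so you need to use wide-string operations: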

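#include <clocale>
#include <cwctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    // Use the environment's locale (assumed here to be a UTF-8 one); in the
    // default "C" locale, iswalpha and wcout may not handle Æ Ø Å.
    setlocale(LC_ALL, "");

    wstring test = L"Danish letters: Æ Ø Å !!!!!!??||~";
    wcout << L"Test = " << test << endl;

    for (size_t i = 0; i < test.size(); i++) {
        if (!iswalpha(test.at(i)) && test.at(i) != L' ') {
            test.replace(i, 1, L" nope");
        }
    }

    wcout << L"Test = " << test << endl;

    return 0;
}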

You can also use Qt and its QString, in which case the same piece of code becomes:

// Requires <QString> and <QDebug>.
QString test = "Danish letters: Æ Ø Å !!!!!!??||~";
qDebug() << "Test =" << test;

for (int i = 0; i < test.size(); i++) {
    // isLetterOrNumber() also keeps digits; QChar::isLetter() keeps letters only.
    if (!test.at(i).isLetterOrNumber() && test.at(i) != ' ') {
        test.replace(i, 1, " nope");
    }
}

qDebug() << "Test = " << test;

2016-10-01 22:13:37

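Yes, this code leaves only English and non-English letters, because we are using iswalpha. –

Wow, my emoji idea was a bad one. Starting over: the C++ wide-character functions and classes only work on the Basic Multilingual Plane and fail when given characters from the supplementary planes, which currently contain 73,000 characters, some of which must be alphabetic. iswalpha is _broken_. https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane –

@MooingDuck The wide-character API works with an *implementation-defined* fixed-width encoding that may have nothing to do with Unicode. It can be based on UTF-16, as on Windows, with the effect that characters outside the BMP are not handled correctly, or it can use something like UTF-32, as on Linux, which makes full Unicode support possible. Or it can use a completely different character set. – nwellnhof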

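Here is a code sample you can play with: try different locales and experiment until you get what you want. You can also try u16string, u32string and so on. Working with locales is a bit confusing at first, since most people program in ASCII.

Call the function I wrote from your main function: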

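#include <iostream>
#include <locale>
#include <string>
using namespace std;

// Prints the input, then returns a copy that keeps only the characters the
// en_US.UTF-8 locale classifies as letters or whitespace.
wstring removeNonAlpha(const wstring &input) {
    locale utf8locale(locale(), new codecvt_byname<wchar_t, char, mbstate_t>("en_US.UTF-8"));
    wcout.imbue(utf8locale);
    wcout << input << endl;
    wstring res;
    // Throws std::runtime_error if the en_US.UTF-8 locale is not installed.
    locale loc2("en_US.UTF-8");
    for (wstring::size_type l = 0; l < input.size(); l++) {
        // Use wcout throughout: mixing cout and wcout on stdout is unreliable.
        if (isalpha(input[l], loc2) || isspace(input[l], loc2)) {
            wcout << L"is char\n";
            res += input[l];
        }
        else {
            wcout << L"is not char\n";
        }
    }
    wcout << L"Hello, wide to multibyte world!" << endl;
    wcout << res << endl;
    wcout << isalpha(L'Я', loc2) << endl;
    return res;
}

int main() {
    // Assumption: on some implementations (e.g. libstdc++) the imbued codecvt
    // only affects wcout once stdio synchronization is switched off.
    ios::sync_with_stdio(false);
    wstring test = L"Danish letters: Æ Ø Å !!!!!!??||~ Πυθαγόρας ὁ Σάμιος";
    removeNonAlpha(test);
    return 0;
}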

2016-10-01 23:31:05

'wchar_t' is not guaranteed to be wide enough to represent a Unicode code point. On Windows it is 16 bits and represents a UTF-16 code unit, not a code point. – roeland
