How to count Unicode characters with C++

I wrote some simple code to count the number of different characters in a text. The code is below:

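#include <cstdlib>   // system
#include <fstream>
#include <iostream>
#include <map>
using namespace std;

const char* filename = "text.txt";

int main()
{
    map<char, int> dict;   // keyed by single bytes, so only ASCII is counted correctly
    ifstream f(filename);
    char ch;
    while (f.get(ch))      // get() fails at end of file, so no separate eof() check is needed
    {
        cout << ch;
        dict[ch]++;        // operator[] value-initializes a missing entry to 0
    }
    f.close();
    cout << endl;
    for (auto it = dict.begin(); it != dict.end(); it++)
    {
        cout << it->first << ":\t" << it->second << endl;
    }
    system("pause");       // Windows-only: keeps the console window open
}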

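The program works well for counting ASCII characters, but it does not work for Unicode characters such as Chinese characters. How can I make it work with Unicode characters?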

2013-05-20 羅澤軒

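First, you will need to settle on an encoding. Do you know which encoding you are going to use? Then you need to figure out what exactly you mean by "character".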

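There is no such thing as a "Unicode character". See utf8everywhere.org for the distinction between the different notions of a character in Unicode, or the "how Twitter counts characters" article to compare different approaches. In either case, counting code points makes little sense.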
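
To make the difference between bytes, code points and perceived characters concrete, here is a minimal sketch. It assumes the input is already valid UTF-8, and the function name `count_code_points` is made up for this illustration. It counts code points by counting every byte that is not a continuation byte.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by counting every byte that is not a
// continuation byte (bit pattern 10xxxxxx).
// Assumption: the input is valid UTF-8; no validation is done here.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With this function, "é" stored as the single code point U+00E9 (bytes C3 A9) counts as 1, while the same visible letter stored as "e" followed by the combining acute accent U+0301 (bytes 65 CC 81) counts as 2. A reader sees one character in both cases, which is why the code point count alone may not be what you want.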

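You need a Unicode library to handle Unicode characters. Decoding, say, UTF-8 yourself would be a hard task and would reinvent the wheel.

A good one is mentioned in this Q/A from SO, and you will find more suggestions in the other answers there.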

2013-05-20 16:18:59

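In addition to ring0's reference, there is also a good explanation at http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring

C++11 handles Unicode, doesn't it? – cubuspl42

For something this simple, interpreting UTF-8 yourself is quite straightforward, and it avoids having to go through all the conversion work.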

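There are wide-character versions of everything, but if you want to do something very similar to what you have now and your file uses the 16-bit encoding of Unicode: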

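#include <cstdint>
#include <fstream>
#include <iostream>
#include <map>
using namespace std;

const char* filename = "text.txt";

int main()
{
    map<uint16_t, int> dict;
    ifstream f(filename, ios::binary);
    char lo, hi;
    // Read two bytes per character; stop at end of file (or on an odd trailing byte).
    while (f.get(lo) && f.get(hi))
    {
        // Beware endian issues here - should work either way for char counting though.
        // Go through unsigned char so that bytes >= 0x80 are not sign-extended.
        uint16_t val = static_cast<uint16_t>(
            static_cast<unsigned char>(lo) | (static_cast<unsigned char>(hi) << 8));

        if (val == 0) break;   // treat a 16-bit zero as a terminator

        cout << val << ' ';    // prints the numeric value, not the character
        dict[val]++;
    }
    f.close();
    cout << endl;
    for (auto it = dict.begin(); it != dict.end(); it++)
    {
        cout << it->first << ":\t" << it->second << endl;
    }
}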

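The code above makes a lot of assumptions (all characters are 16 bits, the file has an even number of bytes, and so on), but it should do what you want, or at least give you a quick idea of how it could work with wide characters.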
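
One of those assumptions, that every character fits in 16 bits, does not hold for code points above U+FFFF, which UTF-16 stores as a pair of surrogate units. The following is a rough sketch of how such pairs could be combined, not part of the answer's code. The function name `decode_utf16` and the choice to replace unpaired surrogates with U+FFFD are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Combines UTF-16 code units into code points.
// Assumption: unpaired surrogates are replaced with U+FFFD, which is just
// one possible policy picked for this sketch.
std::vector<char32_t> decode_utf16(const std::vector<std::uint16_t>& units)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i)
    {
        std::uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF)               // high surrogate
        {
            if (i + 1 < units.size() &&
                units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF)
            {
                char32_t hi = static_cast<char32_t>(u - 0xD800);
                char32_t lo = static_cast<char32_t>(units[i + 1] - 0xDC00);
                out.push_back(0x10000 + (hi << 10) + lo);
                ++i;                                  // consume the low surrogate too
            }
            else
            {
                out.push_back(0xFFFD);                // high surrogate without a partner
            }
        }
        else if (u >= 0xDC00 && u <= 0xDFFF)          // stray low surrogate
        {
            out.push_back(0xFFFD);
        }
        else
        {
            out.push_back(u);                         // ordinary 16-bit code point
        }
    }
    return out;
}
```

For example, the units 0xD83D 0xDE00 combine into the single code point U+1F600. Counting the decoded values in a map keyed by `char32_t` would then count code points rather than 16-bit units.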

2013-05-20 16:23:40

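Unfortunately, some characters are not 16 bits. The code just prints numbers to the screen (even though I used static_cast to change the type). I don't know how to map the numbers to real characters.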
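
One way to turn such a number back into something printable is to encode the code point as UTF-8 and write the resulting bytes. This is only a sketch: the function name `to_utf8` is made up here, and whether the bytes show up as the right character depends on the terminal or file viewer expecting UTF-8, which is an assumption about your environment.

```cpp
#include <string>

// Encodes one Unicode code point as a UTF-8 byte sequence.
// Returns an empty string for values that are not valid scalar values
// (surrogates U+D800..U+DFFF or anything above U+10FFFF).
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                               // surrogate: not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF)
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Writing `std::cout << to_utf8(0x6F22)` would then emit the three bytes E6 BC A2, which a UTF-8 terminal displays as 漢.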

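First, what do you want to count? Unicode code points or grapheme clusters, that is, characters in the encoding sense or characters as perceived by the reader? Also note that "wide characters" (16-bit characters) are not Unicode characters (UTF-16 is variable-length, just like UTF-8!).

In any case, get a library (such as ICU) to do the actual code point or cluster iteration. For counting, you need a suitable type to replace the char key type of your map: a 32-bit unsigned int for code points, or normalized strings for grapheme clusters. Normalization should, again, be taken care of by the library.

ICU: http://icu-project.org

Grapheme clusters: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

Normalization: http://unicode.org/reports/tr15/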
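
To illustrate only the key-type part of this advice, here is a hand-rolled sketch that decodes UTF-8 into 32-bit code points and counts them. It is a stand-in for what a library would do and is not ICU's API. The name `histogram_code_points` and the policy of counting malformed input under U+FFFD are assumptions for this example, and it does nothing about grapheme clusters or normalization.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Counts code points in a UTF-8 string, keyed by their 32-bit value.
// Assumption: malformed sequences are counted under U+FFFD.
// Simplification: overlong forms and encoded surrogates are not rejected.
std::map<char32_t, int> histogram_code_points(const std::string& utf8)
{
    std::map<char32_t, int> counts;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i++]);
        char32_t cp = 0;
        int extra = 0;                                // continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { ++counts[0xFFFD]; continue; }          // stray or invalid lead byte

        bool ok = true;
        for (; extra > 0; --extra)
        {
            if (i >= utf8.size() ||
                (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            {
                ok = false;                           // truncated or bad continuation
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i++]) & 0x3F);
        }
        ++counts[ok ? cp : static_cast<char32_t>(0xFFFD)];
    }
    return counts;
}
```

Because the keys are plain integers, the map is ordered by code point value. To count what a reader perceives as characters, the keys would instead have to be normalized grapheme cluster strings produced by a library, as described above.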

2013-05-20 16:31:01 Joe

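Yes. If you want to go beyond code points and deal with what a reader would consider a single character, it will be more work. You might also consider that most readers would regard 'A' and 'a' as the same character, or that 'a' and '' are the same character in French but different characters in Swedish.

You have this in English too. Although the diaeresis is no longer very popular, it is still sometimes used in words such as coöperate or naïve. – Joe

In German, you could even argue that ö should count as o and e, since technically it is a contraction of those two letters (rather than a letter in its own right, as in Swedish). – Joe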

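If you can compromise and just count code points, it is fairly simple and straightforward to work with UTF-8 directly. However, your dictionary must then be a std::map<std::string, int>. Once you have read the first char of a UTF-8 sequence: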

while (f.get(ch)) {
    // Sequence length indexed by the first byte; 0 marks a byte that cannot start a sequence.
    // The 5- and 6-byte entries come from the original UTF-8 definition; RFC 3629 allows at most 4 bytes.
    static size_t const charLen[] =
    {
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
      4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0,
    } ;
    int chLen = charLen[ static_cast<unsigned char>(ch) ];
    if (chLen <= 0) {
        // error: impossible first character for UTF-8
        continue;               // here: skip the stray byte
    }
    std::string codepoint(1, ch);
    bool valid = true;
    -- chLen;
    while (valid && chLen != 0) {
        if (!f.get(ch)) {
            // error: file ends in middle of a UTF-8 code point.
            valid = false;
        } else if ((ch & 0xC0) != 0x80) {
            // error: illegal following character in UTF-8
            f.unget();          // let the outer loop re-examine this byte
            valid = false;
        } else {
            codepoint += ch;
            -- chLen;           // one fewer continuation byte to read
        }
    }
    if (valid) {
        ++ dict[codepoint];
    }
}

You will notice that most of the code is concerned with error handling.

2013-05-20 16:39:45
