2015-05-17 52 views
1

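Counting/sorting characters in a text file

I am trying to write a program that reads a text file, sorts it by character, and keeps track of how many times each character appears in the document. This is what I have so far.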

class Program 
{ 
    static void Main(string[] args) 
    { 
     CharFrequency[] Charfreq = new CharFrequency[128]; 

     try 
     {    
     string line; 
     System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Users\User\Documents\Visual Studio 2013\Projects\Array_Project\wap.txt"); 
     while ((line = file.ReadLine()) != null) 
     { 
      int ch = file.Read(); 

      if (Charfreq.Contains(ch)) 
      { 

      }  
     } 

     file.Close(); 

     Console.ReadLine(); 
     } 
     catch (Exception e) 
     { 
      Console.WriteLine("The process failed: {0}", e.ToString()); 
     } 
    } 
} 

My question is: what should go in the if statement here?

I also have a CharFrequency class, which I include here in case it is helpful or necessary (and yes, I need to use an array rather than a List or an ArrayList).

public class CharFrequency 
{ 
    private char m_character; 
    private long m_count; 

    public CharFrequency(char ch) 
    { 
     Character = ch; 
     Count = 0; 
    } 

    public CharFrequency(char ch, long charCount) 
    { 
     Character = ch; 
     Count = charCount; 
    } 

    public char Character 
    { 
     set 
     { 
      m_character = value; 
     } 

     get 
     { 
      return m_character; 
     } 
    } 

    public long Count 
    { 
     get 
     { 
      return m_count; 
     } 
     set 
     { 
      if (value < 0) 
       value = 0; 

      m_count = value; 
     } 
    } 

    public void Increment() 
    { 
     m_count++; 

    } 

    public override bool Equals(object obj) 
    { 
     bool equal = false; 
     CharFrequency cf = new CharFrequency('\0', 0); 

     cf = (CharFrequency)obj; 

     if (this.Character == cf.Character) 
      equal = true; 

     return equal; 
    } 

    public override int GetHashCode() 
    { 
     return m_character.GetHashCode(); 
    } 

    public override string ToString() 
    { 
     String s = String.Format("'{0}' ({1})  = {2}", m_character, (byte)m_character, m_count); 

     return s; 
    } 

} 
+0

Are you reading char by char? If so, why do you have the ReadLine() call? –

+0

The ReadLine shouldn't be there; it is left over from an earlier version of the code. – Cheeseop

+0

Why not just do a "strbob = .ReadToEnd()", then loop over the character set and fill the array using strbob.length - strbob.replace(strloopchar).length()? –

Answers

1

You shouldn't use Contains.

First you need to initialize your Charfreq array:

CharFrequency[] Charfreq = new CharFrequency[128]; 

for (int i = 0; i < Charfreq.Length; i++) 
{ 
    Charfreq[i] = new CharFrequency((char)i); 
} 

try 

Then you can:

int ch; 

// -1 means that there are no more characters to read, 
// otherwise ch is the char read 
while ((ch = file.Read()) != -1) 
{ 
    CharFrequency cf = new CharFrequency((char)ch); 

    // This works because CharFrequency overrides the 
    // Equals method, and the Equals method checks only 
    // for the Character property of CharFrequency 
    int ix = Array.IndexOf(Charfreq, cf); 

    // if there is the "right" charfrequency 
    if (ix != -1) 
    { 
     Charfreq[ix].Increment(); 
    }  
} 

Note that this is not how I would write the program. It is the minimum change needed to make your program work.
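For comparison, here is a minimal sketch of one alternative that still uses an array, assuming only characters with code below 128 matter. Because the array is initialized so that index i holds character i, the character code can be used as the index directly, which avoids the linear Array.IndexOf search. The names CharCount, AsciiCounter, Count and SortByCount are hypothetical; CharCount is a simplified stand-in for the CharFrequency class from the question.

```csharp
using System;
using System.IO;

// Simplified, hypothetical stand-in for the question's CharFrequency class.
public class CharCount
{
    public char Character;
    public long Count;
}

public static class AsciiCounter
{
    // Counts every character with code < 128; all other characters are ignored.
    public static CharCount[] Count(TextReader reader)
    {
        var table = new CharCount[128];
        for (int i = 0; i < table.Length; i++)
        {
            table[i] = new CharCount { Character = (char)i };
        }

        int ch;
        // Read() returns -1 when there are no more characters.
        while ((ch = reader.Read()) != -1)
        {
            if (ch < table.Length)
            {
                table[ch].Count++; // the character code is the array index
            }
        }
        return table;
    }

    // Returns a copy of the table ordered by count, highest first.
    // Array.Sort is not stable, so the order of equal counts is unspecified.
    public static CharCount[] SortByCount(CharCount[] table)
    {
        var sorted = (CharCount[])table.Clone();
        Array.Sort(sorted, (a, b) => b.Count.CompareTo(a.Count));
        return sorted;
    }
}
```

To run it against a file, pass a StreamReader to AsciiCounter.Count (ideally inside a using block so the file is closed); a StringReader works the same way for quick experiments.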

As a side note, this program only counts the "frequency" of ASCII characters (characters with code <= 127).

Also, in your Equals method:

CharFrequency cf = new CharFrequency('\0', 0); 

cf = (CharFrequency)obj; 

is a useless initialization:

CharFrequency cf = (CharFrequency)obj; 

is enough; otherwise you create a CharFrequency only to throw it away immediately.
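Going one step further than the answer does, the direct cast in Equals throws an InvalidCastException when obj is some other type, and a NullReferenceException when obj is null. A hedged sketch of a more defensive version, shown on a trimmed-down CharFrequency that keeps only the members relevant to equality (this is an illustration, not the class from the question):

```csharp
using System;

// Trimmed-down illustration of CharFrequency: equality members only.
public class CharFrequency
{
    public char Character { get; private set; }

    public CharFrequency(char ch)
    {
        Character = ch;
    }

    public override bool Equals(object obj)
    {
        // "as" yields null instead of throwing when obj is null
        // or is not a CharFrequency.
        var other = obj as CharFrequency;
        return other != null && Character == other.Character;
    }

    // Must stay consistent with Equals: equal characters, equal hash codes.
    public override int GetHashCode()
    {
        return Character.GetHashCode();
    }
}
```

Array.IndexOf keeps working with this version, since two CharFrequency instances with the same Character still compare equal.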

1

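A Dictionary is well suited to a task like this. You didn't say which character set and encoding the file uses. Since Unicode is so common, let's assume the Unicode character set and the UTF-8 encoding. (After all, Unicode is the character set of .NET, Java, JavaScript, HTML, XML, and so on.) If that is not the case, read the file with the applicable encoding and fix your code, because your StreamReader currently uses UTF-8.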

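The next step is to iterate over the "characters", incrementing the count for each "character" in the dictionary as it is seen in the text.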

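Unicode does have some complicating features. One is combining characters, where a base character can be overlaid with diacritical marks and the like. Users see such a combination as a single "character" or, as Unicode calls it, a grapheme. Thankfully, .NET provides the StringInfo class, which iterates over them as "text elements".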
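To make the difference between char values and text elements concrete, here is a small sketch that splits a string into text elements with StringInfo. The class and method names (TextElements, Split) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Splits a string into text elements: a base character together
    // with its combining marks, or a surrogate pair, is one element.
    public static List<string> Split(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }
        return elements;
    }
}
```

For example, the string "e\u0301a" ('e' followed by a combining acute accent, then 'a') has a Length of 3 char values, but TextElements.Split returns two elements: the accented e and the a.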

So, if you think about it, using an array would be very difficult. You would have to build your own dictionary on top of the array.

The example below uses a Dictionary and can be run as a LINQPad script. After building the dictionary, it sorts it and dumps it with a nice display.

var path = Path.GetTempFileName(); 
// Get some text we know is encoded in UTF-8 to simplify the code below 
// and contains combining codepoints as a matter of example. 
using (var web = new WebClient()) 
{ 
    web.DownloadFile("http://superuser.com/questions/52671/which-unicode-characters-do-smilies-like-%D9%A9-%CC%AE%CC%AE%CC%83-%CC%83%DB%B6-consist-of", path); 
} 
// since the question asks to analyze a file 
var content = File.ReadAllText(path, Encoding.UTF8); 
var frequency = new Dictionary<String, int>(); 
var itor = System.Globalization.StringInfo.GetTextElementEnumerator(content); 
while (itor.MoveNext()) 
{ 
    var element = (String)itor.Current; 
    if (!frequency.ContainsKey(element)) 
    { 
     frequency.Add(element, 0); 
    } 
    frequency[element]++; 
} 
var histogram = frequency 
    .OrderByDescending(f => f.Value) 
    // jazz it up with the list of codepoints in each text element 
    .Select(pair => 
     { 
      var bytes = Encoding.UTF32.GetBytes(pair.Key); 
      var codepoints = new UInt32[bytes.Length/4]; 
      Buffer.BlockCopy(bytes, 0, codepoints, 0, bytes.Length); 
      return new { 
       Count = pair.Value, 
       textElement = pair.Key, 
       codepoints = codepoints.Select(cp => String.Format("U+{0:X4}", cp)) }; 
     }); 
histogram.Dump(); // For use in LINQPad 
+0

Wow! I hadn't noticed the overkill handling of surrogate pairs and combining characters! I always like it when Unicode is handled correctly! :-) – xanatos
