How can I speed up this code?

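I have the following method, which reads a txt file and returns a dictionary. Reading a ~5 MB file (67,000 lines, 70 characters per line) takes about 7 minutes. How can I speed this code up?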

public static Dictionary<string, string> FASTAFileReadIn(string file) 
{ 
    Dictionary<string, string> seq = new Dictionary<string, string>(); 

    Regex re; 
    Match m; 
    GroupCollection group; 
    string currentName = string.Empty; 

    try 
    { 
     using (StreamReader sr = new StreamReader(file)) 
     { 
      string line = string.Empty; 
      while ((line = sr.ReadLine()) != null) 
      { 
       if (line.StartsWith(">")) 
       {// Match Sequence 
        re = new Regex(@"^>(\S+)"); 
        m = re.Match(line); 
        if (m.Success) 
        { 
         group = m.Groups; 
         if (!seq.ContainsKey(group[1].Value)) 
         { 
          seq.Add(group[1].Value, string.Empty); 
          currentName = group[1].Value; 
         } 
        } 
       } 
       else if (Regex.Match(line.Trim(), @"\S+").Success && 
          currentName != string.Empty) 
       { 
        seq[currentName] += line.Trim(); 
       } 
      } 
     } 
    } 
    catch (IOException e) 
    { 
     Console.WriteLine("An IO exception has been thrown!"); 
     Console.WriteLine(e.ToString()); 
    } 
    finally { } 

    return seq; 
}

Which parts of the code are the most time-consuming, and how can I speed them up?

Thanks

Source

2012-07-24 Mavershang

Related: http://stackoverflow.com/questions/3927/what-are-some-good-net-profilers – 2012-07-24 03:05:33

@Brian, thanks, that saves some time. :) – sarnold 2012-07-24 03:05:49
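
Short of a full profiler, a rough first measurement can be taken with System.Diagnostics.Stopwatch. The sketch below is only an illustration: the 'Timing' class and 'Measure' method are hypothetical names, not part of the original post.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action once and returns the wall-clock time it took.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Something like 'Timing.Measure(() => FASTAFileReadIn(path))' gives the total; wrapping smaller pieces of the loop in the same way narrows down where the time goes.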

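Don't create a new regex every time. Create it once, and use the 'RegexOptions.Compiled' flag for extra performance. – Ryan 2012-07-24 03:06:55

Cache and compile the regular expressions, reorder the conditionals, reduce the number of Trim() calls, and so on.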

public static Dictionary<string, string> FASTAFileReadIn(string file) { 
    var seq = new Dictionary<string, string>(); 

    Regex re = new Regex(@"^>(\S+)", RegexOptions.Compiled); 
    Regex nonWhitespace = new Regex(@"\S", RegexOptions.Compiled); 
    Match m; 
    string currentName = string.Empty; 

    try { 
     foreach(string line in File.ReadLines(file)) { 
      if(line.Length > 0 && line[0] == '>') { // length check: line[0] throws on an empty line
       m = re.Match(line); 

       if(m.Success) { 
        if(!seq.ContainsKey(m.Groups[1].Value)) { 
         seq.Add(m.Groups[1].Value, string.Empty); 
         currentName = m.Groups[1].Value; 
        } 
       } 
      } else if(currentName != string.Empty) { 
       if(nonWhitespace.IsMatch(line)) { 
        seq[currentName] += line.Trim(); 
       } 
      } 
     } 
    } catch(IOException e) { 
     Console.WriteLine("An IO exception has been thrown!"); 
     Console.WriteLine(e.ToString()); 
    } 

    return seq; 
}

That's just a naïve optimization, though. After reading up on the FASTA format, I wrote this:

public static Dictionary<string, string> ReadFasta(string filename) { 
    var result = new Dictionary<string, string>(); 
    var current = new StringBuilder(); 
    string currentKey = null; 

    foreach(string line in File.ReadLines(filename)) { 
     if(line.Length > 0 && line[0] == '>') { // length check: line[0] throws on an empty line
      if(currentKey != null) { 
       result.Add(currentKey, current.ToString()); 
       current.Clear(); 
      } 

      int i = line.IndexOf(' ', 2); 

      currentKey = i > -1 ? line.Substring(1, i - 1) : line.Substring(1); 
     } else if(currentKey != null) { 
      current.Append(line.TrimEnd()); 
     } 
    } 

    if(currentKey != null) 
     result.Add(currentKey, current.ToString()); 

    return result; 
}

Let me know if it works; it should be much faster.

Source

2012-07-24 03:14:27 Ryan

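Does 'string line in File.ReadAllLines()' build the entire (array? list?) from the file in one go, or does it build each "line" on demand? – sarnold 2012-07-24 03:16:59

@sarnold: Sorry, you're right. I meant 'ReadLines()', which creates an 'IEnumerable<string>'. (Although if the file is only 5 MB, reading it all in at the start might actually be beneficial...) – Ryan 2012-07-24 03:18:21

Yeah, at five megs it probably doesn't matter. But I've seen some _huge_ FASTA files.. – sarnold 2012-07-24 03:19:38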
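
To make the difference discussed in these comments concrete, here is a small sketch that counts header lines both ways. 'LineReading' and its two methods are made-up names for illustration; the only real APIs used are File.ReadAllLines and File.ReadLines.

```csharp
using System.IO;
using System.Linq;

public static class LineReading
{
    // File.ReadAllLines loads every line into a string[] before returning,
    // so the whole file is in memory at once.
    public static int CountHeadersEager(string path)
    {
        string[] all = File.ReadAllLines(path);
        return all.Count(l => l.Length > 0 && l[0] == '>');
    }

    // File.ReadLines returns a lazy IEnumerable<string>; lines are read
    // as the loop advances, so the full file is never held in memory.
    public static int CountHeadersLazy(string path)
    {
        int count = 0;
        foreach (string line in File.ReadLines(path))
        {
            if (line.Length > 0 && line[0] == '>')
                count++;
        }
        return count;
    }
}
```

Both return the same count; the lazy version is the safer default when the input might be one of those huge FASTA files.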

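I'd hope the compiler would do this for you automatically, but the first thing I noticed is that you're re-compiling the regular expression on every matching line: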

  while ((line = sr.ReadLine()) != null) 
      { 
       if (line.StartsWith(">")) 
       {// Match Sequence 
        re = new Regex(@"^>(\S+)");

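Even better if you can remove the regular expressions completely; most languages provide a split function of some sort that often handily beats regular expressions...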
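
As a rough sketch of the regex-free idea, the sequence name can be pulled out of a header line with plain string operations. This assumes the name is everything between '>' and the first space or tab, which is what '^>(\S+)' captures for typical headers. 'HeaderParser' and 'GetName' are hypothetical names.

```csharp
using System;

public static class HeaderParser
{
    // Assumption: names are terminated by a space or a tab.
    private static readonly char[] Whitespace = { ' ', '\t' };

    // Returns the name from a header line such as ">seq1 description",
    // or null when the line is not a header or carries no name.
    public static string GetName(string line)
    {
        if (string.IsNullOrEmpty(line) || line[0] != '>')
            return null;

        int end = line.IndexOfAny(Whitespace, 1);
        string name = end < 0 ? line.Substring(1) : line.Substring(1, end - 1);
        return name.Length > 0 ? name : null;
    }
}
```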

Source

2012-07-24 03:08:42 sarnold

Agreed, 're' should definitely be defined outside the loop. – matchdav 2012-07-24 03:11:33

I ran benchmarks on this, and the best approach is to make them static and use 'RegexOptions.Compiled'. – 2012-07-24 03:22:42
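
A minimal sketch of what "static and compiled" could look like: the pattern lives in a static readonly field, so it is constructed once rather than once per header line. The names 'FastaPatterns' and 'TryGetName' are invented for this example.

```csharp
using System.Text.RegularExpressions;

public static class FastaPatterns
{
    // Built once, when the type is first used. RegexOptions.Compiled trades
    // a higher one-time construction cost for faster matching afterwards.
    private static readonly Regex HeaderName =
        new Regex(@"^>(\S+)", RegexOptions.Compiled);

    // Returns true and sets name for a header such as ">seq1 description";
    // otherwise returns false and sets name to the empty string.
    public static bool TryGetName(string line, out string name)
    {
        Match m = HeaderName.Match(line);
        name = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

Whether the Compiled flag pays off depends on how many lines are matched; for a one-off run over a small file, hoisting the regex out of the loop is likely the larger win.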

You can substantially improve read speed by using a BufferedStream:

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)) 
using (BufferedStream bs = new BufferedStream(fs)) 
using (StreamReader sr = new StreamReader(bs)) 
{ 
    // Use the StreamReader 
}

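The Regex recompilation that @sarnold mentions is probably your biggest performance killer, though, if your processing time is around 5 minutes.

Source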

2012-07-24 03:10:49

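Ha, when I saw your answer, my first thought was, "Hey, I bet that's where 90% of the slowdown comes from." – sarnold 2012-07-24 03:15:41

Here's how I would write it. Without more information (i.e., how long the average dictionary entry is), I can't optimize the StringBuilder's capacity. You could also follow Eric J.'s suggestion and add a BufferedStream. Ideally, if you want the best performance, you would do away with regular expressions entirely, but they are much easier to write and maintain, so I understand why you'd want to use them.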

public static Dictionary<string, StringBuilder> FASTAFileReadIn(string file) 
{ 
    var seq = new Dictionary<string, StringBuilder>(); 
    var regName = new Regex("^>(\\S+)", RegexOptions.Compiled); 
    var regAppend = new Regex("\\S+", RegexOptions.Compiled); 

    Match tempMatch = null; 
    string currentName = string.Empty; 
    try 
    { 
     using (StreamReader sReader = new StreamReader(file)) 
     { 
      string line = string.Empty; 
      while ((line = sReader.ReadLine()) != null) 
      { 
       if ((tempMatch = regName.Match(line)).Success) 
       { 
        if (!seq.ContainsKey(tempMatch.Groups[1].Value)) 
        { 
         currentName = tempMatch.Groups[1].Value; 
         seq.Add(currentName, new StringBuilder()); 
        } 
       } 
       else if ((tempMatch = regAppend.Match(line)).Success && currentName != string.Empty) 
       { 
        seq[currentName].Append(tempMatch.Value); 
       } 
      } 
     } 
    } 
    catch (IOException e) 
    { 
     Console.WriteLine("An IO exception has been thrown!"); 
     Console.WriteLine(e.ToString()); 
    } 

    return seq; 
}

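As you can see, I changed your dictionary slightly to use the StringBuilder class, which is optimized for appending values. I also precompile the regular expressions once, so that the same regex isn't recompiled over and over. I also pulled your "append" case out into a compiled regex of its own.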
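
To show why the StringBuilder change matters, here is a small side-by-side sketch. 'Concat', its two methods, and the 'capacityHint' parameter are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.Text;

public static class Concat
{
    // Each += allocates a new string and copies everything accumulated so far,
    // so the total copying grows roughly quadratically with the number of lines.
    public static string WithPlusEquals(IEnumerable<string> lines)
    {
        string result = string.Empty;
        foreach (string line in lines)
            result += line;
        return result;
    }

    // StringBuilder appends into a growable buffer and builds the final string
    // once. capacityHint is an optional estimate of the final length.
    public static string WithStringBuilder(IEnumerable<string> lines, int capacityHint = 16)
    {
        var sb = new StringBuilder(capacityHint);
        foreach (string line in lines)
            sb.Append(line);
        return sb.ToString();
    }
}
```

Both methods produce the same string. If the typical sequence length is known, passing it as the capacity hint avoids some of the buffer regrowth.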

Let me know if this helps you performance-wise.

Source

2012-07-24 03:32:38
