2016-06-28 66 views
0

Merge two text files and remove duplicates

I have 2 text files like the following (large numbers such as 1466786391 are unique timestamps):

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 49 packets received, 2% packet loss 
round-trip min/avg/max = 20.917/70.216/147.258 ms 
1466786342 
PING 10.0.0.6 (10.0.0.6): 56 data bytes 

.... 

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 50 packets received, 0% packet loss 
round-trip min/avg/max = 29.535/65.768/126.983 ms 
1466786391 

And this:

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 49 packets received, 2% packet loss 
round-trip min/avg/max = 20.917/70.216/147.258 ms 
1466786342 
PING 10.0.0.6 (10.0.0.6): 56 data bytes 

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 50 packets received, 0% packet loss 
round-trip min/avg/max = 29.535/65.768/126.983 ms 
1466786391 
PING 10.0.0.6 (10.0.0.6): 56 data byte 

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 44 packets received, 12% packet loss 
round-trip min/avg/max = 30.238/62.772/102.959 ms 
1466786442 
PING 10.0.0.6 (10.0.0.6): 56 data bytes 
.... 

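So the first file ends with a timestamp, and the second file contains the same block of data somewhere in the middle, followed by more data. Everything before that particular timestamp is exactly the same as in the first file.

So I want the output to look like this: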

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

....

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....

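That is, concatenate the two files and create a third one with the duplicates from the second file removed (the text blocks that already exist in the first file). Here is my code: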

public static void UnionFiles() 
{ 

    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http"); 
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat"); 
    var union = Enumerable.Empty<string>(); 

    foreach (string filePath in Directory
        .EnumerateFiles(folderPath, "*.txt")
        .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        union = union.Union(File.ReadAllLines(filePath));
    }
    File.WriteAllLines(outputFilePath, union); 
} 

This is the wrong output I get (the file structure is destroyed):

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 49 packets received, 2% packet loss 
round-trip min/avg/max = 20.917/70.216/147.258 ms 
1466786342 
PING 10.0.0.6 (10.0.0.6): 56 data bytes 

--- 10.0.0.6 ping statistics --- 
50 packets transmitted, 50 packets received, 0% packet loss 
round-trip min/avg/max = 29.535/65.768/126.983 ms 
1466786391 
round-trip min/avg/max = 30.238/62.772/102.959 ms 
1466786442 
round-trip min/avg/max = 5.475/40.986/96.964 ms 
1466786492 
round-trip min/avg/max = 5.276/61.309/112.530 ms 

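Edit: this code was written to handle multiple files, but I would be happy even if it worked correctly for just 2.

However, it does not remove text blocks. Instead it removes several useful lines and makes the output completely useless. I'm stuck.

How can I achieve this? Thanks.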

+0

'union = union.Union(File.ReadAllLines(filePath));' Shouldn't this create a set union and thereby remove the duplicate blocks? –

+0

Yes, it should. I assume a format (UTF8?) or whitespace problem? – Ouarzy

+0

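You need to actually _parse_ the files and extract the individual blocks for comparison, as Ouarzy suggested. Everything else will lead to ugly, unmaintainable hacks. –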
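
The effect discussed in these comments can be reproduced in a few lines. The following is only a minimal sketch, using two shortened excerpts of the logs above held in memory; `LineUnionDemo` and `UnionLines` are hypothetical names introduced for illustration. `Enumerable.Union` returns each distinct element once, so a line that repeats in every block (such as the statistics header) survives only the first time it is seen.

```csharp
using System;
using System.Linq;

public static class LineUnionDemo
{
    // Line-level set union of two "files" given as arrays of lines.
    public static string[] UnionLines(string[] first, string[] second)
    {
        return first.Union(second).ToArray();
    }

    public static void Main()
    {
        var first = new[]
        {
            "--- 10.0.0.6 ping statistics ---",
            "50 packets transmitted, 50 packets received, 0% packet loss",
            "round-trip min/avg/max = 29.535/65.768/126.983 ms",
            "1466786391",
        };
        var second = new[]
        {
            "--- 10.0.0.6 ping statistics ---", // same text as the first header
            "50 packets transmitted, 44 packets received, 12% packet loss",
            "round-trip min/avg/max = 30.238/62.772/102.959 ms",
            "1466786442",
        };

        // The second block is written without its header line,
        // because that exact string was already emitted once.
        foreach (var line in UnionLines(first, second))
        {
            Console.WriteLine(line);
        }
    }
}
```

Seven lines are printed instead of eight: the second block loses its header, which is the same kind of damage visible in the broken output in the question.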

Answers

3

I think you want to compare blocks, not individual lines.

Something like this should work:

// Requires: using System.Collections.Generic; using System.IO; using System.Linq; using System.Text;
public static void UnionFiles()
{
    var firstFilePath = "log1.txt";
    var secondFilePath = "log2.txt";

    var firstLogBlocks = ReadFileAsLogBlocks(firstFilePath);
    var secondLogBlocks = ReadFileAsLogBlocks(secondFilePath);

    // Union compares whole blocks through LogBlock.Equals/GetHashCode.
    var cleanLogBlock = firstLogBlocks.Union(secondLogBlocks);

    var cleanLog = new StringBuilder();
    foreach (var block in cleanLogBlock)
    {
        cleanLog.Append(block);
    }

    File.WriteAllText("cleanLog.txt", cleanLog.ToString());
}

private static List<LogBlock> ReadFileAsLogBlocks(string filePath)
{
    var allLinesLog = File.ReadAllLines(filePath);

    var logBlocks = new List<LogBlock>();
    var currentBlock = new List<string>();

    // Every 5 non-empty lines form one block; blank lines are skipped.
    // A trailing group with fewer than 5 lines is not added.
    var i = 0;
    foreach (var line in allLinesLog)
    {
        if (!string.IsNullOrEmpty(line))
        {
            currentBlock.Add(line);
            if (i == 4)
            {
                logBlocks.Add(new LogBlock(currentBlock.ToArray()));
                currentBlock.Clear();
                i = 0;
            }
            else
            {
                i++;
            }
        }
    }

    return logBlocks;
}

With LogBlock defined as follows:

public class LogBlock
{
    private readonly string[] _logs;

    public LogBlock(string[] logs)
    {
        _logs = logs;
    }

    public override string ToString()
    {
        var logBlock = new StringBuilder();
        foreach (var log in _logs)
        {
            logBlock.AppendLine(log);
        }

        return logBlock.ToString();
    }

    public override bool Equals(object obj)
    {
        return obj is LogBlock && Equals((LogBlock)obj);
    }

    private bool Equals(LogBlock other)
    {
        return _logs.SequenceEqual(other._logs);
    }

    public override int GetHashCode()
    {
        var hashCode = 0;
        foreach (var log in _logs)
        {
            hashCode += log.GetHashCode();
        }
        return hashCode;
    }
}

Take care to override Equals for LogBlock and to provide a consistent GetHashCode implementation, because Union uses both of them, as explained here
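
To make the Equals/GetHashCode point concrete, here is a small sketch that is not part of the answer's code. It uses plain `string[]` blocks and a hypothetical `BlockComparer` passed to the `Union` overload that accepts an `IEqualityComparer<T>`. It assumes blocks and their lines are never null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two blocks are equal when their lines match in order.
public sealed class BlockComparer : IEqualityComparer<string[]>
{
    public bool Equals(string[] x, string[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(string[] block)
    {
        // Must agree with Equals: equal blocks yield equal hash codes.
        unchecked
        {
            var hash = 17;
            foreach (var line in block)
            {
                hash = hash * 31 + line.GetHashCode();
            }
            return hash;
        }
    }
}

public static class BlockUnionDemo
{
    public static void Main()
    {
        var a = new[] { new[] { "x", "1" } };
        var b = new[] { new[] { "x", "1" }, new[] { "y", "2" } };

        // Default comparer: arrays are compared by reference, nothing is removed.
        Console.WriteLine(a.Union(b).Count());

        // Value comparer: the repeated block is recognised as a duplicate.
        Console.WriteLine(a.Union(b, new BlockComparer()).Count());
    }
}
```

Without value equality the union keeps all three arrays. With the comparer it keeps two, which is the behaviour the LogBlock overrides are meant to provide.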

+0

No, I checked the MSDN sample application. It keeps the duplicates, one copy of each. –

+0

Thanks, I'll test it now. Have you tested it? –

+1

Yes, but I'm trying to improve it thanks to your comments; still working on it. – Ouarzy

-2

There is a problem with how the unique records are concatenated. Could you try the code below?

public static void UnionFiles()
{
    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http");
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat");
    var union = new List<string>();

    foreach (string filePath in Directory
        .EnumerateFiles(folderPath, "*.txt")
        .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        var filter = File.ReadAllLines(filePath).Where(x => !union.Contains(x)).ToList();
        union.AddRange(filter);
    }
    File.WriteAllLines(outputFilePath, union);
}

+0

Same error, I'm losing information. –

1

A solution that uses a regex and is not a hack:

var logBlockPattern = new Regex(@"(^---.*ping statistics ---$)\s+" 
           + @"(^.+packets transmitted.+packets received.+packet loss$)\s+" 
           + @"(^round-trip min/avg/max.+$)\s+" 
           + @"(^\d+$)\s*" 
           + @"(^PING.+$)?", 
           RegexOptions.Multiline); 

var logBlocks1 = logBlockPattern.Matches(FileContent1).Cast<Match>().ToList(); 
var logBlocks2 = logBlockPattern.Matches(FileContent2).Cast<Match>().ToList(); 

var mergedLogBlocks = logBlocks1.Concat(logBlocks2.Where(lb2 => 
    logBlocks1.All(lb1 => lb1.Groups[4].Value != lb2.Groups[4].Value))); 

var mergedLogContents = string.Join("\n\n", mergedLogBlocks); 

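The Groups collection of a regex Match contains each line of a log block (because every line in the pattern is wrapped in parentheses ()) plus the full match at index 0. The match group at index 4 is therefore the timestamp, which we can use to compare log blocks.

Working example: https://dotnetfiddle.net/kAkGll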
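
The group indexing can be checked on a single block. This is a minimal sketch that reuses the pattern from the answer; `GroupDemo` and `ExtractTimestamp` are hypothetical names, and the input is assumed to use "\n" line endings with no trailing whitespace on the lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns the timestamp (capture group 4) of the first log block found,
    // or an empty string when the text contains no complete block.
    public static string ExtractTimestamp(string text)
    {
        var pattern = new Regex(
            @"(^---.*ping statistics ---$)\s+"
            + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
            + @"(^round-trip min/avg/max.+$)\s+"
            + @"(^\d+$)\s*"
            + @"(^PING.+$)?",
            RegexOptions.Multiline);

        var match = pattern.Match(text);
        return match.Success ? match.Groups[4].Value : string.Empty;
    }

    public static void Main()
    {
        var block = "--- 10.0.0.6 ping statistics ---\n"
                  + "50 packets transmitted, 50 packets received, 0% packet loss\n"
                  + "round-trip min/avg/max = 29.535/65.768/126.983 ms\n"
                  + "1466786391\n"
                  + "PING 10.0.0.6 (10.0.0.6): 56 data bytes\n";

        Console.WriteLine(ExtractTimestamp(block)); // prints 1466786391
    }
}
```

Groups 1 to 3 hold the header, the packet summary and the round-trip line in the same way, and group 5 holds the optional PING line.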

+0

Thank you very much! A good solution. –
