Find duplicated columns and rename them with a count

I have a tab-delimited file that has duplicated header names:

[Column1] \t [Column2] \t [test] \t [test] \t [test] \t [test] \t [Column3] \t [Column4]

What I want to do is rename the duplicated [test] columns by appending an integer, so the header would become something like

[Column1] \t [Column2] \t [test1] \t [test2] \t [test3] \t [test4] \t [Column3] \t [Column4]

So far I can isolate the first row and count the matches I find:

// Requires: using System.IO; using System.Text.RegularExpressions;
string destinationUnformmatedFileName = @"C:\New\20130816_Opportunities_unFormatted.txt";
string destinationFormattedFileName = @"C:\New\20130816_Opportunities_Formatted.txt";

// Open the unformatted file for reading and create the formatted file for writing
using (var sr = new StreamReader(File.Open(destinationUnformmatedFileName, FileMode.Open, FileAccess.Read)))
using (var sw = new StreamWriter(File.Open(destinationFormattedFileName, FileMode.Create, FileAccess.Write)))
{
    int rowCounter = 0;
    string currentRow;
    // Read each row in the unformatted file
    while ((currentRow = sr.ReadLine()) != null)
    {
        // First row: check for duplicate names
        if (rowCounter == 0)
        {
            // Split the header row into column names
            string[] fieldNames = currentRow.Split('\t');

            foreach (string fieldName in fieldNames)
            {
                // fieldName must be followed by a tab for it to be a duplicate
                // original code - causing the issue
                //Regex rgx = new Regex("\\t(" + fieldName + ")\\t");
                // Edit - resolved the issue
                Regex rgx = new Regex("(?<=\\t|^)(" + fieldName + ")(\\t)+");

                // Count how many occurrences of fieldName remain in currentRow
                int count = rgx.Matches(currentRow).Count;

                // If we have a duplicate field name
                if (count > 1)
                {
                    // The lookbehind does not consume the leading tab,
                    // so only the trailing tab has to be written back
                    string newFieldName = fieldName + count.ToString() + "\t";
                    // Replace only the first remaining occurrence
                    currentRow = rgx.Replace(currentRow, newFieldName, 1);
                }
            }
        }
        sw.WriteLine(currentRow); // write the (possibly renamed) row to the formatted file
        rowCounter++;
    }
}

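I think I'm on the right track, but I don't think the regular expression is working correctly?

Edit: I think I've figured out how to find the matches, using this pattern: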

Regex rgx = new Regex("(?<=\\t|^)(" + fieldName + ")(\\t)+");

It's not a deal breaker, but the only remaining problem is that it now labels the columns as

[Column1] \t [Column2] \t [test4] \t [test3] \t [test2] \t [test] \t [Column3] \t [Column4]

instead of

[Column1] \t [Column2] \t [test1] \t [test2] \t [test3] \t [test4] \t [Column3] \t [Column4]

2013-08-26 Chris Hillman

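"I don't think the regular expression is working correctly" sounds like you're not even sure whether there is a problem. What isn't working? Do you get an exception? Wrong results? No results? Also, you may want to use a verbatim string for your pattern to avoid double escaping: @"\t(". Secondly, you should run fieldName through Regex.Escape() before concatenating it into the pattern, because it may contain metacharacters.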
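To illustrate the two suggestions in this comment, here is a minimal sketch. The class and method names (HeaderPattern, ForField, CountOccurrences) are made up for the example and do not come from the question; the pattern itself is the one from the question's edit, written as a verbatim string with the field name escaped.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HeaderPattern
{
    // Builds the question's pattern, but escapes the field name first so that
    // characters such as '[' or '.' are matched literally.
    public static Regex ForField(string fieldName)
    {
        // Verbatim strings: \t reaches the regex engine as '\' + 't',
        // which the engine itself interprets as a tab.
        return new Regex(@"(?<=\t|^)(" + Regex.Escape(fieldName) + @")(\t)+");
    }

    // Counts how often fieldName occurs as a tab-terminated field in headerRow.
    public static int CountOccurrences(string headerRow, string fieldName)
    {
        return ForField(fieldName).Matches(headerRow).Count;
    }
}
```

Without the call to Regex.Escape, a header that is literally named [test] would be read as a character class (one of t, e or s) rather than as the six-character name.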

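Regarding your edit: the problem was that matches can never overlap. Because you required a \t both before and after the field name, the matches for adjacent fields would have had to overlap (they would share the tab between them). The lookbehind is a good way around that. Also, please post your solution as an answer (and accept it, if you don't get a better one).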
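The overlap problem described in this comment can be reproduced with a small sketch. The names OverlapDemo, CountConsuming and CountWithLookbehind are hypothetical and only serve the illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class OverlapDemo
{
    // Both tabs are consumed by each match, so two neighbouring
    // fields cannot both claim the tab that separates them.
    public static int CountConsuming(string row, string name)
    {
        return Regex.Matches(row, @"\t" + Regex.Escape(name) + @"\t").Count;
    }

    // The leading tab is only asserted by the lookbehind, not consumed,
    // so it stays available as the trailing tab of the previous match.
    public static int CountWithLookbehind(string row, string name)
    {
        return Regex.Matches(row, @"(?<=\t|^)" + Regex.Escape(name) + @"\t+").Count;
    }
}
```

For the row "a\ttest\ttest\ttest\ttest\tb", CountConsuming finds 2 of the 4 test fields, while CountWithLookbehind finds all 4.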

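Thanks m.buettner - I've posted the answer, but I need to wait 2 days before I can accept it. I feel bad for wasting people's time; I should have waited a while and researched it some more. Thanks for your help!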

I resolved it using the following pattern:

Regex rgx = new Regex("(?<=\\t|^)(" + fieldName + ")(\\t)+");

The fix uses a lookaround, which I found here: http://www.regular-expressions.info/duplicatelines.html

I probably should have spent a few more minutes researching it before posting.
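The pattern above still numbers the columns in reverse (test4, test3, test2, test), as noted in the question. One possible way to get left-to-right numbering is to skip the regex and work on the split header instead. The following is only a sketch under assumptions: HeaderRenamer and NumberDuplicates are made-up names, the header names are assumed to be plain text without the brackets used in the question's illustration, and no existing column is assumed to already be called test1, test2 and so on.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class HeaderRenamer
{
    // Appends 1, 2, 3, ... to every header name that occurs more than once,
    // numbering the occurrences from left to right. Unique names are untouched.
    public static string NumberDuplicates(string headerRow)
    {
        string[] names = headerRow.Split('\t');

        // Names that appear more than once in the header.
        var duplicated = new HashSet<string>(
            names.GroupBy(n => n).Where(g => g.Count() > 1).Select(g => g.Key));

        // How many occurrences of each duplicated name have been seen so far.
        var seen = new Dictionary<string, int>();
        for (int i = 0; i < names.Length; i++)
        {
            string name = names[i];
            if (!duplicated.Contains(name)) continue;

            int n;
            seen.TryGetValue(name, out n); // n stays 0 if the name is new
            seen[name] = ++n;
            names[i] = name + n;
        }
        return string.Join("\t", names);
    }
}
```

Only the first row of the file would be passed through NumberDuplicates; all other rows can be written out unchanged.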

2013-08-27 00:40:02

Test your regex in RegExr first. I think "\t" is a special character. Try "\\t". In your C# that would be "\\\\t".

2013-08-27 00:27:45

He did, and it doesn't matter anyway. The regex engine can handle an actual tab character as well as an escaped \t.
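The point made in this comment can be checked with a short sketch. The class name TabEscapes and its members are hypothetical; the Regex calls are the standard .NET ones.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class TabEscapes
{
    const string Row = "a\tb"; // contains one real tab character

    // "\t" is a literal tab inside the C# string; the engine matches it as-is.
    public static bool LiteralTab()
    {
        return Regex.IsMatch(Row, "\t");
    }

    // "\\t" and @"\t" are the same two characters, '\' and 't';
    // the regex engine interprets that escape as a tab.
    public static bool EscapedTab()
    {
        return Regex.IsMatch(Row, "\\t") && Regex.IsMatch(Row, @"\t");
    }

    // "\\\\t" is the pattern \\t: a literal backslash followed by 't'.
    // It does not match a tab.
    public static bool DoubleEscaped()
    {
        return Regex.IsMatch(Row, "\\\\t");
    }
}
```

LiteralTab and EscapedTab both return true for this row, whereas DoubleEscaped returns false, so the "\\\\t" suggested in the answer would stop the pattern from matching tabs.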

Here is a combination of Regex and LINQ:

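// Requires: using System.Linq; using System.Text.RegularExpressions;
var input = @"[Column1] \t [Column2] \t [test] \t [test] \t [test] \t [foo] \t [test] \t [Column3] \t [foo] \t [Column4]";
// The sample input is a verbatim string, so it contains a literal backslash + 't'.
// Match a bracketed name preceded by that "\t" and any spaces after it.
Regex reg = new Regex(@"(?<=\\t\s*)\[(.+?)\]");
string output = "";
int k = 0; // position just after the previously copied match
foreach (var m in reg.Matches(input)
        .OfType<Match>()
        .Select((x, i) => new { x, i })                             // remember each match's position
        .GroupBy(g => g.x.Value)                                    // group identical names
        .Where(g => g.Count() > 1)                                  // keep only duplicated names
        .SelectMany(g => g.Select((a, i) => new { a, i = i + 1 })) // number within each group
        .OrderBy(x => x.a.i))                                       // restore left-to-right order
{
    // Copy the text before this match, then the renamed field
    output += input.Substring(k, m.a.x.Index - k) + m.a.x.Result("[${1}" + m.i + "]");
    k = m.a.x.Index + m.a.x.Length;
}
output += input.Substring(k); // copy whatever follows the last match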

Result: [Column1] \t [Column2] \t [test1] \t [test2] \t [test3] \t [foo1] \t [test4] \t [Column3] \t [foo2] \t [Column4]

2013-08-27 03:55:53
