Split a large XML file after a string is found

I have a large XML file with nearly 1 million lines of content. Sample content:

<etc35yh3 etc="numbers" etc234="a" etc345="date"><something><some more something></some more something></something></etc123> 
<etc123 etc="numbers" etc234="a" etc345="date"><something><some more something></some more something></something></etc123> 
<etc15y etc="numbers" etc234="a" etc345="date"><something><some more something></some more something></something></etc123>

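^ This repeats for roughly 900K lines (with the content changing, of course).

What I need:

Search the XML file for "<etc123". Once it is found, move (write) that line and all lines below it to a separate XML file.

Would it be advisable to use a method such as File.ReadAllLines for the search part? What would you recommend for the writing part? As far as I can tell, going line by line is not an option because it takes too long.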

Source

2012-10-04 Ray Alex

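'Effectively discarding the content above it.' What does this mean? – Anirudha

Should the resulting file be valid XML? –

@Anirudha As in, ignore it - i.e., do not write it (skip it) –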

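If you want to quite literally discard the content above the search string, I would not use File.ReadAllLines, because it loads the entire file into memory. Try File.Open and wrap it in a StreamReader. Loop on StreamReader.ReadLine until you find the string, then start writing to a new StreamWriter, or do a byte copy on the underlying file stream.

An example of how to do this with StreamWriter/StreamReader alone is shown below.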

using System;
using System.IO;

//load the input file
//open with read and sharing
using (FileStream fsInput = new FileStream("input.txt",
    FileMode.Open, FileAccess.Read, FileShare.Read))
{
    //use streamreader to search for start
    var srInput = new StreamReader(fsInput);
    string searchString = "two";
    string cSearch = null;
    bool found = false;
    while ((cSearch = srInput.ReadLine()) != null)
    {
        if (cSearch.StartsWith(searchString, StringComparison.CurrentCultureIgnoreCase))
        {
            found = true;
            break;
        }
    }
    if (!found)
        throw new Exception("Searched string not found.");

    //we have the data, write to a new file
    //FileMode.Create truncates an existing file; OpenOrCreate would leave stale bytes behind
    using (StreamWriter sw = new StreamWriter(
        new FileStream("out.txt", FileMode.Create, //create or overwrite
            FileAccess.Write, FileShare.None))) // write only, no sharing
    {
        //write the line that we found in the search
        sw.WriteLine(cSearch);

        //copy every remaining line
        string cline = null;
        while ((cline = srInput.ReadLine()) != null)
            sw.WriteLine(cline);
    }
}

//both files are closed and complete

Source

2012-10-04 19:21:19 Mitch

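File.ReadLines would work too. –

+1, as it answers the question as asked... By the way, a byte copy would be hard because one does not know the encoding... Also, StreamReader will not handle a non-UTF8/16 encoding if one is specified in the XML. (I doubt anyone cares.) –

@AlexeiLevenkov, I agree re: byte copy. I had hoped there was a way to seek along with the StreamReader, but I neglected to account for its buffering. Unless the document is very large, or there are significant performance requirements, I would probably stay with StreamReader. – Mitch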
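
The File.ReadLines suggestion from the comments can be sketched roughly as follows. This is a minimal sketch, not the answerer's code: the class name XmlTailSplitter and the method CopyFromMarker are hypothetical, and it assumes the default encoding detection of File.ReadLines is good enough for the file. File.ReadLines enumerates lazily, so the file is streamed rather than loaded whole.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper, named here for illustration only.
static class XmlTailSplitter
{
    // Copies the first line containing `marker`, and every line after it,
    // from inputPath to outputPath.
    // Returns false (and leaves an empty output file) if the marker never appears.
    public static bool CopyFromMarker(string inputPath, string outputPath, string marker)
    {
        // SkipWhile drops lines until the marker shows up, then yields the rest.
        var tail = File.ReadLines(inputPath)
                       .SkipWhile(line => !line.Contains(marker));

        bool wroteAny = false;
        // append: false overwrites any existing output file.
        using (var writer = new StreamWriter(outputPath, false))
        {
            foreach (var line in tail)
            {
                writer.WriteLine(line);
                wroteAny = true;
            }
        }
        return wroteAny;
    }
}
```

A call might look like XmlTailSplitter.CopyFromMarker("input.xml", "out.xml", "<etc123"), where both file names are placeholders.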

You can use LINQ to XML (LINQ2XML):

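using System.Xml.Linq;

// note: XElement.Load reads the whole document into memory
XElement doc = XElement.Load("yourXML.xml");
// an XDocument can have only one root element, so the matches go under a wrapper
// ("matches" is an arbitrary name)
XDocument newDoc = new XDocument(new XElement("matches"));

foreach (XElement elm in doc.DescendantsAndSelf("etc123"))
{
    newDoc.Root.Add(elm);
}

newDoc.Save("yourOutputXML.xml");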

Source

2012-10-04 19:20:36 Anirudha

I will look into this as well - thanks! –

You can do it one line at a time... I would not use ReadToEnd if you are checking the content of each line.

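using System.IO;

FileInfo file = new FileInfo("MyHugeXML.xml");
FileInfo outFile = new FileInfo("ResultFile.xml");

// CreateText/OpenText return a StreamWriter/StreamReader
// (Create/OpenRead return a FileStream, which has no WriteLine/ReadLine)
using (StreamWriter write = outFile.CreateText())
using (StreamReader sr = file.OpenText())
{
    bool foundit = false;
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // start copying at the first matching line
        if (!foundit && line.Contains("<etc123"))
        {
            foundit = true;
        }
        // write the matching line itself and everything after it
        if (foundit)
        {
            write.WriteLine(line);
        }
    }
}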

Note that, depending on your requirements, this approach may not produce valid XML.
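
If the output does need a single root, one possible workaround is to wrap the copied lines in a wrapper element. The sketch below is an illustration under stated assumptions, not part of the original answer: the names WrappedXmlSplitter, CopyFromMarkerWithRoot and rootName are hypothetical, and it assumes every copied line is a complete, well-formed element on its own, which the question does not guarantee.

```csharp
using System.IO;

// Hypothetical helper, named here for illustration only.
static class WrappedXmlSplitter
{
    // Copies the first line containing `marker` and all following lines,
    // surrounded by <rootName> ... </rootName>.
    // Assumption: each copied line is a complete element by itself.
    public static void CopyFromMarkerWithRoot(
        string inputPath, string outputPath, string marker, string rootName)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath, false)) // overwrite
        {
            writer.WriteLine("<" + rootName + ">");

            bool found = false;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Flip the flag on the first matching line, then copy everything.
                if (!found && line.Contains(marker))
                    found = true;
                if (found)
                    writer.WriteLine(line);
            }

            writer.WriteLine("</" + rootName + ">");
        }
    }
}
```

Anything above the marker, including an XML declaration if the input has one, is dropped, so the result still has to be checked against whatever consumes it.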

Source

2012-10-04 19:27:31 iMortalitySX

I will look into this as well - thanks! –
