2010-08-03

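Best way to combine nodes with Html Agility Pack

I've converted a large document from Word to HTML. The result is close, but I have a bunch of "code" nodes that I'd like to merge into a single "pre" node.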

Here's the input:

<p>Here's a sample MVC Controller action:</p> 
<code>  public ActionResult Index()</code> 
<code>  {</code> 
<code>   return View();</code> 
<code>  }</code> 
<p>We'll start by making the following changes...</p> 

I'd like to turn it into this instead:

<p>Here's a sample MVC Controller action:</p> 
<pre class="brush: csharp">  public ActionResult Index() 
    { 
     return View(); 
    }</pre> 
<p>We'll start by making the following changes...</p> 

I ended up writing a brute-force loop that iterates over the nodes looking for consecutive ones, but it looks ugly to me:

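using System;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(file);

var nodes = doc.DocumentNode.ChildNodes;
string contents = string.Empty;

foreach (HtmlNode node in nodes)
{
    if (node.Name == "code")
    {
        // Accumulate the text of each consecutive <code> node.
        contents += node.InnerText + Environment.NewLine;

        // The run continues if the next sibling is <code>, or a single text
        // node (typically whitespace) followed by <code>. NextSibling is null
        // at the end of the document, hence the null-conditional access.
        var next = node.NextSibling;
        bool runContinues = next?.Name == "code" ||
            (next?.Name == "#text" && next.NextSibling?.Name == "code");

        if (!runContinues)
        {
            // Turn the last <code> of the run into the merged <pre>.
            node.Name = "pre";
            node.Attributes.RemoveAll();
            node.SetAttributeValue("class", "brush: csharp");
            node.InnerHtml = contents;
            contents = string.Empty;
        }
    }
}

// Remove the leftover <code> nodes. SelectNodes returns null when nothing
// matches, so guard before iterating.
nodes = doc.DocumentNode.SelectNodes(@"//code");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        node.Remove();
    }
}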

Normally I'd remove the nodes in the first loop, but that doesn't work here, because you can't change a collection while iterating over it.
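For reference, that restriction is usually sidestepped by iterating over a snapshot of the collection and removing from the original. The sketch below uses a plain List<string> as a stand-in for the node collection; the class and method names are hypothetical and are not part of Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotRemoval
{
    // Removes every "code" entry from the list while looping over it.
    // ToList() copies the list first, so the foreach enumerates the copy
    // and never sees the original list change underneath it.
    public static void RemoveCodeEntries(List<string> tags)
    {
        foreach (var tag in tags.ToList())
        {
            if (tag == "code")
                tags.Remove(tag); // removes the first remaining "code" entry
        }
    }
}
```

With real nodes, the analogous move would be to copy the node collection before the loop and call Remove() on each node inside it.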

Any better ideas?

Answer

First approach: select all the <code> nodes, group them, and create one <pre> node per group:

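var idx = 0;
var nodes = doc.DocumentNode
    .SelectNodes("//code") // note: returns null if the document has no <code> nodes
    .GroupBy(n => new {
        Parent = n.ParentNode,
        // Same index while the run continues; bump it after the run's last node.
        Index = n.NextSiblingIsCode() ? idx : idx++
    })
    .ToList(); // materialize the groups before the loop below mutates the document

foreach (var group in nodes)
{
    // One <pre> per run, holding the run's lines joined by newlines.
    var pre = HtmlNode.CreateNode("<pre class='brush: csharp'></pre>");
    pre.AppendChild(doc.CreateTextNode(
        string.Join(Environment.NewLine, group.Select(g => g.InnerText))
    ));
    group.Key.Parent.InsertBefore(pre, group.First());

    foreach (var code in group)
        code.Remove();
}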

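The grouping key here combines the parent node with a group index that is incremented whenever a new group is found. I also use the NextSiblingIsCode extension method here: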

// Extension methods must live in a static class; the class name is arbitrary.
public static class HtmlNodeExtensions
{
    public static bool NextSiblingIsCode(this HtmlNode node)
    {
        return (node.NextSibling != null && node.NextSibling.Name == "code") ||
            (node.NextSibling is HtmlTextNode &&
             node.NextSibling.NextSibling != null &&
             node.NextSibling.NextSibling.Name == "code");
    }
}

It determines whether the next sibling is a <code> node, skipping over a single text node in between.
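To see how the counter-based key splits consecutive runs, here is a minimal sketch on plain data rather than real HTML nodes. The tuple list stands in for sibling nodes, and RunGrouping and GroupCodeRuns are hypothetical names chosen for illustration. As a simplification, "next is code" looks only at the immediately following item and does not skip text nodes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RunGrouping
{
    // Returns the text of each run of consecutive "code" items.
    public static List<List<string>> GroupCodeRuns(IList<(string Tag, string Text)> items)
    {
        var idx = 0;
        return items
            // Record, for every item, whether the item after it is "code".
            .Select((item, i) => new
            {
                item,
                NextIsCode = i + 1 < items.Count && items[i + 1].Tag == "code"
            })
            .Where(x => x.item.Tag == "code")
            // Items in the middle of a run share the current index; the last
            // item of a run also gets it, then the counter moves on.
            .GroupBy(x => x.NextIsCode ? idx : idx++)
            .Select(g => g.Select(x => x.item.Text).ToList())
            .ToList();
    }
}
```

For the items p, code "a", code "b", p, code "c", the keys come out as 0, 0, 1, giving one group with "a" and "b" and another with "c".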


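Second approach: select only the first <code> node of each group, then walk forward from each one through its following siblings until the first node that is not a <code> node. I used XPath here: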

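// Select each <code> whose nearest preceding element sibling is not <code>,
// i.e. the head of each run. SelectNodes returns null if nothing matches.
var nodes = doc.DocumentNode.SelectNodes(
    "//code[name(preceding-sibling::*[1])!='code']"
);
foreach (var node in nodes)
{
    var pre = HtmlNode.CreateNode("<pre class='brush: csharp'></pre>");
    node.ParentNode.InsertBefore(pre, node);
    var content = string.Empty;
    var next = node;
    do
    {
        content += next.InnerText + Environment.NewLine;
        var previous = next;
        // Look up the following element sibling before removing the current
        // node; the result is null when that sibling is not <code>.
        next = next.SelectSingleNode("following-sibling::*[1][name()='code']");
        previous.Remove();
    } while (next != null);
    // Drop the trailing newline added after the last line.
    pre.AppendChild(doc.CreateTextNode(
        content.TrimEnd(Environment.NewLine.ToCharArray())
    ));
}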
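
The head-then-walk idea can also be sketched without the library, which makes the control flow easier to follow. In this illustration the tuple list again stands in for sibling elements, and HeadWalk and MergeRuns are hypothetical names, not part of Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;

public static class HeadWalk
{
    // Returns one merged string per run of consecutive "code" items.
    public static List<string> MergeRuns(IList<(string Tag, string Text)> items)
    {
        var merged = new List<string>();
        for (var i = 0; i < items.Count; i++)
        {
            // A run head is a "code" item whose predecessor is not "code".
            var isHead = items[i].Tag == "code" &&
                         (i == 0 || items[i - 1].Tag != "code");
            if (!isHead)
                continue;

            // Walk forward from the head until the first non-"code" item.
            var lines = new List<string>();
            for (var j = i; j < items.Count && items[j].Tag == "code"; j++)
                lines.Add(items[j].Text);

            merged.Add(string.Join("\n", lines));
        }
        return merged;
    }
}
```

Collecting the lines in a list and joining them once avoids having to trim a trailing newline afterwards.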