2012-08-28 80 views
1

Parsing a table with HTML Agility Pack

In the HTML below, I can parse the table element, but I don't know how to skip the th elements.

I want only the td elements, but when I try to use:

foreach (HtmlNode cell in row.SelectNodes("td")) 

...I get an exception.

<table class="tab03"> 
    <tbody> 
    <tr> 
     <th class="right" rowspan="2">first</th> 
    </tr> 
    <tr> 
     <th class="right">lp</th> 
     <th class="right">name</th> 
    </tr> 
    <tr> 
     <td class="right">1</td> 
     <td class="left">house</td> 
    </tr> 
    <tr> 
     <th class="right" rowspan="2">Second</th> 
    </tr> 
    <tr> 
     <td class="right">2</td> 
     <td class="left">door</td> 
    </tr> 
    </tbody> 
</table> 

My code:

var document = doc.DocumentNode.SelectNodes("//table");
string store = "";

if (document != null)
{
    foreach (HtmlNode table in document)
    {
        if (table != null)
        {
            foreach (HtmlNode row in table.SelectNodes("tr"))
            {
                store = "";
                foreach (HtmlNode cell in row.SelectNodes("th|td"))
                {
                    store = store + cell.InnerText + "|";
                }

                sw.Write(store);
                sw.WriteLine();
            }
        }
    }
}

sw.Flush();
sw.Close();
+2

What exception? –

Answers

3

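This approach uses LINQ to query for HtmlNode instances whose name is td.

I also noticed that your output appears as val|val| (with a trailing pipe). This example uses string.Join(pipe, array) as a cleaner way to avoid the trailing pipe: val|val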

using System; // StringComparison
using System.Linq;
using HtmlAgilityPack;

// ...

var tablecollection = doc.DocumentNode.SelectNodes("//table");
string store = string.Empty;

if (tablecollection != null)
{
    foreach (HtmlNode table in tablecollection)
    {
        // For all rows with at least one descendant with the 'td' tag.
        foreach (HtmlNode row in table.DescendantNodes()
            .Where(desc =>
                desc.Name.Equals("tr", StringComparison.OrdinalIgnoreCase) &&
                desc.DescendantNodes().Any(child => child.Name.Equals("td",
                    StringComparison.OrdinalIgnoreCase))))
        {
            // Collect the text of the row's 'td' elements and join it with the
            // pipe to create the output in 'val|val|val' format.
            store = string.Join("|", row.DescendantNodes().Where(desc =>
                desc.Name.Equals("td", StringComparison.OrdinalIgnoreCase))
                .Select(desc => desc.InnerText));

            // You can probably get rid of the 'store' variable as it's
            // no longer necessary to store the value of the table's
            // cells over the iteration.
            sw.Write(store);
            sw.WriteLine();
        }
    }
}

sw.Flush();
sw.Close();
3

Your XPath syntax is incorrect. Try:

HtmlNode cell in row.SelectNodes("//td") 

This gives you a collection of td elements that you can iterate over with foreach.

+0

With this suggestion I get: 1 | house | 2 | door, but I want the "td" cells of each row on their own line, one row below the other. – Wojciech
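
The single-line result described in the comment above is consistent with how XPath handles a leading //: the expression is evaluated from the document root, so row.SelectNodes("//td") returns every td in the document instead of only those in the current row. Separately, SelectNodes returns null rather than an empty collection when nothing matches, so a foreach over the td cells of a header-only row will throw unless the result is checked first. The following is a minimal sketch, not a definitive fix: it assumes HtmlAgilityPack is referenced, and the inline HTML string and console output are stand-ins for the asker's document and stream writer.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical inline sample, shortened from the table in the question.
        const string html = @"
<table class=""tab03"">
  <tbody>
    <tr><th class=""right"">lp</th><th class=""right"">name</th></tr>
    <tr><td class=""right"">1</td><td class=""left"">house</td></tr>
    <tr><td class=""right"">2</td><td class=""left"">door</td></tr>
  </tbody>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so check before iterating.
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null) return;

        foreach (HtmlNode table in tables)
        {
            // ".//tr" is relative to this table and also finds rows inside tbody.
            var rows = table.SelectNodes(".//tr");
            if (rows == null) continue;

            foreach (HtmlNode row in rows)
            {
                // "td" selects only the direct td children of this row.
                var cells = row.SelectNodes("td");
                if (cells == null) continue; // header-only row, nothing to write

                // One output line per row, with no trailing pipe.
                Console.WriteLine(string.Join("|", cells.Select(c => c.InnerText.Trim())));
            }
        }
    }
}
```

For the sample above this should write 1|house and 2|door on separate lines, skipping the header row. Replacing Console.WriteLine with sw.WriteLine would send the same lines to the writer used in the question.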