2012-06-13

Running into an issue trying to extract links from an HtmlNode using HtmlAgilityPack

This is a follow-up to a previous question I had.

I got the very excellent link parsing code from here.

So I have HTML of the following form:

<html> 
    <head> 
     RANDOM JAVASCRIPT AND CSS AHHHHHH!!!!!!!! 
    </head> 
    <body> 
     <a href="/Random/link/here">Random</a> 
     <a href="/Random/link/here">Random</a> 
     <a href="/Random/link/here">Random</a> 
     <a href="/Random/link/here">Random</a> 
     <a href="/Random/link/here">Random</a> 
     <a href="/Random/link/here">Random</a> 
     <table class="table"> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
     </table> 
    </body> 
</html> 

And I have the following code, which is intended to pull out the table and then extract the links it contains:

using System; 
using System.Collections.Generic; 
using System.Net; 
using HtmlAgilityPack; 

public class MainClass 
{ 
    public static void Main(String[] args) 
    { 
     string url = args[0]; // args does not include the program name in C#, so the first argument is args[0] 
     Extractinfo pageScrape = new Extractinfo(); 
     pageScrape.RenderPage(url); 
    } 
} 
public class Extractinfo 
{ 
    public HtmlDocument RenderPage(string url) 
    { 
     try 
     { 
       HtmlDocument pageSource = new HtmlDocument(); 
       var webGet = new HtmlWeb(); 
       pageSource = webGet.Load(url); 

       ExtractLinks(pageSource); 
       return pageSource; 
     } 
     catch (WebException e) 
     { 
      Console.WriteLine(e.Message + ": " + e.StackTrace); 
      return null; // nothing was loaded 
     } 
    } 

    private List<string> ExtractHrefTags(HtmlNode htmlSnippet) 
     { 
      List<string> hrefTags = new List<string>(); 

      foreach (HtmlNode link in htmlSnippet.SelectNodes("//a[@href]")) 
      { 
       HtmlAttribute att = link.Attributes["href"]; 
       hrefTags.Add(att.Value); 
      } 

      return hrefTags; 
     } 

     public void ExtractLinks(HtmlDocument pagesource) 
     { 

      var elements = pagesource.DocumentNode.SelectNodes("//table[@class='table']"); 
      List<string> hrefTags = new List<string>(); 
      foreach (var ele in elements) 
      { 
       hrefTags = ExtractHrefTags(ele); 
      } 
     } 
} 

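Now, instead of getting only the links inside <table class="table">*****</table>, this code puts every link on the page into the List hrefTags. What am I doing wrong here? How do I fix this so that the only links extracted are the ones that live inside <table class="table">*****</table>?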

Thanks for your help!

Answers

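You need to add a "." to the start of your XPath so that it matches only the descendants of the table node. An expression that starts with "//" is evaluated from the document root no matter which node you call SelectNodes on, whereas ".//" is relative to the current node: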

htmlSnippet.SelectNodes(".//a[@href]") 
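
To make the difference concrete, here is a minimal sketch that runs both expressions against the same table node. The class name ScopedLinkDemo, the helper HrefsUnder, and the inline HTML string are made up for illustration and are not part of the original code; the HTML is a cut-down version of the page in the question with one link outside the table and one inside it.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ScopedLinkDemo
{
    // Hypothetical helper: collects href values of the anchors that the
    // given XPath expression selects when evaluated from `node`.
    public static List<string> HrefsUnder(HtmlNode node, string xpath)
    {
        var result = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = node.SelectNodes(xpath);
        if (anchors == null)
        {
            return result;
        }
        foreach (HtmlNode anchor in anchors)
        {
            result.Add(anchor.GetAttributeValue("href", ""));
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<html><body>" +
            "<a href='/Random/link/here'>Random</a>" +
            "<table class='table'><tr><td>" +
            "<a href='/subdir/members/Name'>Name</a>" +
            "</td></tr></table>" +
            "</body></html>");

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[@class='table']");

        // "//a[@href]" starts at the document root, so the link outside the table is included.
        Console.WriteLine(string.Join(", ", HrefsUnder(table, "//a[@href]")));

        // ".//a[@href]" starts at the table node, so only links inside the table are returned.
        Console.WriteLine(string.Join(", ", HrefsUnder(table, ".//a[@href]")));
    }
}
```

The null check is there because SelectNodes gives back null rather than an empty collection when the table contains no links, which would otherwise make the foreach throw.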

I feel like a complete idiot... I stared at this for like 30 minutes and couldn't figure it out. Thanks! – gfppaste