Parsing hyperlinks from a web page

I have written the code below to parse the hyperlinks from a given page.

    WebClient web = new WebClient();
    string html = web.DownloadString("http://www.msdn.com");
    string[] separators = new string[] { "<a ", ">" };
    List<string> hyperlinks = html.Split(separators, StringSplitOptions.None).Select(s =>
    {
        if (s.Contains("href"))
            return s;
        else
            return null;
    }).ToList();

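The string splitting still needs tweaking before it returns the URLs cleanly, though. My question is whether there is some data structure, along the lines of XmlReader, that can read an HTML string efficiently.

Any suggestions for improving the code above would also be helpful.

Thanks for your time.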

Source

2012-09-26 Abhijeet

Hi, just wondering: did you find any of the answers to your question useful? – Thousand

@Thousand Your answer is correct. Thanks. – Abhijeet

Try HtmlAgilityPack:

    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load("http://www.msdn.com");
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        Console.WriteLine(link.GetAttributeValue("href", null));
    }

This prints the URL of every link on the page.

If you want to store the links in a list:

var linkList = doc.DocumentNode.SelectNodes("//a[@href]") 
       .Select(i => i.GetAttributeValue("href", null)).ToList();

Source

2012-09-26 22:15:43 Thousand

You should use a parser. The most widely used one is HtmlAgilityPack. With it, you can interact with the HTML as a DOM.

Source

2012-09-26 21:50:56

Refactored:

    var html = new WebClient().DownloadString("http://www.msdn.com");
    var separators = new[] { "<a ", ">" };
    var hyperlinks = html.Split(separators, StringSplitOptions.None).Select(s => s.Contains("href") ? s : null).ToList();
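One caveat with the Split-based snippets in this thread: Select maps every fragment that lacks "href" to null rather than dropping it, so the list ends up mostly nulls. A minimal sketch that filters with Where instead might look like the following. The class and method names (SplitLinks, AnchorFragments) are made up for illustration, and the HTML is passed in as a string so nothing is downloaded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitLinks
{
    // Hypothetical helper: returns only the fragments that mention "href".
    // This is still plain string splitting, not real HTML parsing.
    public static List<string> AnchorFragments(string html)
    {
        var separators = new[] { "<a ", ">" };
        return html.Split(separators, StringSplitOptions.RemoveEmptyEntries)
                   .Where(s => s.Contains("href"))   // drop fragments instead of mapping them to null
                   .ToList();
    }
}
```

Each result is still a raw attribute fragment such as href="/x", not a clean URL, and any ordinary text that happens to contain "href" would match as well, which is why the other answers suggest a parser.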

Source

2012-09-26 21:52:22

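Assuming you are dealing with well-formed XHTML, you can simply treat the text as an XML document. The framework is loaded with functionality that does exactly what you are asking for.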

http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx

Does .NET framework offer methods to parse an HTML string?
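As a rough sketch of that idea, assuming the input really is well-formed XML, something like the following could pull out the href values using only the framework. It uses XDocument from System.Xml.Linq rather than the XmlDocument class linked above, and the names XhtmlLinks and Extract are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlLinks
{
    // Hypothetical helper: returns the href of every <a> element, in document order.
    // XDocument.Parse throws XmlException if the input is not well-formed XML.
    public static List<string> Extract(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants()
                  .Where(e => e.Name.LocalName == "a")        // match <a> with or without the XHTML namespace
                  .Select(e => (string)e.Attribute("href"))   // null when the attribute is missing
                  .Where(href => href != null)
                  .ToList();
    }
}
```

Most real pages are not well-formed XML (unclosed tags, unquoted attributes, named entities such as &nbsp; without a DTD), so expect an XmlException on typical HTML; this only fits documents you know to be clean XHTML.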

Source

2012-09-26 22:15:18 Doug

If installing new tools is not feasible, then you have shown a good approach. Thanks. – Abhijeet
