Searching for strings within a string (finding all hrefs in HTML source code)

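I have a string variable that contains the HTML of an entire web page. The page will contain links to other websites. I want to build a list of all the hrefs (webcrawler-like). What is the best way to do this? Would any extension methods help? What about using Regex?

Thanks in advance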

2011-06-17 Ananth

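Use a DOM parser such as HTML Agility Pack to parse your document and find all the links.

There is a good question here about how to use HTML Agility Pack. Here is a simple example to get you started: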

// Requires "using System.Linq;" and a reference to HtmlAgilityPack.
string html = "your HTML here";

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Collect the href value of every <a> element that has one.
var links = doc.DocumentNode.DescendantNodes()
    .Where(n => n.Name == "a" && n.Attributes.Contains("href"))
    .Select(n => n.Attributes["href"].Value);

2011-06-17 16:22:19 Donut

@Donut: Thank you for pointing me to HTML Agility Pack. I had never used it before; I am exploring it now. – Ananth

I think you will find this answers your question to a T:

http://msdn.microsoft.com/en-us/library/t9e807fx.aspx

2011-06-17 16:22:53

Thanks, that's a nice solution. – Ananth

I would go with a regular expression.

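// Requires "using System.Text.RegularExpressions;".
// Matches href="..." or href='...' and captures the URL in group 1.
Regex exp = new Regex(
    @"href\s*=\s*[""']([^""']*)[""']",
    RegexOptions.IgnoreCase);
string InputText = "your HTML here"; // supply the page's HTML
MatchCollection MatchList = exp.Matches(InputText);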

2011-06-17 16:23:03 therealmitchconnors

Try this regex (it should work):

var matches = Regex.Matches (html, @"href=""(.+?)""");

You can then loop through the matches and extract the captured URLs.
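As a minimal sketch of that loop: the class name `HrefExtractor`, the method name `ExtractHrefs`, and the sample HTML are assumptions made for illustration and are not part of the answer. Note that this pattern only matches a lowercase, double-quoted `href="..."`, so single-quoted or unquoted attributes would be missed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper: returns the text captured by group 1 (the URL)
    // for every href="..." match found in the given HTML string.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match match in Regex.Matches(html, @"href=""(.+?)"""))
        {
            // Groups[0] is the whole match; Groups[1] is the first capture group.
            urls.Add(match.Groups[1].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Made-up sample input, for illustration only.
        string html = "<a href=\"http://example.com/a\">A</a> <a href=\"/b\">B</a>";
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

The lazy quantifier `.+?` stops at the first closing quote, which is what keeps two links on the same line from being merged into one match.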

2011-06-17 16:23:05

@Tim: Thanks, this works. – Ananth

Have you used HtmlAgilityPack? http://htmlagilitypack.codeplex.com/

With it, you can simply use XPath to get all the links on a page and put them into a list.

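private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();

    // SelectNodes returns null when no node matches the XPath expression.
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
    {
        return hrefTags;
    }

    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}

Taken from another post here: Get all links on html page?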

2011-06-17 16:23:29 EvanGWatkins

Thanks. I hadn't looked at HtmlAgilityPack before, but I am now. – Ananth
