C# regex: get URL and text from multiple "a href" tags

I want to be able to scrape a web page containing multiple "<a href" tags and return a structured collection of them.

<div> 
    <p>Lorem ipsum... <a href="https://stackoverflow">Classic link</a> 
     <a title="test" href=http://sloppy-html-5-href.com>I lovez HTML 5</a> 
    </p> 
    <a class="abc" href='/my-tribute-to-javascript.html'>I also love JS</a> 
    <iframe width="420" height="315" src="http://www.youtube.com/embed/JVPT4h_ilOU" 
     frameborder="0" allowfullscreen></iframe><!-- Don't catch me! --> 
</div>

So I want these values:

https://stackoverflow | Classic link
http://sloppy-html-5-href.com | I lovez HTML 5
/my-tribute-to-javascript.html | I also love JS

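As you can see, only the values of "a href" should be caught, with both the link and the content inside the tag. It should support every href that is valid in HTML 5. The href attribute can be surrounded by any other attributes.

So basically I want a regular expression to fill in the following code: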

public IEnumerable<Tuple<string, string>> GetLinks(string html) {
    string pattern = string.Empty; // TODO: Get solution from Stackoverflow
    var matches = Regex.Matches(html, pattern);

    foreach (Match match in matches) {
        // Groups[0] is the whole match; the captures start at index 1
        yield return new Tuple<string, string>(
            match.Groups[1].Value, match.Groups[2].Value);
    }
}

2011-11-08 Seb Nilsson

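_"TODO: Get solution from Stackoverflow"_ - really? How about "TODO: Try to work out a solution, and check StackOverflow if I get stuck"? – nnnnnn

@nnnnnn Got it, no joking allowed... Very constructive comment. –

My apologies, of course joking is allowed. In my sleep-deprived state I didn't realise it was a joke, or I wouldn't have commented. (I do sometimes post "What have you tried so far?" type comments, but to be fair your question gives plenty of detail about your requirements plus some code, so it doesn't fit the usual "do my work for me" question.) – nnnnnn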

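I have always read that parsing HTML with regular expressions is evil. Well... that's certainly true...
But like evil, regular expressions are so much fun :)
So I was tempted to try this one: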

Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>"); 

foreach (Match match in r.Matches(html)) 
    yield return new Tuple<string, string>(
     match.Groups["href"].Value, match.Groups["value"].Value);
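Note that the pattern above requires a quote right after href=, so the unquoted link in the question's sample would not be matched. Below is a minimal sketch of a more tolerant variant, not part of the original answer. The class name LinkExtractor, the field name AnchorPattern and the group name "text" are made up for illustration. It assumes attribute values never contain ">" and that anchors are not nested, so it is still a regex that odd markup can fool.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // The href value may be double-quoted, single-quoted, or unquoted.
    // .NET lets several groups share one name, so all three alternatives
    // feed the same "href" group.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<href>[^""]*)""|'(?<href>[^']*)'|(?<href>[^\s>""']+))[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        foreach (Match match in AnchorPattern.Matches(html))
        {
            // Item1 is the URL, Item2 the link text with outer whitespace trimmed
            yield return Tuple.Create(
                match.Groups["href"].Value,
                match.Groups["text"].Value.Trim());
        }
    }
}
```

RegexOptions.Singleline lets the link text span several lines, and "<a\b" keeps tags such as iframe or abbr from being picked up. Any markup inside the link text is returned as-is.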

2011-11-08 10:56:19 PierrOz

Wouldn't it be easier to use the Html Agility Pack and XPath than a regex?

It would be something like

var webGet = new HtmlWeb();
var document = webGet.Load(url);
// SelectNodes takes an XPath expression; it returns null when nothing matches
var aNodeCollection = document.DocumentNode.SelectNodes("//a[@href]");

foreach (HtmlNode node in aNodeCollection)
{
    var href = node.Attributes["href"].Value;
    var text = node.InnerText;
}

It's pseudocode.

2011-11-08 10:35:04 WKordos

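Interesting approach, but the question specifically says HTML 5, which is not necessarily valid XML. –

I haven't had time to dive into HTML5 yet, so I didn't know it allows malformed documents (that seems like a step backwards). I'd still give it a try, though: the Agility Pack has worked for me even with nasty HTML, and it cleans it up nicely. – WKordos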
