Parsing links in text

I have some text that may contain links like this:

<a rel="nofollow" target="_blank" href="http://loremipsum.net/">http://loremipsum.net/</a> 
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, <a rel="nofollow" target="_blank" href="http://loremipsum.net/">http://loremipsum.net/</a> sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

I want to find the links (a tags) in this text. What is the right regular expression pattern?

This pattern does not work:

const string UrlPattern = @"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?"; 
var urlMatches = Regex.Matches(text, UrlPattern);

Thanks

Source

2014-03-12 user3293835

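A regular expression that can parse any and all 'a' tags is going to be a huge, unmaintainable monster of a black box. Is that what you want? – Jon

Would you consider using another solution instead of a regular expression, such as HtmlAgilityPack? If so, you can avoid a lot of pain later on. – samy

This is text that contains only 'a' tags, not HTML. – user3293835

I suggest parsing the HTML with HtmlAgilityPack (available from NuGet):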

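// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Note: SelectNodes returns null when no node matches the XPath.
var links = doc.DocumentNode.SelectNodes("//a[@href]")
       .Select(a => a.Attributes["href"].Value);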

Result:

[ 
    "http://loremipsum.net/", 
    "http://loremipsum.net/" 
]

Recommended reading: Parsing Html The Cthulhu Way

Source

2014-03-12 13:38:04

My name is samy, and I approve of this answer – samy

@samy My name is Sergey, and I say thank you :) –

Maybe like this:

// Group 1 is the opening quote character; group 2 is the href value.
Regex regexObj = new Regex(@"<a.+?href=(['""])(.+?)\1");
string resultString = regexObj.Match(subjectString).Groups[2].Value;

All matches:

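// Requires: using System.Collections.Specialized; using System.Text.RegularExpressions;
StringCollection resultList = new StringCollection();

Regex regexObj = new Regex(@"<a.+?href=(['""])(.+?)\1");
Match matchResult = regexObj.Match(subjectString);
// Walk every match and collect the captured href value (group 2).
while (matchResult.Success) {
    resultList.Add(matchResult.Groups[2].Value);
    matchResult = matchResult.NextMatch();
}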
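
As a rough sketch, the same collection step can also be written with Regex.Matches and LINQ. The class name HrefFinder and the method name FindHrefs are hypothetical, and the pattern is slightly tightened compared with the one above (an assumption, not part of the original answer): it requires whitespace after `<a` so that tags such as `<abbr>` are skipped, and it does not let the search for href run past the end of the tag.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HrefFinder
{
    // Returns the href value of every <a ...> tag found in the input.
    public static List<string> FindHrefs(string subjectString)
    {
        // Group 1 captures the quote character, group 2 the URL;
        // \1 requires the closing quote to match the opening one.
        return Regex.Matches(subjectString, @"<a\s[^>]*?href=(['""])(.+?)\1")
            .Cast<Match>()
            .Select(m => m.Groups[2].Value)
            .ToList();
    }
}
```

Calling HrefFinder.FindHrefs(subjectString) returns the URLs in document order. Like any regular expression approach, this is only suitable for simple, well-formed input.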

Source

2014-03-12 13:40:48

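Can you add more description? –

Changed the code: it now uses a regular expression (it selects the URL from the href attribute of the tag) –

You should use an XML parser, which is more robust and reliable for this kind of task. But if you want something very quick and very dirty, here it is: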

<a.*?<\/a>

If this is too simple and you need to capture the link address or the link content, go with this:

<a.*?href="(?<address>.*?)".*?>(?<content>.*?)<\/a>

Neither of them matches nested tags correctly.

Source
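
For illustration, here is a minimal sketch of how the named-group pattern could be used from C#. The class name AnchorExtractor and the method name Extract are hypothetical. The pattern is the one given above, unchanged, so it shares the same limitations: it expects double-quoted href values and does not handle nested tags.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AnchorExtractor
{
    // The named groups "address" and "content" capture the href value
    // and the text between the opening and closing tags.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a.*?href=""(?<address>.*?)"".*?>(?<content>.*?)<\/a>");

    // Returns one (address, content) pair per matched link.
    public static List<KeyValuePair<string, string>> Extract(string text)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in AnchorPattern.Matches(text))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["address"].Value,
                m.Groups["content"].Value));
        }
        return links;
    }
}
```

Each returned pair holds the link address in Key and the link content in Value.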

2014-03-12 13:40:54 BlackBear
