Extract all image URLs from HTML except those that are commented out

I use this regular expression to get the URLs of all images in an HTML file:

(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

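Is there any way to modify this regular expression so that it excludes any img tags that are commented out with an HTML comment, "<!-- -->"?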

Source

2012-02-24 Andrey

Why not use a proper HTML parser? – 2012-02-24 18:01:19

[The pony he comes...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – 2012-02-24 18:02:34

@Pekka: Because I cannot guarantee that the HTML is 100% "correct". The application receives it from non-IT people, so [badly] malformed HTML is quite likely. – Andrey 2012-02-24 18:06:21

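If your regular expression already works for extracting the images (which is a miracle in itself), consider a second regular expression that strips HTML comments, like this: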

<!--.*?-->

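Replace the matches with an empty string, and any images inside comments will no longer show up in your other regular expression. Alternatively, if you are using PHP (you did not tag a programming language), you can use the strip_tags function with "<img>" as the "allowed tags" parameter. This removes HTML comments as well as other tags that might interfere with your regular expression.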
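The two-step idea above (strip comments first, then extract) can be sketched as follows. This is only an illustration in Python, since the question names no language. Two things differ from the patterns quoted in the thread and are assumptions of this sketch: Python's `re` module does not support variable-length lookbehind, so the URL is taken from a capture group instead, and the comment pattern is compiled with `re.DOTALL` so that `.` also matches newlines inside multi-line comments. The names `COMMENT_RE`, `IMG_SRC_RE`, and `extract_image_urls` are hypothetical.

```python
import re
from typing import List

# Step 1 pattern: an HTML comment. DOTALL lets "." span line breaks,
# so comments that cover several lines are removed as well.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

# Step 2 pattern: the quoted src value of an <img> tag, taken from a
# capture group (Python's re has no variable-length lookbehind).
IMG_SRC_RE = re.compile(
    r"<img\b[^>]*?\bsrc\s*=\s*[\x27\x22]([^\x27\x22]*)[\x27\x22]",
    re.IGNORECASE,
)


def extract_image_urls(html: str) -> List[str]:
    """Return src values of <img> tags that are not inside HTML comments."""
    without_comments = COMMENT_RE.sub("", html)
    return IMG_SRC_RE.findall(without_comments)


sample = """
<p><img src="a.png" alt="kept"></p>
<!-- <img src="hidden.png"> -->
<!--
  <img src='multi-line.png'>
-->
<IMG class="x" SRC='b.jpg'>
"""

print(extract_image_urls(sample))  # ['a.png', 'b.jpg']
```

Like the original expression, this remains a text-level heuristic: it will also pick up attributes such as `data-src`, and it knows nothing about scripts or CDATA sections.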

Source

2012-02-24 18:05:31

This might actually work, thanks! Let me try it... – Andrey 2012-02-24 18:08:35

Yes, the regular expression already works for extracting the image URLs. – Andrey 2012-02-24 18:11:13

It is actually quite simple with the HTML Agility Pack as well, and it has a bunch of settings that help repair bad HTML if needed. For example:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument(); 
doc.OptionAutoCloseOnEnd = true; 
doc.OptionCheckSyntax = false; 
doc.OptionFixNestedTags = true; 
// etc, just set them before calling Load or LoadHtml

http://htmlagilitypack.codeplex.com/

string textToExtractSrcFrom = "... your text here ..."; 

doc.LoadHtml(textToExtractSrcFrom); 

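// SelectNodes returns null when nothing matches; HtmlNodeCollection needs a parent node
var nodes = doc.DocumentNode.SelectNodes("//img[@src]") ?? new HtmlNodeCollection(doc.DocumentNode); 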
foreach (var node in nodes) 
{ 
    string src = node.Attributes["src"].Value; 
} 

// or, as a LINQ projection (requires: using System.Linq;)
var links = nodes.Select(node => node.Attributes["src"].Value);
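For readers who are not on .NET, the same parser-based idea can be sketched with Python's standard-library `html.parser`. This is an illustrative analogue and not the HTML Agility Pack API; the names `ImgSrcCollector` and `extract_image_urls_with_parser` are hypothetical. The parser reports comments through a separate callback, so img tags inside comments never reach `handle_starttag`.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased. Self-closing tags
        # (<img ... />) also end up here via the default handle_startendtag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.urls.append(value)


def extract_image_urls_with_parser(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


markup = (
    '<div><p><img src="a.png"></div>'
    '<!-- <img src="hidden.png"> -->'
    "<IMG SRC='b.jpg' />"
)
print(extract_image_urls_with_parser(markup))  # ['a.png', 'b.jpg']
```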

Source

2012-02-24 22:10:10 jessehouwing
