Regular expression to include with another regular expression

&lt;a href="http://www.somesite.com/" target="_blank"&gt;

I have also dug up this regular expression on the internet to identify the URL portion of that string.

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]

However, this regex does not cover the enclosing escaped HTML text &lt;a href=" and " target="_blank"&gt;.
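To make the gap concrete, here is a minimal Python sketch (the question does not name a language, so Python is only an assumption). It applies the URL pattern to the sample string and shows that only the URL is matched. Because the character classes list A-Z only, the sketch also assumes the pattern was meant to run case-insensitively.

```python
import re

# The URL pattern quoted above. Its character classes only list A-Z,
# so this sketch assumes case-insensitive matching was intended.
URL_RE = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]",
    re.IGNORECASE,
)

sample = '&lt;a href="http://www.somesite.com/" target="_blank"&gt;'

# search() finds the first URL; the surrounding escaped tag text is not part of the match.
match = URL_RE.search(sample)
url_only = match.group(0) if match else None
print(url_only)
```

The match stops at the closing double quote, so the escaped `&lt;a href="` prefix and the `" target="_blank"&gt;` suffix are left out.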

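I need to be able to identify the complete string in a large document, so I need an additional regular expression that also covers the HTML portions around the URL in the string above. What would a regex that finds the whole string look like?

Thanks!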

Source

2011-11-14 Isaiah Nelson

[You shouldn't try to parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Bohemian

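Regex is probably not a good idea with HTML. However, since you have odd character references used as the markup delimiters, it may not be real HTML.

This Perl sample might work, but I really don't know: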

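use strict;
use warnings;

# Two sample tags: one fully escaped, one with a literal "<" and an escaped closing delimiter.
my $samp = '
&lt;a href="http://www.somesite.com/" target="_blank"&gt;
<a target="_blank" href="http://www.someothersite.com/" &gt;
';

# Capture group 1 is the whole opening tag, group 2 is the URL inside href.
my $regex = qr{
(
(?:<|&lt;)a                          # opening delimiter, literal or escaped
    (?=\s) (?:(?!&gt;|>)[\S\s])*    # attributes before href, staying inside the tag
    (?<=\s) href \s* = \s*
     " \s* ((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]) \s* "
    (?:(?!&gt;|>)[\S\s])* (?<!/)     # remaining attributes, rejecting a self-closing tag
(?:>|&gt;)                           # closing delimiter, literal or escaped
)
}x;

# Report every tag found along with the URL it contains.
while ($samp =~ /$regex/g) {
    print "In: '$1'\nfound: '$2'\n--------\n";
}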

Output:

In: '&lt;a href="http://www.somesite.com/" target="_blank"&gt;' 
found: 'http://www.somesite.com/' 
-------- 
In: '<a target="_blank" href="http://www.someothersite.com/" &gt;' 
found: 'http://www.someothersite.com/' 
--------
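For readers not working in Perl, the same idea can be sketched in Python. This is an illustrative port, not something from the answer: the names `URL`, `OPEN`, `CLOSE`, `INSIDE`, `TAG_RE` and `find_links` are hypothetical, and the pattern is assembled from smaller pieces so the URL regex is visibly embedded inside the tag regex.

```python
import re

# URL part, same character classes as the Perl sample above.
URL = r"(?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
OPEN = r"(?:<|&lt;)"               # literal or escaped opening delimiter
CLOSE = r"(?:>|&gt;)"              # literal or escaped closing delimiter
INSIDE = r"(?:(?!&gt;|>)[\s\S])*"  # any text that does not reach a closing delimiter

# Group 1 is the whole opening tag, group 2 is the URL inside href.
TAG_RE = re.compile(
    "(" + OPEN + r"a(?=\s)" + INSIDE
    + r'(?<=\s)href\s*=\s*"\s*(' + URL + r')\s*"'
    + INSIDE + r"(?<!/)" + CLOSE + ")"
)

def find_links(text):
    """Return (whole_tag, url) pairs for every anchor tag found in text."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(text)]

sample = '''
&lt;a href="http://www.somesite.com/" target="_blank"&gt;
<a target="_blank" href="http://www.someothersite.com/" &gt;
'''

for whole, url in find_links(sample):
    print(f"In: '{whole}'\nfound: '{url}'\n--------")
```

Keeping the URL pattern in its own variable means it can be swapped or tested separately from the delimiter handling, which is the "regex included in another regex" arrangement the question asks about.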

Source

2011-11-15 00:19:08 sln
