PHP regular expression to match a specific URL pattern

-1

I want to "scrape" a few hundred URLs from a few hundred HTML pages.

Pattern:

<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>

Source

2010-03-28 zen

What is your question? – user187291 2010-03-28 09:02:25

+12

They... just... never... stop. Tony the Pony.. he comes.... .... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 2010-03-28 09:09:48

'/http:\/\/[^\/]+\/[^.]+\.asp\?urlid=\d+/'
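As a minimal sketch of how a pattern like this could be applied, the snippet below runs `preg_match_all` over the sample markup from the question. It uses `#` as the delimiter so the slashes in the URL need no escaping. The variable names are illustrative only.

```php
<?php
// Sample markup copied from the question.
$html = '<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>';

// Same pattern as above, written with '#' delimiters so '/' can stay unescaped.
$pattern = '#http://[^/]+/[^.]+\.asp\?urlid=\d+#';

// $matches[0] holds every full match found in the string.
preg_match_all($pattern, $html, $matches);
$urls = $matches[0];

print_r($urls);
```

Because `[^.]+` stops at the first dot, this only matches URLs whose path contains no dot before `.asp`. That holds for the sample, but it is an assumption about the real pages.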

But it is better to use an HTML parser. Here is an example with PHP Simple HTML DOM:

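// Assumes simple_html_dom.php from PHP Simple HTML DOM is on the include path
include 'simple_html_dom.php';
$html = file_get_html('http://www.google.com/');

// Find all links and print each href
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';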

Source

2010-03-28 09:02:25 YOU

Here is how to do it properly with the native DOM extension:

// GET file 
$doc = new DOMDocument; 
$doc->loadHtmlFile('http://example.com/'); 

// Run XPath to fetch all href attributes from a elements 
$xpath = new DOMXPath($doc); 
$links = $xpath->query('//a/@href'); 

// collect href attribute values from all DomAttr in array 
$urls = array(); 
foreach($links as $link) { 
    $urls[] = $link->value; 
} 
print_r($urls);

Note that the above will also find relative links. If you do not want those, adjust the XPath to

'//a/@href[starts-with(., "http")]'
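The XPath can be narrowed further to the structure shown in the question. The following is a sketch under assumptions, not the answerer's code: it assumes the wanted links always sit inside an `<h2>` and contain `urlid=`, and it parses an inline string (with a hypothetical extra relative link) instead of fetching a remote page.

```php
<?php
// Sample markup from the question plus a hypothetical relative link that should be skipped.
$html = '<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>'
      . '<p><a href="/about.html">About</a></p>';

// Collect parser warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

// Only href attributes of <a> directly inside <h2> that are absolute and carry a urlid parameter.
$xpath = new DOMXPath($doc);
$links = $xpath->query('//h2/a/@href[starts-with(., "http") and contains(., "urlid=")]');

// Collect the attribute values into a plain array.
$urls = array();
foreach ($links as $link) {
    $urls[] = $link->value;
}

print_r($urls);
```

To process several hundred pages, the same query could run inside a loop that loads each file in turn and appends to `$urls`.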

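On a side note, using regular expressions to match HTML is the road to madness. A regex matches string patterns and knows nothing about HTML elements and attributes. DOM does, which is why you should prefer it over regex for anything beyond matching a trivial string pattern in markup.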

Source

2010-03-28 09:20:07 Gordon
