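Regex expression to find all paths in an HTML string

I have a string containing HTML code that has been encoded with htmlentities.

What I want to do is find all the paths in the document that appear between:

href="XXX", src="XXX".

I do have this regex expression, which finds all the links starting with http, https, ftp and file and lets me iterate over them: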

"/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i"

Any ideas?

Source

2013-02-08 Bernat

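Why not just try to find everything between `href="` and the next `"`? That would be *much* easier and *less* error-prone. – zerkms 2013-02-08 22:26:53

How about `href="([^"]*)"`? Is `"` allowed in a URL? I think a space actually is... – 2013-02-08 22:44:25

@P O'Conbhui: spaces are not allowed, and neither is the `"` character – zerkms 2013-02-09 05:22:28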
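The pattern suggested in the comments can be sketched roughly as follows. This is only an illustration under a few assumptions: the sample markup and variable names are made up, the input is decoded with html_entity_decode first (the question says the string is htmlentities-encoded, so the quotes would otherwise appear as `&quot;`), and only double-quoted attribute values are handled.

```php
<?php
// Hypothetical sample input: markup that has been passed through htmlentities().
$encoded = htmlentities('<a href="/about.html">About</a> <img src="img/logo.png">');

// Decode first so the attribute quotes are literal " characters again.
$html = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// Capture everything between href=" (or src=") and the next double quote.
preg_match_all('/\b(?:href|src)\s*=\s*"([^"]*)"/i', $html, $matches);

// $matches[1] holds only the captured paths, without the attribute names.
$paths = $matches[1];
print_r($paths);
```

Using the negated character class `[^"]*` avoids the need for an ungreedy modifier, since the match cannot run past the closing quote.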

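Update: Doing this with a regex is unreliable. A src=".." or href=".." statement can be part of a comment or of a JavaScript statement. To get the links reliably, I would suggest using XPath: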

<?php 

$html = file_get_contents('http://stackoverflow.com/questions/14782334/regex-expression-to-find-all-paths-in-a-html-string/14782594#14782594'); 
$doc = new DOMDocument(); 
// The @ suppresses warnings about malformed markup.
@$doc->loadHTML($html); 
$selector = new DOMXPath($doc); 

// Select the href of every <a> element and every src attribute.
$result = $selector->query('//a/@href | //@src'); 
foreach ($result as $link) { 
    echo $link->value, PHP_EOL; 
}
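To illustrate why the regex approach is less reliable, here is a small sketch comparing the two on a made-up snippet in which one link sits inside an HTML comment. The markup and variable names are assumptions chosen for the example, not taken from the question.

```php
<?php
// Hypothetical markup: one link is commented out, one is live.
$html = '<html><body><!-- <a href="/old.html">old</a> -->'
      . '<a href="/new.html">new</a></body></html>';

// The regex has no notion of comments, so it reports both paths.
preg_match_all('/href="([^"]*)"/', $html, $m);
$fromRegex = $m[1];

// The DOM parser treats the comment as a comment node, so XPath
// only sees the live <a> element.
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$fromDom = array();
foreach ($xpath->query('//a/@href') as $attr) {
    $fromDom[] = $attr->value;
}

var_dump($fromRegex); // contains /old.html and /new.html
var_dump($fromDom);   // contains only /new.html
```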

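If you do use a regex, I would try to grab the content between the double quotes that follow the = of an href or src attribute. Here is an example of how to get the links from this page using a regex: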

<?php 

$html = file_get_contents('http://stackoverflow.com/questions/14782334/regex-expression-to-find-all-paths-in-a-html-string'); 

// Note the U modifier, which makes the pattern ungreedy.
preg_match_all('/href="(?P<href>.*)"|src="(?P<src>.*)"/U', $html, $m); 
var_dump($m['href']); 
var_dump($m['src']);

Source

2013-02-08 22:47:10 hek2mgl

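You can use the DOM to find all the links for a specific tag. For example, to get the URLs from anchor tags, do this (untested, but it should point you in the right direction):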

function findPaths($url) 
{ 
    $dom = new DOMDocument(); 

    // $url is the page to search; the "@" is there to suppress warnings 
    @$dom->loadHTMLFile($url); 

    $paths = array(); 
    foreach ($dom->getElementsByTagName('a') as $path) 
    { 
        $paths[] = array('url' => $path->getAttribute('href'), 'text' => $path->nodeValue); 
    } 
    return $paths; 
}

You could also evaluate the loaded DOM with XPath to make this easier.

Source
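A minimal sketch of that XPath variant might look like the following. The function name `findPathsInString` and the sample markup are hypothetical, and it assumes the HTML is already available as a decoded string rather than being fetched from a URL.

```php
<?php
// Hypothetical helper (not from the answer above): collect every href and
// src value from an HTML string using DOMXPath.
function findPathsInString(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $paths = array();

    // Select every href and src attribute node anywhere in the document.
    foreach ($xpath->query('//@href | //@src') as $attr) {
        $paths[] = array('attribute' => $attr->nodeName, 'path' => $attr->nodeValue);
    }
    return $paths;
}

$sample = '<html><body><a href="/about.html">About</a><img src="img/logo.png"></body></html>';
print_r(findPathsInString($sample));
```

Compared with getElementsByTagName('a'), a single XPath query covers src attributes on images, scripts and frames as well, at the cost of also returning href values from elements such as link.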

2013-02-08 23:09:37 Jack
