2013-09-25 64 views
3

PHP regex or DOMDocument for matching and removing URLs?

I am trying to use the DOM to extract the links from an HTML page:

$html = file_get_contents('links.html'); 
$DOM = new DOMDocument(); 
$DOM->loadHTML($html); 
$a = $DOM->getElementsByTagName('a'); 
foreach($a as $link){ 
    //echo out the href attribute of the <A> tag. 
    echo $link->getAttribute('href').'<br/>'; 
} 

Output:

http://dontwantthisdomain.com/dont-want-this-domain-name/ 
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/ 
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/ 
http://domain1.com/page-X-on-domain-com.html 

http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html 
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/ 
http://domain.com/page-XZ-on-domain-com.html 

http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/ 
http://dontwantthisdomain2.com/same-as-above/ 
http://domain3.com/page-XYZ-on-domain3-com.html 

I want to remove every result that matches dontwantthisdomain.com, dontwantthisdomain2.com, or dontwantthisdomain3.com, so that the output looks like this:

http://domain1.com/page-X-on-domain-com.html 
http://domain.com/page-XZ-on-domain-com.html 
http://domain3.com/page-XYZ-on-domain3-com.html 

Some people say I should not use regex on HTML, and others say it is fine. Can someone point out how I can remove the unwanted URLs from my HTML file? :)


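Well, the remaining output of your script no longer contains any HTML, does it? So once you have pulled the links out of the HTML with a DOM parser, filtering them with a regex is perfectly fine. In this case there may be simpler options, though. For example, you could use [`parse_url`](http://php.net/manual/en/function.parse-url.php) to get the domain name (the *host*) and then check whether it is on a blacklist of unwanted domains.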
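The suggestion in the comment above could be sketched roughly as follows. The function name `filter_links` and the inline sample HTML are hypothetical, and the sketch assumes that only absolute links (those with a host) should be kept; `DOMDocument`, `getElementsByTagName`, `getAttribute`, and `parse_url` are the calls already mentioned in this thread.

```php
<?php
// Hypothetical helper: returns every href whose host is not on the blacklist.
function filter_links($html, array $blacklist)
{
    $dom = new DOMDocument();
    // Collect libxml warnings about imperfect markup instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $kept = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        // Assumption: links without a host (relative, empty, unparseable) are skipped.
        if (!is_string($host) || $host === '') {
            continue;
        }
        // Exact host comparison, so "domain.com" never matches "dontwantthisdomain.com".
        if (!in_array(strtolower($host), $blacklist, true)) {
            $kept[] = $href;
        }
    }
    return $kept;
}

// Sample input invented for illustration.
$html = '<a href="http://dontwantthisdomain.com/a/">x</a>'
      . '<a href="http://domain1.com/page-X-on-domain-com.html">y</a>';
$blacklist = array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com');
$links = filter_links($html, $blacklist);
print_r($links);
```

Comparing the parsed host exactly, rather than searching the whole URL for a substring, avoids false positives when a blacklisted name appears in the path or inside a longer domain name.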

Answers

1

A solution without regex (and without its potential for errors :-):

$html=' 
http://dontwantthisdomain.com/dont-want-this-domain-name/ 
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/ 
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/ 
http://domain1.com/page-X-on-domain-com.html 

http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html 
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/ 
http://domain.com/page-XZ-on-domain-com.html 

http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/ 
http://dontwantthisdomain2.com/same-as-above/ 
http://domain3.com/page-XYZ-on-domain3-com.html 
'; 

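$html = explode("\n", $html);
$dontWant = array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com');
$final_result = array();
foreach ($html as $link) {
    $link = trim($link);
    // Skip blank lines.
    if ($link === '') continue;
    $ok = true;
    foreach ($dontWant as $notWanted) {
        // strpos() returns false when there is no match; 0 is a valid match position.
        if (strpos($link, $notWanted) !== false) {
            $ok = false;
            break;
        }
    }
    if ($ok) $final_result[] = $link;
}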

echo '<pre>'; 
print_r($final_result); 
echo '</pre>'; 

Output:

Array 
(
    [0] => http://domain1.com/page-X-on-domain-com.html 
    [1] => http://domain.com/page-XZ-on-domain-com.html 
    [2] => http://domain3.com/page-XYZ-on-domain3-com.html 
) 
2

Maybe something like this:

function extract_domains($buffer, $whitelist) { 
    preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches); 
    $result = array(); 
    foreach($matches[1] as $url) { 
     $url = urldecode($url); 
     $parts = @parse_url((string) $url); 
     // Relative URLs have no 'host' key, so check for it before reading it.
     if ($parts !== false && isset($parts['host']) && in_array($parts['host'], $whitelist)) { 
      $result[] = $parts['host']; 
     } 
    } 
    return $result; 
} 

$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com'));

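It does a rough match on every <a> tag that has an href=, grabs whatever is between the double quotes, and then filters the result against your whitelist of domains.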


Why `@parse_url`? Just asking. Suppressing errors is not a good idea. – davidkonrad


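If `parse_url()` fails, it generates a warning (or at least a notice). Since I consider this data "user input", there is no telling what kind of trickery will be put into the `href=""` attribute. I suppress the error and then check for failure manually with the strict not-equals in the `if` statement that follows. – SamT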
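For comparison, here is a sketch of the same check without the `@` operator. It assumes PHP 5.3.3 or later, where, as far as I know, `parse_url()` returns `false` on seriously malformed input instead of emitting a warning. The helper name `host_of` is hypothetical.

```php
<?php
// Hypothetical helper: returns the lower-cased host of $url, or null when there is none.
function host_of($url)
{
    $parts = parse_url((string) $url);
    // false => malformed URL; no 'host' key => relative URL, mailto:, and so on.
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }
    return strtolower($parts['host']);
}

var_dump(host_of('http://Stackoverflow.com/questions'));
var_dump(host_of('/relative/path.html'));
var_dump(host_of('http:///example.com'));
```

Returning `null` for both failure cases lets the caller use a single test before the whitelist lookup, without needing error suppression.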