PHP Regex or DOMDocument for matching and removing URLs?

I'm trying to use DOM to extract the links from an HTML page:

$html = file_get_contents('links.html'); 
$DOM = new DOMDocument(); 
$DOM->loadHTML($html); 
$a = $DOM->getElementsByTagName('a'); 
foreach($a as $link){ 
    //echo out the href attribute of the <A> tag. 
    echo $link->getAttribute('href').'<br/>'; 
}

Output:

http://dontwantthisdomain.com/dont-want-this-domain-name/ 
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/ 
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/ 
http://domain1.com/page-X-on-domain-com.html 

http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html 
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/ 
http://domain.com/page-XZ-on-domain-com.html 

http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/ 
http://dontwantthisdomain2.com/same-as-above/ 
http://domain3.com/page-XYZ-on-domain3-com.html

I want to remove every result that matches dontwantthisdomain.com, dontwantthisdomain2.com or dontwantthisdomain3.com, so that the output looks like this:

http://domain1.com/page-X-on-domain-com.html 
http://domain.com/page-XZ-on-domain-com.html 
http://domain3.com/page-XYZ-on-domain3-com.html

Some people say I shouldn't use regex on HTML, and others say it's fine. Can someone point out how to remove the unwanted URLs from my HTML file? :)

Source

2013-09-25 Kris

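Well, the remaining output of your script doesn't contain any HTML anymore, does it? So once you've pulled the links out of the HTML with a DOM parser, filtering them with a regex is perfectly fine. In this case there may be simpler options, though. For example, you could use [`parse_url`](http://php.net/manual/en/function.parse-url.php) to get the domain name (*host*) and then check whether it is in a blacklist of unwanted domains.

A solution without regex (and without its potential errors :-)):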
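To make the `parse_url` suggestion concrete, here is a minimal sketch, not a definitive implementation. The function name `filter_unwanted_links` is made up for illustration, and it assumes the links have already been collected into an array (for example from the DOMDocument loop in the question). It compares hosts exactly, so a `www.` variant of a blacklisted domain would need its own entry.

```php
<?php
// Hypothetical helper: keep only the URLs whose host is not on the blacklist.
function filter_unwanted_links(array $urls, array $blacklist)
{
    $kept = array();
    foreach ($urls as $url) {
        $url = trim($url);
        if ($url === '') {
            continue; // skip empty entries
        }
        // With PHP_URL_HOST, parse_url() gives the host string,
        // null when the URL has no host, or false on malformed input.
        $host = parse_url($url, PHP_URL_HOST);
        if (is_string($host) && in_array(strtolower($host), $blacklist, true)) {
            continue; // host is blacklisted, drop this link
        }
        $kept[] = $url;
    }
    return $kept;
}

$urls = array(
    'http://dontwantthisdomain.com/dont-want-this-domain-name/',
    'http://dontwantthisdomain2.com/same-as-above/',
    'http://domain1.com/page-X-on-domain-com.html',
    'http://domain3.com/page-XYZ-on-domain3-com.html',
);
$blacklist = array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com');

print_r(filter_unwanted_links($urls, $blacklist));
```

Comparing the parsed host avoids false hits when a blacklisted name merely appears somewhere in the path or query string of another site's URL.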

$html=' 
http://dontwantthisdomain.com/dont-want-this-domain-name/ 
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/ 
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/ 
http://domain1.com/page-X-on-domain-com.html 

http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html 
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/ 
http://domain.com/page-XZ-on-domain-com.html 

http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/ 
http://dontwantthisdomain2.com/same-as-above/ 
http://domain3.com/page-XYZ-on-domain3-com.html 
'; 

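$links = explode("\n", $html); 
$dontWant = array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com'); 
$final_result = array(); 
foreach ($links as $link) { 
    $link = trim($link); 
    if ($link === '') continue; // skip blank lines 
    $ok = true; 
    foreach ($dontWant as $notWanted) { 
     // strpos() returns false when nothing is found; a match at offset 0 still counts 
     if (strpos($link, $notWanted) !== false) { 
      $ok = false; 
      break; 
     } 
    } 
    if ($ok) $final_result[] = $link; 
} 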

echo '<pre>'; 
print_r($final_result); 
echo '</pre>';

Output

Array 
(
    [0] => http://domain1.com/page-X-on-domain-com.html 
    [1] => http://domain.com/page-XZ-on-domain-com.html 
    [2] => http://domain3.com/page-XYZ-on-domain3-com.html 
)

Source

2013-09-25 21:18:19 davidkonrad

Maybe something like this:

function extract_domains($buffer, $whitelist) { 
    // Rough match: every <a ...> whose href value is in double quotes. 
    preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches); 
    $result = array(); 
    foreach($matches[1] as $url) { 
     $url = urldecode($url); 
     $parts = @parse_url((string) $url); 
     // Relative links have no 'host' key, so check for it before reading it. 
     if ($parts !== false && isset($parts['host']) && in_array($parts['host'], $whitelist)) { 
      $result[] = $parts['host']; 
     } 
    } 
    return $result; 
} 

$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com'));

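It does a rough match on every `<a>` tag that has an `href=`, grabs whatever is between the quotes, and then filters that against your whitelist of domains.

Source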
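For comparison, the same whitelist idea can be sketched with DOMDocument instead of a regex, which also picks up single-quoted `href` values that the pattern above skips. This is only an illustration: `extract_whitelisted_hosts` and the sample markup are made up here, not part of the answer.

```php
<?php
// Hypothetical DOM-based variant: collect the hosts of all <a href> links
// that appear in the whitelist.
function extract_whitelisted_hosts($html, array $whitelist)
{
    // Collect libxml warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hosts = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // null for relative links, false for malformed URLs, string otherwise
        $host = parse_url($anchor->getAttribute('href'), PHP_URL_HOST);
        if (is_string($host) && in_array($host, $whitelist, true)) {
            $hosts[] = $host;
        }
    }
    return $hosts;
}

$html = '<p><a href="http://google.com/search">g</a> '
      . "<a href='http://stackoverflow.com/questions'>so</a> "
      . '<a class="x" href="/relative">r</a> '
      . '<a href="http://example.org/">e</a></p>';

print_r(extract_whitelisted_hosts($html, array('google.com', 'stackoverflow.com')));
```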

2013-09-25 21:04:55 SamT

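Why `@parse_url`? Just asking. Suppressing errors is not a good idea. – davidkonrad

If `parse_url()` fails, it generates a warning (or at least a notice). Since this is data I consider "user input", there is no telling what kind of trickery will end up in the `href=""` attribute. I suppress the error and then check the result manually with the strict not-equals in the `if` statement that follows. – SamT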
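As a side note to this exchange: as far as I know, the PHP manual's changelog says the E_WARNING on a failed parse was removed in PHP 5.3.3, so on newer versions the check can probably be written without `@`. A small sketch under that assumption (`host_or_null` is a made-up name):

```php
<?php
// Hypothetical helper: return the host of $url, or null when it has none
// or when parse_url() reports a malformed URL by returning false.
function host_or_null($url)
{
    $parts = parse_url((string) $url);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }
    return $parts['host'];
}

var_dump(host_or_null('http://domain1.com/page-X-on-domain-com.html')); // string "domain1.com"
var_dump(host_or_null('/relative/path.html'));                           // NULL
var_dump(host_or_null('http:///example.com'));                           // NULL
```

The `isset()` check matters as much as the `false` check, because relative links parse successfully but carry no `host` key.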
