How can I extract links for a specific domain using PHP and regex?

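I am trying to extract URLs containing www.domain.com from a database column that holds HTML. The regex must filter out www2.domain.com instances and external URLs such as www.domainxyz.com. It should only match properly coded anchor links.

This is what I have so far: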

<?php 
    $content = '<html> 
    <title>Random Website</title> 
    <body> 
     Click <a href="http://domainxyz.com">here</a> for foobar 
     Another site is http://www.domain.com 
     <a href="http://www.domain.com/test">Test 1</a> 
     <a href="http://www2.domain.com/test">Test 2</a> 
     <Strong>NOT A LINK</strong> 
    </body> 
    </html>'; 

    $regex = "((https?)\:\/\/)?"; 
    $regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; 
    $regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; 
    $regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; 
    $regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; 
    $regex .= "([www\.domain\.com])"; 

    $matches = array(); //create array 
    $pattern = "/$regex/"; 

    preg_match_all($pattern, $content, $matches); 

    print_r(array_values(array_unique($matches[0]))); 
    echo "<br><br>"; 
    echo implode("<br>", array_values(array_unique($matches[0]))); 
?>

I want this to find and output only http://www.domain.com/test.

How can I modify my regex to accomplish this?

Asked 2015-09-08 by andyy15

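How about a solution based on DOMDocument and DOMXPath? I see you are only extracting href attribute values, right?

Thanks, I considered that, but would such a solution work if the HTML comes from a database query? – andyy15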

Please check this code: http://ideone.com/L1DDDp. I would suggest using a regex here only as a last resort.
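For the case where a regex really is the last resort, here is a minimal sketch of a pattern that does what the question asks. It is not the code from the linked demo. The sample $content is a shortened, hypothetical copy of the HTML in the question, and the pattern assumes quoted href values inside <a> tags; it will not cope with every form of markup.

```php
<?php
// Hypothetical sample input, shortened from the HTML in the question.
$content = '<a href="http://domainxyz.com">here</a>
Another site is http://www.domain.com
<a href="http://www.domain.com/test">Test 1</a>
<a href="http://www2.domain.com/test">Test 2</a>';

// Match quoted href values inside <a> tags whose host is exactly www.domain.com.
// The lookahead requires the host to end right after ".com" (followed by
// "/", "?", "#" or the closing quote), so www.domain.com.example.org is rejected.
$pattern = '~<a\s[^>]*?href=(["\'])(https?://www\.domain\.com(?=[/?#"\'])[^"\']*)\1~i';

preg_match_all($pattern, $content, $matches);

// Capture group 2 holds the URL itself; group 1 is only the quote character.
$urls = array_values(array_unique($matches[2]));
print_r($urls);
```

Only the properly coded anchor pointing at www.domain.com is kept; the bare URL in the text, the www2 host and the domainxyz.com link are all skipped.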

Here is a safer way to extract the href attribute values of <a> elements that contain www.domain.com. The key is the XPath '//a[contains(@href, "www.domain.com")]':

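$html = "YOUR_HTML_STRING"; // Your HTML string
$dom = new DOMDocument;
libxml_use_internal_errors(true); // suppress parser warnings on imperfect HTML
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = array();
// Select every <a> element whose href contains www.domain.com
$links = $xpath->query('//a[contains(@href, "www.domain.com")]');

foreach($links as $link) {
    array_push($arr, $link->getAttribute("href")); // collect the href value
}

print_r($arr);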

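See the IDEONE demo for the result.

As you can see, DOMDocument and DOMXPath can be used here as well.

The code is self-explanatory; the XPath expression simply means "find all <a> tags that have an href attribute containing www.domain.com".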
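Note that contains() is a plain substring test, so it would also accept an href such as http://www.domain.com.example.org/ or one that merely mentions www.domain.com in its query string. If that matters, one possible refinement is to select every <a> with an href and compare the parsed host exactly. The sketch below is only an illustration: the helper name extractLinksForHost and the sample HTML are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: returns the href values of <a> elements whose host is exactly $host.
function extractLinksForHost(string $html, string $host): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on imperfect HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ((new DOMXPath($dom))->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        // parse_url() returns null or false when there is no usable host.
        $linkHost = parse_url($href, PHP_URL_HOST);
        if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

// Hypothetical sample input modelled on the question's HTML.
$html = '<html><body>
 <a href="http://www.domain.com/test">Test 1</a>
 <a href="http://www2.domain.com/test">Test 2</a>
 <a href="http://www.domain.com.example.org/x">Lookalike</a>
 <a href="http://example.org/?ref=www.domain.com">Mention</a>
</body></html>';

$result = extractLinksForHost($html, 'www.domain.com');
print_r($result);
```

Because the helper only takes a string, HTML fetched from a database column can be passed to it one row at a time.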

Answered 2015-09-08 22:28:04

Thanks for sharing how to use the DOMDocument and DOMXPath approach. Since it is a better solution than regex, I ended up going this route :) – andyy15
