2013-07-13 30 views
0

PHP: DOM get URL and anchor (but not IMG)

I want to collect all the URLs from an HTML page into an array. Given HTML like:

This is a webpage <a href="http://somesite.com/link1.php">with</a> 
different kinds of <a href="http://somesite.com/link1.php"><img src="someimg.png"></a> 

The output I want is:

with => http://somesite.com/link1.php 

What I get now is:

<img src="someimg.png"> => http://somesite.com/link1.php 
with => http://somesite.com/link1.php 

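I don't want the URLs of links that contain an image between the opening and closing <a> tags. I only want the ones that contain text.

My current code is: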

<?php 

function innerHTML($node) { 
    $ret = ''; 

    // Serialize each child node; $child avoids shadowing the $node parameter.
    foreach ($node->childNodes as $child) { 
        $ret .= $child->ownerDocument->saveHTML($child); 
    } 

    return $ret; 
} 

$html = file_get_contents('http://somesite.com/'.$_GET['apt']); 

$dom = new DOMDocument; 
@$dom->loadHTML($html); // @ suppresses the warnings loadHTML emits for malformed HTML 
$links = $dom->getElementsByTagName('a'); 
$result = array(); 

foreach ($links as $link) { 
    //$node = $link->nodeValue; 
    $node = innerHTML($link); 
    $href = $link->getAttribute('href'); 

    // Keep only links whose href ends in .pdf 
    if (preg_match('/\.pdf$/i', $href)) { 
        $result[$node] = $href; 
    } 
} 

print_r($result); 

?> 

Answer

-1

Add a second preg_match to your condition:

if(preg_match('/\.pdf$/i',$href) && !preg_match('/<img .*>/i',$node)) $result[$node] = $href; 
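The regex check above works on the serialized inner HTML. As an alternative, here is a minimal sketch that asks the DOM directly whether an anchor contains an <img> element. The function name textOnlyLinks is hypothetical and not part of the original code, and the .pdf filter from the question is left out here so the example matches the sample HTML; it could be added back to the condition.

```php
<?php
// Hypothetical helper (not from the original post): returns text => href
// for every <a> that does not contain an <img> element.
function textOnlyLinks(string $html): array
{
    $dom = new DOMDocument;

    // Collect libxml parse warnings internally instead of silencing with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        // Skip anchors that have an <img> anywhere inside them.
        if ($link->getElementsByTagName('img')->length > 0) {
            continue;
        }
        // textContent is the anchor's text with any markup removed.
        $text = trim($link->textContent);
        if ($text !== '') {
            $result[$text] = $link->getAttribute('href');
        }
    }
    return $result;
}

// Sample input taken from the question.
$html = 'This is a webpage <a href="http://somesite.com/link1.php">with</a> '
      . 'different kinds of <a href="http://somesite.com/link1.php"><img src="someimg.png"></a>';

print_r(textOnlyLinks($html));
```

With the sample HTML, the printed array should contain only the "with" entry. Note that, as in the original code, two links with the same text overwrite each other because the text is used as the array key.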
+0

Perfect! Thanks! :)