2014-07-19 53 views

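DOMXPath query does not work if the class name contains a space

I am trying to extract links from an external website using a PHP bot. The links are inside cells like this one: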

<td class=" title-col"> <a href="http://examplenews101.com/post1">News 1</a> </td> 

Note that there is a space before "title-col".
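As a sanity check, here is a minimal sketch that runs both class queries against the sample cell, assuming the markup is present in the static HTML exactly as shown. The `<table><tr>` wrapper and the variable names are illustrative additions, and the token query uses the common `" title-col "` form with surrounding spaces. If both queries match here, the leading space in the attribute is unlikely to be the cause by itself.

```php
<?php
// Sample cell from the question, wrapped in <table><tr> (an assumption)
// so that the HTML parser keeps the <td> element.
$html = '<table><tr><td class=" title-col"> '
      . '<a href="http://examplenews101.com/post1">News 1</a> </td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Token match: normalize-space() trims the leading space, and the padded
// concat() lets " title-col " match only a whole class name.
$token = $xpath->query(
    '//td[contains(concat(" ", normalize-space(@class), " "), " title-col ")]'
);

// Plain substring match, for comparison.
$substring = $xpath->query('//td[contains(@class, "title-col")]');

// Collect the link targets found inside the matched cells.
$hrefs = [];
foreach ($token as $td) {
    foreach ($td->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
}

echo $token->length, ' ', $substring->length, "\n";
print_r($hrefs);
```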

Here is the script I'm using; it fails to extract the links:

function crawl_page($url, $depth = 5) {
    // Remember visited URLs across recursive calls
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    // also tried true, but no change in results
    $dom->preserveWhiteSpace = false;
    @$dom->loadHTMLFile($url);
    $xpath = new DOMXPath($dom);
    $td = $xpath->query('//td[contains(concat(" ", normalize-space(@class), " "), "title-col")]');
    // also tried this, but it does not work either
    //$td = $xpath->query('//td[contains(@class,"title-col")]');

    // I only get values when I use this
    //$td = $dom->getElementsByTagName('td');

    foreach ($td as $t) {
        $anchors = $t->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            // Turn a relative link into an absolute URL based on $url
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }

    echo "URL:" . $url . "<br/>";
}

I only get values when I use this:

$td = $dom->getElementsByTagName('td'); 

But I need to query by class.

Thanks

Answer


I figured it out: the attributes are generated by JavaScript, so they are not in the HTML that DOMDocument loads.
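One way to confirm this kind of cause is to compare the two counts on the raw HTML. The sketch below assumes a hypothetical page where the class is absent from the markup and only set by a script in the browser; the sample HTML and the `countCells` helper are made up for illustration. DOMDocument does not execute scripts, so the class query finds nothing while `getElementsByTagName('td')` still finds the cell, which is the symptom described in the question.

```php
<?php
// Hypothetical raw HTML as a server might send it: the <td> has no class,
// and a script would add " title-col" only once the page runs in a browser.
$rawHtml = <<<'HTML'
<table><tr><td> <a href="http://examplenews101.com/post1">News 1</a> </td></tr></table>
<script>document.querySelector('td').className = ' title-col';</script>
HTML;

// Illustrative helper: count all <td> cells and those matched by class.
function countCells(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    return [
        'all_td'   => $dom->getElementsByTagName('td')->length,
        'by_class' => $xpath->query(
            '//td[contains(concat(" ", normalize-space(@class), " "), " title-col ")]'
        )->length,
    ];
}

$counts = countCells($rawHtml);
print_r($counts);
```

If `all_td` is positive and `by_class` is zero for the fetched page, the class probably exists only in the browser-rendered DOM, and the query would need to target something that is present in the raw markup.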
