2015-04-23 65 views
2

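I am writing code to fetch web data with XPath and cURL. How can I skip the header row with XPath and cURL?

This code retrieves the contents of the UL and LI elements, and it works.

However, I don't want the header row.

I wrote the code below to skip the header, but it doesn't work: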

if($row->item(0)->tagName != '<ul class="graybg"><li>مدل خودرو</li> <li>مشخصات</li><li>قیمت نمایندگی</li><li>قیمت بازار آزاد</li></ul>') 

Full code:

$ch = curl_init ("http://www.pedal.ir/price/"); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); 
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1;  en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13'); 
curl_setopt($ch, CURLOPT_HEADER, 0); 
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8'); 
$page = curl_exec($ch); 


$dom = new DOMDocument('1.0', 'utf-8'); 
libxml_use_internal_errors(true); 
$dom->loadHTML($page); 
libxml_clear_errors(); 
$xpath = new DOMXpath($dom); 

$data = array(); 
$table_rows = $xpath->query('/html/body/div/div[1]/div/div/div/div/div/div/div[2]/ul'); // target every <ul>, one per row

if ($table_rows->length <= 0) { // exit if not found
    echo 'no table rows found';
    exit;
}

foreach ($table_rows as $tr) { // foreach row
    $row = $tr->childNodes;
    if ($row->item(0)->tagName != '<ul class="graybg"><li>مدل خودرو</li> <li>مشخصات</li><li>قیمت نمایندگی</li><li>قیمت بازار آزاد</li></ul>') { // avoid headers
        $data[] = array(
            'moled' => trim($row->item(0)->nodeValue),
            'detail' => trim($row->item(2)->nodeValue),
            'pricenama' => trim($row->item(4)->nodeValue),
            'pricebaza' => trim($row->item(6)->nodeValue),
        );
    }
}

echo '<pre>';
print_r($data);

Answers

1

As an alternative, since the header has a distinct class that identifies it, you can include that class in the check:

foreach($table_rows as $tr) { // foreach row 
    $row = $tr->childNodes; 

    if($row->item(0)->parentNode->getAttribute('class') !== 'graybg') { // avoid headers 
     $data[] = array(
      'moled' =>trim($row->item(0)->nodeValue), 
      'detail' => trim($row->item(2)->nodeValue), 
      'pricenama' => trim($row->item(4)->nodeValue), 
      'pricebaza' => trim($row->item(6)->nodeValue), 
     ); 
    } 
} 
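To see the class check in isolation, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The sample markup and the `model`/`price` keys are assumptions made up for illustration; only the `graybg` class comes from the question. It also selects `li` children with a relative query rather than by `childNodes` index, which is a swapped-in technique to sidestep whitespace text nodes.

```php
<?php
// Hypothetical sample markup; the real page's structure may differ.
$html = '<div>'
      . '<ul class="graybg"><li>Model</li><li>Price</li></ul>'
      . '<ul><li>Car A</li><li>100</li></ul>'
      . '<ul><li>Car B</li><li>200</li></ul>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$data = array();
foreach ($xpath->query('//ul') as $ul) {
    // tagName holds only the element name ("ul"), never a markup string,
    // so the header is identified by its class attribute instead.
    if ($ul->getAttribute('class') === 'graybg') {
        continue; // skip the header list
    }
    // Relative query: element children only, no whitespace text nodes.
    $cells = $xpath->query('./li', $ul);
    $data[] = array(
        'model' => trim($cells->item(0)->nodeValue),
        'price' => trim($cells->item(1)->nodeValue),
    );
}

print_r($data);
```

Comparing `tagName` with a block of HTML can never match, which is why the original condition let the header through.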


1

You can add the predicate [not(@class)] to your XPath to filter out any <ul> that has a class attribute:

/html/body/div/div[1]/div/div/div/div/div/div/div[2]/ul[not(@class)] 
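The effect of the predicate can be checked on a small inline document. This is only a sketch: the markup below is a made-up stand-in for the real page, and the short `//div/ul` path is used to keep the example readable.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<div>'
      . '<ul class="graybg"><li>Model</li><li>Price</li></ul>'
      . '<ul><li>Car A</li><li>100</li></ul>'
      . '<ul><li>Car B</li><li>200</li></ul>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Without the predicate the header list is included.
$all = $xpath->query('//div/ul');
// With the predicate, only lists that carry no class attribute remain.
$rows = $xpath->query('//div/ul[not(@class)]');

$names = array();
foreach ($rows as $ul) {
    $names[] = trim($xpath->query('./li', $ul)->item(0)->nodeValue);
}

echo $all->length, ' ', $rows->length, "\n"; // prints "3 2"
print_r($names);
```

Note that `[not(@class)]` drops every `<ul>` that has any class attribute. If data rows on the real page also carry a class, the narrower `[not(@class="graybg")]` excludes only the header.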

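In any case, absolute paths are unreliable, because they tend to break on small changes to the HTML source. Try building the XPath from the elements' id or class instead.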
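As a rough illustration of that advice, the sketch below anchors the query on a container id and runs it against two documents that nest the container at different depths. The id `price-list`, the helper `extractModels`, and the sample markup are all hypothetical; substitute whatever id or class the real page provides.

```php
<?php
// Returns the first cell of every non-header list under the container.
// The id "price-list" is hypothetical, not taken from the real page.
function extractModels($html)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Anchored on the id, so the nesting above the container is irrelevant.
    $query = '//div[@id="price-list"]/ul[not(@class="graybg")]/li[1]';

    $models = array();
    foreach ($xpath->query($query) as $li) {
        $models[] = trim($li->nodeValue);
    }
    return $models;
}

$lists = '<div id="price-list">'
       . '<ul class="graybg"><li>Model</li><li>Price</li></ul>'
       . '<ul><li>Car A</li><li>100</li></ul>'
       . '</div>';

// The same container, nested at two different depths.
$shallow = '<html><body>' . $lists . '</body></html>';
$deep    = '<html><body><div><div>' . $lists . '</div></div></body></html>';

print_r(extractModels($shallow));
print_r(extractModels($deep));
```

Both calls return the same single entry, whereas an absolute path written for one layout would select nothing in the other.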

+0

Thanks, it worked.. – rahavard