2011-05-11 35 views
1

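Screen scraping with cURL only fetches the header and footer

I've been struggling to screen-scrape this website. The script works with many other sites, but for some reason it only retrieves the header and footer of this page (http://www.coast-stores.com/SOPHIE-DRESS/Dresses/coast/fcp-product/2224724715).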

function get_url($url, $javascript_loop = 0) { 
$curl = curl_init(); 

$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,"; 
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
$header[] = "Cache-Control: max-age=0"; 
$header[] = "Connection: keep-alive"; 
$header[] = "Keep-Alive: 300"; 
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
$header[] = "Accept-Language: en-us,en;q=0.5"; 
$header[] = "Pragma: "; 

$cookie = '/cookies.txt'; 
$timeout = 30; 




curl_setopt($curl, CURLOPT_URL,    $url); 
curl_setopt($curl, CURLOPT_USERAGENT,  'Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (.NET CLR 3.5.30729)'); 
curl_setopt($curl, CURLOPT_HTTPHEADER,  $header); 
curl_setopt($curl, CURLOPT_ENCODING,  'gzip,deflate'); 
curl_setopt($curl, CURLOPT_AUTOREFERER,  true); 
curl_setopt($curl, CURLOPT_REFERER,   'http://google.co.uk/'); 
curl_setopt($curl, CURLOPT_TIMEOUT,   20); 
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout); 
curl_setopt($curl, CURLOPT_COOKIEJAR,  $cookie); 
curl_setopt($curl, CURLOPT_COOKIEFILE,  $cookie); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); # required for https urls 
curl_setopt($curl, CURLOPT_MAXREDIRS,  30); 

curl_setopt($curl, CURLOPT_BINARYTRANSFER, true); 

$responseHTML = curl_exec($curl); 
$response  = curl_getinfo($curl); 

curl_close($curl); // close the connection 

//return $html; // and finally, return $html 


if ($response['http_code'] == 301 || $response['http_code'] == 302) 
{ 
    ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1"); 

    if ($headers = get_headers($response['url'])) 
    { 
     foreach($headers as $value) 
     { 
      if (substr(strtolower($value), 0, 9) == "location:") 
       return get_url(trim(substr($value, 9, strlen($value)))); 
     } 
    } 
} 

if (
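    // Look for a JavaScript redirect in the fetched body ($responseHTML holds the curl_exec() result)
    (preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $responseHTML, $value) 
    || preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $responseHTML, $value)) 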
    && $javascript_loop < 5 
) 
{ 
    return get_url($value[1], $javascript_loop+1); 
} 
else 
{ 
    return $responseHTML; //array($content, $response); 
} 
} 

// uses the function and displays the text off the website 
$text = get_url($_GET['url']); 
echo $text; 

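Any idea why it isn't fetching the main content? Is the content delivered after the HTML is displayed?

A live example of the script is running here: http://www.mattfacer.com/scraping/scraping2.php?url=http://www.coast-stores.com/SOPHIE-DRESS/Dresses/coast/fcp-product/2224724715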

Try it with any other website and it seems to work!

Thanks for your help!

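That URL doesn't even work in my browser. I get a 301 to the home page every time, and I can't navigate to the page manually either. I don't think it exists anymore. It's hard to fetch a URL that has been removed. – 2011-05-11 22:31:17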
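
The 301 described in this comment can be confirmed from PHP without a browser. Below is a minimal sketch, not the asker's code: `find_location_header` and `inspect_redirect` are hypothetical helper names introduced here for illustration. It assumes the cURL extension is loaded, and it only reads the response headers with redirect-following turned off, so the status code and redirect target stay visible.

```php
<?php
// Hypothetical helper: return the value of the first Location header
// found in a list of raw header lines, or null if there is none.
function find_location_header(array $headerLines): ?string
{
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match with stripos().
        if (stripos($line, 'location:') === 0) {
            return trim(substr($line, strlen('location:')));
        }
    }
    return null;
}

// Hypothetical helper: request only the headers of $url, without
// following redirects, and report the status code and redirect target.
function inspect_redirect(string $url): array
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);          // headers only, no body
    curl_setopt($curl, CURLOPT_HEADER, true);          // include headers in the output
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return instead of printing
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false); // keep the 301/302 visible
    $rawHeaders = curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl);

    $lines = is_string($rawHeaders) ? preg_split('/\r?\n/', $rawHeaders) : [];
    return ['status' => $status, 'location' => find_location_header($lines)];
}
```

Calling `inspect_redirect()` on the product URL from the question would show whether the server answers with a redirect and where it points. If the target is the home page, the scraper is receiving exactly what the server sends, and the missing content is not a cURL problem.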

If you try it with a URL that actually exists, it works fine: http://www.mattfacer.com/scraping/scraping2.php?url=http://us.coast-stores.com/Coast-Allure-Maxi/dp/B004SFF0V2 – 2011-05-11 22:34:45

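That's odd. I wonder whether their site does some kind of country check based on IP address? The server is in the States. I think you are too, since I can also see the whole page with prices in dollars. However, with the URL from my example I can load the page in my browser, yet the script doesn't return its content. It's all very strange. – 2011-05-12 09:08:39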

Answer

0

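I checked the URL, but it redirects back to the home page. Can you tell us how you reach the page you're looking for? Either the page is gone, or the site uses cookies to grant access to certain parts of the site.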
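
If cookies are the cause, one thing worth ruling out is the cookie jar itself. The script in the question writes to `/cookies.txt` at the filesystem root, and the web server user may not be able to write there (an assumption, since the server setup is not shown). Here is a minimal sketch that uses a writable temporary file instead. `make_cookie_jar` and `fetch_with_cookies` are hypothetical names, and the cURL extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: create an empty cookie jar file in the system
// temp directory, which the PHP process can normally write to.
function make_cookie_jar(): string
{
    $jar = tempnam(sys_get_temp_dir(), 'cookies_');
    if ($jar === false) {
        throw new RuntimeException('Could not create a cookie jar file');
    }
    return $jar;
}

// Hypothetical helper: fetch $url, storing and re-sending cookies via $jar
// and letting cURL follow redirects itself.
function fetch_with_cookies(string $url, string $jar)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_COOKIEJAR, $jar);       // write cookies here when the handle closes
    curl_setopt($curl, CURLOPT_COOKIEFILE, $jar);      // read cookies from here
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // cookies set by a redirect are sent on the next hop
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($curl);                          // string on success, false on failure
    curl_close($curl);
    return $html;
}
```

One way to use it is to call `fetch_with_cookies()` on the home page first and then on the product page with the same jar. This mimics a browser that already holds the site's session cookies. Whether that is enough for this particular site is not something the thread confirms.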