PHP cURL download problem

I have a function that takes an array of URLs as input. I have verified that the URLs are correct and I can loop through them just fine. I have also used curl_getinfo to verify that cURL is requesting the correct page. However, the cURL output (the HTML) is identical for every page. Here is my code:

$urls = array();
$urls = getpages($mainpage);
print_r($urls);
foreach ($urls as $link) {
    echo $link . '<br><br><br>';
    $circdl = my_curl($link);
    echo $circdl . '<br><br><br>';
    $circdl = NULL;
}

The output of the URL array looks like this:

Array ([0] => http://www.site.com/savings/viewcircular?promotionId=81498&sneakpeek=&currentPageNumber=1 [1] => http://www.site.com/savings/viewcircular?promotionId=81498&sneakpeek=&currentPageNumber=2 

$link also echoes correctly, as does the URL reported by curl_getinfo. I have run another array of URLs through this same loop and they worked fine, so I suspect the problem here is the format of these URLs (the ampersands). I am really stumped as to why these pages are not downloading as expected.
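As a quick sanity check (this snippet is not part of my original code; it assumes $urls is the array printed above and that the site may be redirecting), hashing each response and printing the URL cURL actually ended up on would make it obvious whether the pages really come back identical and whether a redirect is collapsing them onto one page:

foreach ($urls as $link) {
    $curl = curl_init($link);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
    $html = curl_exec($curl);

    // The effective URL can differ from $link if the server redirected the request
    echo $link . '<br>';
    echo 'effective: ' . curl_getinfo($curl, CURLINFO_EFFECTIVE_URL) . '<br>';
    echo 'md5: ' . md5((string) $html) . '<br><br>';

    curl_close($curl);
}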

Here is the my_curl function:

function my_curl($url)
{
    $timeout = 10;
    $error_report = TRUE;
    $curl = curl_init();
    $cookiepath = drupal_get_path('module', 'mymodule') . '/cookies.txt';

    // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
    $header = array();
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE BLANK

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt($curl, CURLOPT_URL,            $url);
    curl_setopt($curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6');
    curl_setopt($curl, CURLOPT_HTTPHEADER,     $header);
    curl_setopt($curl, CURLOPT_REFERER,        'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING,       'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER,    TRUE);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($curl, CURLOPT_COOKIEFILE,     $cookiepath);
    curl_setopt($curl, CURLOPT_COOKIEJAR,      $cookiepath);
    curl_setopt($curl, CURLOPT_TIMEOUT,        $timeout);

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);

    // Check the page request (left in for debugging)
    //$info = curl_getinfo($curl);
    //echo 'Took ' . $info['total_time'] . ' seconds to send a request to ' . $info['url'];

    // ON FAILURE HANDLE ERROR MESSAGE
    if ($htm === FALSE)
    {
        if ($error_report)
        {
            $err = curl_errno($curl);
            $inf = curl_getinfo($curl);
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS RETURN XML/HTML STRING
    curl_close($curl);
    return $htm;
}

What is really interesting is that if I run this:

echo my_curl('http://www.site.com/savings/viewcircular?promotionId=81498&sneakpeek=&currentPageNumber=2') 

the output is correct!!?? :(
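If it helps, here is the comparison I could run to narrow it down (just a sketch; it assumes $urls is the array returned by getpages()). It checks whether the array element is byte-for-byte identical to the literal string that works:

$literal = 'http://www.site.com/savings/viewcircular?promotionId=81498&sneakpeek=&currentPageNumber=2';

// FALSE here would point at how getpages() builds the URLs rather than at cURL itself
var_dump($urls[1] === $literal);
var_dump(strlen($urls[1]), strlen($literal));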

Thanks for your help!


Can you post the code for the 'my_curl()' method, since it looks like that is the function holding the relevant code? – newfurniturey


I just created an array containing those two pages by hand and ran it through the loop, and the results were fine. The only difference I can see is that the $link variable displays this: http://www.site.com/savings/viewcircular?promotionId=81498&sneakpeek=¤tPageNumber=1 instead of this: http://www.site.com/savings/viewcircular?promotionId=81498&sneakpeek=&currentPageNumber=1. I definitely think this is an encoding problem. –
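(For what it's worth, that ¤ is most likely just the browser treating the un-escaped "&curren" in the echoed link as the legacy HTML entity for the currency sign; the underlying string may still be fine. A tiny sketch of how to see the real value, using the same URL as above:)

$link = 'http://www.site.com/savings/viewcircular?promotionId=81498&sneakpeek=&currentPageNumber=1';

echo $link . '<br>';                    // browser may render "...sneakpeek=¤tPageNumber=1"
echo htmlspecialchars($link) . '<br>';  // ampersands are escaped, so the real string is visible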

Answer


I found that the problem was with the encoding of the URLs being passed to my function. I was mistakenly stripping off the encoding and appending a "human-readable" ending to each URL, which kept the host from recognizing the page correctly. The fix was to ignore my better judgment and simply leave the encoding alone. Once the array was passed through untouched, the pages loaded correctly. Thanks to everyone who looked at this; it really had me stumped!

Here is the relevant snippet of my code for reference:

function getpages($url) {
    global $host;
    $circdl = my_curl($url);
    $circqp = htmlqp($circdl, 'body');

    // Extract last page number
    $lastpagenumber = $circqp->branch()->find('li[class="last-page"]')->text();
    $lastpagenumberurl = $circqp->branch()->find('li[class="last-page"]')->children('a')->attr('href');

    // Extract page link root
    $pagelinkroot = substr_replace($lastpagenumberurl, "", -2);
    $currentpage = "=";
    $lpn = intval($lastpagenumber);

    // Move through the remaining pages
    $pagelinks = array();
    for ($i = 1; $i <= $lpn; ++$i) {
        $pagelinks[] = join(array($host, $pagelinkroot, $currentpage, $i));
    }
    return $pagelinks;
}

substr_replace is what was stripping down the encoding. I changed its length from 20 to 2 so that only the trailing page number is removed, and the page number is appended back onto each link inside the loop.
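To illustrate (with a hypothetical href, since the real markup isn't shown here, and assuming a single-digit page number at the end): with a length of -2 only the trailing "=N" is dropped, so the rest of the query string, including its encoding, is left untouched.

$href = '/savings/viewcircular?promotionId=81498&sneakpeek=&currentPageNumber=5';

$pagelinkroot = substr_replace($href, "", -2);   // removes the trailing "=5"
echo $pagelinkroot . '=' . 3;                    // ...&currentPageNumber=3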