
Optimize PHP cURL-based link checking script - currently slow

I'm using a PHP script (with cURL) to check whether:

  • the links in my database are valid (i.e. return HTTP status 200)
  • links that redirect actually redirect to an appropriate/similar page (based on the page content)

The results are saved to a log file and emailed to me as an attachment.

This all works, but it's slow as hell, and half the time it times out and aborts partway through. For what it's worth, I have about 16,000 links to check.

Any ideas on how to make this run faster, and what am I doing wrong? Code below:

function echoappend($file, $tobewritten) {
    fwrite($file, $tobewritten);
    echo $tobewritten;
}

error_reporting(E_ALL);
ini_set('display_errors', '1');

$filename = date('YmdHis') . "linkcheck.htm";
echo $filename;
$file = fopen($filename, "w+");

try {
    $conn = new PDO('mysql:host=localhost;dbname=databasename', $un, $pw);
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    echo '<b>connected to db</b><br /><br />';

    $sitearray = array("medical.posterous", "ebm.posterous", "behavenet", "guidance.nice", "www.rch", "emedicine", "www.chw", "www.rxlist", "www.cks.nhs.uk");

    foreach ($sitearray as $key => $value) {
        $site = $value;

        echoappend($file, "<h1>" . $site . "</h1>");

        $q = "SELECT * FROM link WHERE url LIKE :site";
        $stmt = $conn->prepare($q);
        $stmt->execute(array(':site' => 'http://' . $site . '%'));
        $result = $stmt->fetchAll();

        $totallinks = 0;
        $workinglinks = 0;

        foreach ($result as $row) {
            $ch = curl_init();
            $originalurl = $row['url'];

            curl_setopt($ch, CURLOPT_URL, $originalurl);
            curl_setopt($ch, CURLOPT_HEADER, 1);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_NOBODY, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

            $output = curl_exec($ch);
            if ($output === FALSE) {
                echo "cURL Error: " . curl_error($ch);
            }

            $urlinfo = curl_getinfo($ch);

            if ($urlinfo['http_code'] == 200) {
                echoappend($file, $row['name'] . ": <b>working!</b><br />");
                $workinglinks++;
            }
            else if ($urlinfo['http_code'] == 301 || 302) {
                $redirectch = curl_init();
                curl_setopt($redirectch, CURLOPT_URL, $originalurl);
                curl_setopt($redirectch, CURLOPT_HEADER, 1);
                curl_setopt($redirectch, CURLOPT_RETURNTRANSFER, 1);
                curl_setopt($redirectch, CURLOPT_NOBODY, false);
                curl_setopt($redirectch, CURLOPT_FOLLOWLOCATION, true);

                $redirectoutput = curl_exec($redirectch);

                $doc = new DOMDocument();
                @$doc->loadHTML($redirectoutput);
                $nodes = $doc->getElementsByTagName('title');

                $title = $nodes->item(0)->nodeValue;

                echoappend($file, $row['name'] . ": <b>redirect ... </b>" . $title . " ... ");

                if (strpos(strtolower($title), strtolower($row['name'])) === false) {
                    echoappend($file, "FAIL<br />");
                }
                else {
                    $header = curl_getinfo($redirectch);
                    echoappend($file, $header['url']);
                    echoappend($file, "SUCCESS<br />");
                }

                curl_close($redirectch);
            }
            else {
                echoappend($file, $row['name'] . ": <b>FAIL code</b>" . $urlinfo['http_code'] . "<br />");
            }

            curl_close($ch);

            $totallinks++;
        }

        echoappend($file, '<br />');
        echoappend($file, $site . ": " . $workinglinks . "/" . $totallinks . " links working. <br /><br />");
    }

    $conn = null;
    echo '<br /><b>connection closed</b><br /><br />';

} catch (PDOException $e) {
    echo 'ERROR: ' . $e->getMessage();
}

What's the error message when your script aborts itself? – ariefbayu


Try 'CURLOPT_CONNECTTIMEOUT' and 'CURLOPT_TIMEOUT' and set them to 5(?) seconds. –
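For example, a minimal sketch on the existing handle (the 5/10-second values are illustrative, not recommendations):

    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // give up if connecting takes longer than 5s
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // abort the whole request after 10s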


I get "PHP Fatal error: Maximum execution time of 30 seconds exceeded" on these lines: "$output = curl_exec($ch);" and "$redirectoutput = curl_exec($redirectch);" – Tomcat

Answer


The short answer is: use the curl_multi_* functions to run your requests in parallel.

The reason for the slowness is that web requests are comparatively slow. Sometimes very slow. Using the curl_multi_* functions lets you run multiple requests simultaneously.

One thing to be careful of is to limit the number of requests running at any one time. In other words, don't fire off 16,000 requests at once. Maybe start at 16 and see how that goes.

The following example should help you get started:

<?php

//
// Fetch a bunch of URLs in parallel. Returns an array of results indexed
// by URL.
//
function fetch_urls($urls, $curl_options = array()) {
    $curl_multi = curl_multi_init();
    $handles = array();

    $options = $curl_options + array(
        CURLOPT_HEADER         => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY         => true,
        CURLOPT_FOLLOWLOCATION => true);

    foreach ($urls as $url) {
        $handles[$url] = curl_init($url);
        curl_setopt_array($handles[$url], $options);
        curl_multi_add_handle($curl_multi, $handles[$url]);
    }

    $active = null;
    do {
        $status = curl_multi_exec($curl_multi, $active);
    } while ($status == CURLM_CALL_MULTI_PERFORM);

    while ($active && ($status == CURLM_OK)) {
        if (curl_multi_select($curl_multi) != -1) {
            do {
                $status = curl_multi_exec($curl_multi, $active);
            } while ($status == CURLM_CALL_MULTI_PERFORM);
        }
    }

    if ($status != CURLM_OK) {
        trigger_error("Curl multi read error $status\n", E_USER_WARNING);
    }

    $results = array();
    foreach ($handles as $url => $handle) {
        $results[$url] = curl_getinfo($handle);
        curl_multi_remove_handle($curl_multi, $handle);
        curl_close($handle);
    }
    curl_multi_close($curl_multi);

    return $results;
}

//
// The URLs to test
//
$urls = array("http://google.com", "http://yahoo.com", "http://google.com/probably-bogus", "http://www.google.com.au");

//
// The number of URLs to test simultaneously
//
$request_limit = 2;

//
// Test URLs in batches
//
$redirected_urls = array();
for ($i = 0; $i < count($urls); $i += $request_limit) {
    $results = fetch_urls(array_slice($urls, $i, $request_limit));
    foreach ($results as $url => $result) {
        if ($result['http_code'] == 200) {
            $status = "Worked!";
        } else {
            $status = "FAILED with {$result['http_code']}";
        }
        if ($result["redirect_count"] > 0) {
            // Remember redirected URLs for the second pass
            array_push($redirected_urls, $url);
            echo "{$url}: redirected to {$result['url']} and {$status}\n";
        } else {
            echo "{$url}: {$status}\n";
        }
    }
}

//
// Handle redirected URLs
//
echo "Processing redirected URLs...\n";
for ($i = 0; $i < count($redirected_urls); $i += $request_limit) {
    $results = fetch_urls(array_slice($redirected_urls, $i, $request_limit), array(CURLOPT_FOLLOWLOCATION => false));
    foreach ($results as $url => $result) {
        // With redirects not followed, 'redirect_url' holds the Location target
        if ($result['http_code'] == 301) {
            echo "{$url} permanently redirected to {$result['redirect_url']}\n";
        } else if ($result['http_code'] == 302) {
            echo "{$url} temporarily redirected to {$result['redirect_url']}\n";
        } else {
            echo "{$url}: FAILED with {$result['http_code']}\n";
        }
    }
}

The code above processes the list of URLs in batches. It makes two passes. On the first pass, every request is configured to follow redirects, and it simply reports whether each URL ultimately led to a successful request or failed.

The second pass processes any redirected URLs detected on the first pass, and reports whether the redirect was permanent (meaning you can update your database with the new URL) or temporary (meaning you should not update your database).
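The example above only reports on redirects; if you also want to keep the <title> comparison from your original script for redirected pages, something along these lines could be bolted on (this title_matches helper is an illustration of mine rather than part of the answer, and the timeout values are arbitrary):

    // Re-fetch a single redirected URL with a body and check whether the
    // expected name appears in its <title>, as the original script does.
    function title_matches($url, $expected_name) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_CONNECTTIMEOUT => 5,
            CURLOPT_TIMEOUT        => 10,
        ));
        $html = curl_exec($ch);
        curl_close($ch);

        if ($html === false) {
            return false;
        }

        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        $nodes = $doc->getElementsByTagName('title');
        if ($nodes->length == 0) {
            return false;
        }

        // Same heuristic as the original: case-insensitive substring match.
        $title = $nodes->item(0)->nodeValue;
        return strpos(strtolower($title), strtolower($expected_name)) !== false;
    }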

Note:

In your original code, you have the following line, which won't work the way you expect:

else if ($urlinfo['http_code'] == 301 || 302) 

That expression will always evaluate to true: the bare literal 302 is truthy on its own, so the || succeeds regardless of the status code. The correct expression is:

else if ($urlinfo['http_code'] == 301 || $urlinfo['http_code'] == 302) 

Thanks! Will try it soon and see if I can at least get it running a bit faster :o) – Tomcat


Also, put

set_time_limit(0); 

at the top of your script to stop it from giving up when it hits 30 seconds.
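Bear in mind that with the execution limit removed, a single unresponsive URL can stall the script indefinitely, so it's worth pairing this with the per-request 'CURLOPT_CONNECTTIMEOUT'/'CURLOPT_TIMEOUT' settings suggested in the comments above.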