2013-02-17 142 views

I'm writing a PHP web crawler script that runs once a week as a cron job and fetches data from over 2000 web pages (the TED website, for example)

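The main purpose of this script is to collect the details of every TED talk available on the TED website (used here as an example, to make the question easier to understand).

The script takes about 70 minutes to run and goes over roughly 2000 web pages.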

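My questions are:

1) Is there a better/faster way to fetch each web page than the function I'm using now:

file_get_contents_curl($url)

2) Is it good practice to keep all the talks in an array (which can get quite large)?

3) Is there a better way to get, for example, all the TED talk details from the website? What is the best way to "crawl" the TED website to get all the talks?

**I have looked into using the RSS feed, but it is missing some of the details I need.

Thanks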

<?php 
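define("START_ID", 1); 
//number of consecutive missing pages after which the crawl stops 
define("STOP_TED_QUERY", 20); 
//generic title seen on pages where no talk exists (despite its name, it marks a page that is NOT a talk) 
define("VALID_PAGE", "TED | Talks"); 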
/** 
* this script will run as a cron job and will go over all pages 
* on TED http://www.ted.com/talks/view/id/ 
* from id 1 till there are no more pages 
*/ 

/** 
* function get a file using curl (fast) 
* @param $url - url which we want to get its content 
* @return the data of the file 
* @author XXXXX 
*/ 
function file_get_contents_curl($url) 
{ 
    $ch = curl_init(); 

    curl_setopt($ch, CURLOPT_HEADER, 0); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt($ch, CURLOPT_URL, $url); 
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); 

    $data = curl_exec($ch); 
    curl_close($ch); 

    return $data; 
} 

//will hold all talks in array 
$tedTalks = array(); 

//id to start the query from 
$id=START_ID; 

//counts consecutive missing pages; tells us when to stop because we reached the last id on the TED website 
$endOFQuery=0; 

//get the time 
$time_start = microtime(true); 

//start the query on TED website 
//if we query 20 pages in a row that do not exist we stop the queries and assume there are no more 
while ($endOFQuery < STOP_TED_QUERY){ 

    //get the page of the talk 
    $html = file_get_contents_curl("http://www.ted.com/talks/view/id/$id"); 

    //parsing begins here (skipped when the fetch failed or returned nothing): 
    $doc = new DOMDocument(); 
    $title = ''; 
    if (is_string($html) && $html !== '') { 
     @$doc->loadHTML($html); 
     //get the <title> element, which may be missing 
     $titleNode = $doc->getElementsByTagName('title')->item(0); 
     if ($titleNode !== null) 
      $title = $titleNode->nodeValue; 
    } 

    //check if this is a valid page (missing talks come back with the generic title, failed fetches with none) 
    if ($title === '' || $title === VALID_PAGE) 
     //this is a removed TED talk or the end of the query so raise a flag (if we get enough of these in a row we will stop) 
     $endOFQuery++; 
    else { 
     //this is a valid TED talk get its details 

     //reset the flag for end of query 
     $endOFQuery = 0; 

     //get meta tags 
     $metas = $doc->getElementsByTagName('meta'); 

     //get the tag we need (keywords); reset it first so a page without 
     //a keywords tag does not reuse the previous talk's value 
     $keywords = ''; 
     for ($i = 0; $i < $metas->length; $i++) 
     { 
      $meta = $metas->item($i); 
      if($meta->getAttribute('name') == 'keywords') 
       $keywords = $meta->getAttribute('content'); 
     } 

     //create new talk object and populate it 
     $talk = new Talk(); 
     //set its ted id from ted web site 
     $talk->setID($id); 
     //parse the name (keep only the part before the first '|' and trim the trailing space) 
     $titleParts = explode('|', $title); 
     $talk->setName(trim($titleParts[0])); 

     //parse the string of tags into an array (trim spaces, drop empty entries) 
     $keywords = array_filter(array_map('trim', explode(",", $keywords)), 'strlen'); 
     //remove un-needed items from it 
     $keywords = array_diff($keywords, array("TED", "Talks")); 

     //add the filters tags to the talk 
     $talk->setTags($keywords); 

     //add to the total talks array 
     $tedTalks[]=$talk; 
    } 

    //move to the next ted talk ID to query 
    $id++; 
} //end of the while 

$time_end = microtime(true); 
$execution_time = ($time_end - $time_start); 
echo "this took (sec) : ".$execution_time; 

?> 
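
On question 2: an array of about 2000 small Talk objects is unlikely to strain memory, but everything collected is lost if the 70-minute run is interrupted. One alternative is to write each talk out as soon as it is parsed. The following is a minimal sketch, assuming one JSON object per line in a file is an acceptable storage format; appendTalk(), readTalks() and the field names id, name and tags are hypothetical and not part of the script above.

```php
<?php
/**
 * Append one talk to an open stream as a single JSON line.
 * (Hypothetical helper for this sketch.)
 *
 * @param resource $stream writable stream, e.g. fopen('talks.jsonl', 'ab')
 */
function appendTalk($stream, int $id, string $name, array $tags): void
{
    $line = json_encode([
        'id'   => $id,
        'name' => trim($name),
        'tags' => array_values($tags), // re-index, array_diff() keeps old keys
    ]);
    if ($line === false) {
        throw new RuntimeException("Could not encode talk $id: " . json_last_error_msg());
    }
    fwrite($stream, $line . "\n");
}

/**
 * Read talks back one at a time, so only a single talk is held in memory.
 * (Hypothetical helper for this sketch.)
 *
 * @param resource $stream readable stream positioned at the first line
 */
function readTalks($stream): Generator
{
    while (($line = fgets($stream)) !== false) {
        $line = trim($line);
        if ($line !== '') {
            yield json_decode($line, true);
        }
    }
}
```

Inside the crawl loop, a call such as appendTalk($out, $id, $name, $keywords) would take the place of $tedTalks[]=$talk, with $out opened once before the loop and closed after it.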
+0

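You can use cURL multi mode to fetch the pages in parallel. You could also look into Yahoo Pipes, which can do the fetching and parsing of the specific data you need from the pages for you. – 2013-02-18 03:42:10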

+0

Henley Chiu - can you show a code snippet for cURL multi mode? – Nimrod007 2013-02-24 07:51:17

+0

I think there are good examples here: http://php.net/manual/en/function.curl-multi-exec.php – 2013-03-01 13:57:39
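
The cURL multi mode mentioned in these comments can be sketched roughly as follows. This is an illustration only, not a tested drop-in for the script in the question: fetchAll() and its $timeoutSeconds parameter are hypothetical names, the per-handle options mirror file_get_contents_curl(), and a page counts as fetched only when the final HTTP status is 200, which is an assumption about what the crawler wants.

```php
<?php
/**
 * Fetch several URLs concurrently with curl_multi.
 * (Hypothetical helper for this sketch.)
 *
 * @param string[] $urls
 * @return array body per URL, or false when that transfer failed
 */
function fetchAll(array $urls, int $timeoutSeconds = 30): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach (array_unique($urls) as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, false);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            // select can fail on some systems; back off briefly instead of spinning
            usleep(1000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $results[$url] = ($code === 200 && is_string($body)) ? $body : false;
        curl_multi_remove_handle($multi, $ch);
    }
    // The handles are released when they go out of scope at function return.

    return $results;
}
```

Passing all 2000 URLs at once would open 2000 simultaneous connections, so it is probably kinder to the site to split the list with array_chunk() into small batches (for example 10 URLs each) and call fetchAll() once per batch.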

Answers