A simple solution using simple_html_dom ...
<?php /* crawlUrlElement.php */
/**
* Created by PhpStorm.
* User: [email protected]
* Date: 15/03/2017
* Time: 15:01
*/
require("simple_html_dom.php");
function crawlUrlElement($url, $search){
    $crawlOptions = array(
        CURLOPT_RETURNTRANSFER => true, // return web page
        CURLOPT_HEADER => false, // don't return headers
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_ENCODING => "", // handle all encodings
        CURLOPT_USERAGENT => "samplebot", // who am i
        CURLOPT_AUTOREFERER => true, // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
        CURLOPT_TIMEOUT => 120, // timeout on response
        CURLOPT_MAXREDIRS => 5, // stop after 5 redirects
    );
    //-- Curl Start --
    $curlObject = curl_init($url);
    curl_setopt_array($curlObject, $crawlOptions);
    $webPageContent = curl_exec($curlObject);
    $errorNumber = curl_errno($curlObject);
    curl_close($curlObject);
    //-- Curl End --
    // Stop here if the request failed; there is nothing to parse
    if ($errorNumber !== 0 || $webPageContent === false) {
        return null;
    }
    // Create the DOM from the downloaded HTML string
    // (this relies on the modified file_get_html() shown below)
    $html = file_get_html($webPageContent);
    if (!$html) {
        return null;
    }
    // Return the first element that matches the selector
    foreach ($html->find($search) as $element) {
        // print_r($element);
        return (string)$element;
    }
    // No element matched the selector
    return null;
}
// var_dump(crawlUrlElement('http://www.google.com','body'));
var_dump(crawlUrlElement('http://www.google.com','#hplogo'));
?>
You also need to make one small change in 'simple_html_dom.php'...
Rename the parameter $url to $contents, and comment out line 76.
function file_get_html($contents, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    // $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // The rest of the function stays as it is in the library.
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
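Thank you very much. One question, though: why is XPath better? – user3271403
XPath searches your HTML directly with the given condition, just as we searched for the div with class "my". No other div ends up in the result. The filtering is done by XPath itself, so we don't need to process the page manually in PHP. –
@SatishSharma Thanks, dear. Just one small question: if I want to scrape an external HTML page, how do I put that page's address into the code? – user3271403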
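As a rough illustration of the XPath comment above, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom. The helper name findDivsByClass and the sample markup are assumptions made for this example; they are not part of the original answer. For an external page, the only thing that changes is where the HTML string comes from, as the commented-out line shows.

```php
<?php
/**
 * Return the outer HTML of every <div> whose class list contains $class.
 * findDivsByClass is a hypothetical helper name used only in this sketch.
 * Assumes $class is a plain class name without quote characters.
 */
function findDivsByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, so "my" does not also match "mytitle".
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $results = array();
    foreach ($xpath->query($query) as $node) {
        $results[] = $doc->saveHTML($node);
    }
    return $results;
}

// For an external page, fetch the source first and pass it in.
// file_get_contents() accepts a URL when allow_url_fopen is enabled;
// the cURL code from the answer would work just as well.
// $html = file_get_contents('http://www.example.com/');

$html = '<div class="my">wanted</div><div class="other">ignored</div>';
print_r(findDivsByClass($html, 'my'));
```

Only the div with class "my" is returned; the filtering happens inside the XPath expression, so no extra loop over unrelated elements is needed in PHP.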