2014-02-28 55 views
3

PHP crawler for a specific HTML element

We have this simple HTML page (for testing!):

<html> 
<body> 
<div class="my"> One </div> 
<div class="my"> Two </div> 
<div class="my"> Three </div> 
<div class="other"> NO </div> 
<div class="other2"> NO </div> 
</body> 
</html> 

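So I need a very simple PHP scraper. What I want to grab is "One", "Two", "Three", collected into a PHP array. I need everything inside elements with the class "my". I don't want the other classes.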

Answers

6

Try this. You can use XPath to get your result:

$html = '<html> 
      <body> 
      <div class="my"> One </div> 
      <div class="my"> Two </div> 
      <div class="my"> Three </div> 
      <div class="other"> NO </div> 
      <div class="other2"> NO </div> 
      </body> 
     </html>'; 

$dom = new DOMDocument(); 
$dom->loadHTML($html); 

$xpath = new DOMXPath($dom); 
$tags = $xpath->query('//div[@class="my"]'); 
foreach ($tags as $tag) { 
    $node_value = trim($tag->nodeValue); 
    echo $node_value."<br/>"; 
} 
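The question asked for the values in a PHP array rather than echoed output. A minimal sketch of that variation follows, using the same XPath query; the helper name `textsByClass` is made up for illustration and is not part of the answer.

```php
<?php
// Hypothetical helper (not from the answer): returns the trimmed text of
// every <div> whose class attribute is exactly $class.
function textsByClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    $texts = [];
    // Assumption: $class contains no double quote, because it is pasted
    // straight into the XPath expression.
    foreach ($xpath->query('//div[@class="' . $class . '"]') as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

$html = '<html><body>
  <div class="my"> One </div>
  <div class="my"> Two </div>
  <div class="my"> Three </div>
  <div class="other"> NO </div>
  <div class="other2"> NO </div>
</body></html>';

$result = textsByClass($html, 'my');
print_r($result); // array with "One", "Two", "Three"
```

Returning an array instead of echoing keeps the extraction separate from the output, so the caller can decide what to do with the values.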
+0

Thank you very much. One question, though: why is XPath better? – user3271403

+1

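XPath searches your HTML directly with the given condition, just as we search here for div elements with class "my". No other div ends up in the result. The filtering is done by XPath itself, so we don't need to do it by hand in PHP. –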
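To illustrate how the condition lives inside the query, here is a small sketch comparing two predicates. The sample markup (including the `highlight` and `mystery` classes) is invented for this example. Note that `@class="my"` compares the whole attribute value, so an element with `class="my highlight"` is not matched; the second expression is a common XPath 1.0 idiom for matching a single class token.

```php
<?php
$html = '<html><body>
  <div class="my"> One </div>
  <div class="my highlight"> Two </div>
  <div class="mystery"> NO </div>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collects the trimmed text of every node matched by an XPath expression.
$collect = function (string $expr) use ($xpath): array {
    $out = [];
    foreach ($xpath->query($expr) as $node) {
        $out[] = trim($node->nodeValue);
    }
    return $out;
};

// Exact comparison of the whole class attribute.
$exact = $collect('//div[@class="my"]');

// Token match: pad the normalized class list with spaces, then look for " my ".
$token = $collect('//div[contains(concat(" ", normalize-space(@class), " "), " my ")]');

print_r($exact); // only "One"
print_r($token); // "One" and "Two", but not the "mystery" div
```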

+0

@SatishSharma Thank you. Just one small question: if I want to scrape an external HTML page, how do I put that page's address into the code? – user3271403
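The thread does not answer this follow-up, so the following is only a sketch of one possible approach: read the page with `file_get_contents()` (which accepts http(s) URLs when `allow_url_fopen` is enabled) and pass the result to `loadHTML()`. `DOMDocument::loadHTMLFile()` is another option. The helper name `scrapeDivsByClass` and the example URL are placeholders, not something from the answers.

```php
<?php
// Hypothetical helper: $source may be a URL or a local file path.
function scrapeDivsByClass(string $source, string $class): array
{
    // Reading a URL this way requires allow_url_fopen to be enabled.
    $html = @file_get_contents($source);
    if ($html === false || $html === '') {
        return [];
    }

    // Real-world pages are often not valid HTML, so collect libxml
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = [];
    // Assumption: $class contains no double quote.
    foreach ($xpath->query('//div[@class="' . $class . '"]') as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Usage (placeholder URL):
// print_r(scrapeDivsByClass('http://example.com/page.html', 'my'));
```

The function returns an empty array when the page cannot be read, so the caller does not have to deal with parser errors on an empty document.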

3

You should make use of DOMDocument:

<?php 

$html='<html> 
<body> 
<div class="my"> One </div> 
<div class="my"> Two </div> 
<div class="my"> Three </div> 
<div class="other"> NO </div> 
<div class="other2"> NO </div> 
</body> 
</html>'; 
$dom = new DOMDocument; 
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('div') as $tag) { 
    if ($tag->getAttribute('class') === 'my') { 
        echo $tag->nodeValue; // text content between the tags
    } 
} 

OUTPUT :

One Two Three 
+2

Thanks, brother :X – user3271403

1

A simple solution using simple_html_dom ...

<?php /* crawlUrlElement.php */
/**
 * Created by PhpStorm.
 * Date: 15/03/2017
 * Time: 15:01
 */
require("simple_html_dom.php");

function crawlUrlElement($url, $search){

    $crawlOptions = array(
        CURLOPT_RETURNTRANSFER => true,        // return the web page as a string
        CURLOPT_HEADER         => false,       // don't return headers
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_ENCODING       => "",          // accept all supported encodings
        CURLOPT_USERAGENT      => "samplebot", // who am I
        CURLOPT_AUTOREFERER    => true,        // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,         // timeout on connect (seconds)
        CURLOPT_TIMEOUT        => 120,         // timeout on response (seconds)
        CURLOPT_MAXREDIRS      => 5,           // stop after 5 redirects
    );

    //-- Curl Start --
    $curlObject = curl_init($url);
    curl_setopt_array($curlObject, $crawlOptions);
    $webPageContent = curl_exec($curlObject);
    $errorNumber = curl_errno($curlObject);
    curl_close($curlObject);
    //-- Curl End --

    // Stop here if the request failed
    if ($errorNumber !== 0 || $webPageContent === false) {
        return null;
    }

    // Create a DOM from the downloaded string
    // (relies on the modified file_get_html() described below)
    $html = file_get_html($webPageContent);
    // Return the first element that matches $search
    foreach ($html->find($search) as $element) {
        // print_r($element);
        return (string)$element;
    }
    return null; // nothing matched
}

// var_dump(crawlUrlElement('http://www.google.com','body'));
var_dump(crawlUrlElement('http://www.google.com','#hplogo'));

?>

And you need to make a small change in 'simple_html_dom.php' ...

  1. Rename the parameter $url to $contents.

  2. Comment out line 76.

    function file_get_html($contents, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT) 
    { 
        // We DO force the tags to be terminated. 
        $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText); 
        // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done. 
        // $contents = file_get_contents($url, $use_include_path, $context, $offset); 
    } 
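If editing the library is not desirable, note that simple_html_dom also provides `str_get_html()`, which parses an HTML string directly. The same fetch-then-parse idea can also be sketched with PHP's built-in extensions only, as below. This is an alternative to the answer's method, not a reproduction of it: the helper names `fetchPage` and `firstMatchHtml` are hypothetical, the timeout value is arbitrary, and the search is an XPath expression rather than a CSS selector.

```php
<?php
// Hypothetical helper: downloads a page with cURL; returns null on failure.
function fetchPage(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 5,    // stop after 5 redirects
        CURLOPT_TIMEOUT        => 30,   // seconds; arbitrary value for this sketch
    ]);
    $body = curl_exec($curl);
    return is_string($body) ? $body : null;
}

// Hypothetical helper: returns the outer HTML of the first node matching
// the XPath expression, or null when nothing matches.
// Assumption: $html is a non-empty string.
function firstMatchHtml(string $html, string $expr): ?string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($dom))->query($expr);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $dom->saveHTML($nodes->item(0));
}

// Usage (requires network access):
// $page = fetchPage('http://www.google.com');
// if ($page !== null) {
//     var_dump(firstMatchHtml($page, '//*[@id="hplogo"]'));
// }
```

Keeping the download and the parsing in separate functions means the parsing step can be tried on a local HTML string without any network access.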
    
+0

Thanks @Donald Duck –