PHP crawler for a specific HTML element

We have this simple HTML page (for testing):

<html> 
<body> 
<div class="my"> One </div> 
<div class="my"> Two </div> 
<div class="my"> Three </div> 
<div class="other"> NO </div> 
<div class="other2"> NO </div> 
</body> 
</html>

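So I need a very simple piece of PHP scraping code. What I want to grab is "One", "Two" and "Three", collected into a PHP array. I need everything inside the divs with class "my". I do not want the other classes.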

2014-02-28 user3271403

Try this; you can use XPath to get your result:

$html = '<html> 
      <body> 
      <div class="my"> One </div> 
      <div class="my"> Two </div> 
      <div class="my"> Three </div> 
      <div class="other"> NO </div> 
      <div class="other2"> NO </div> 
      </body> 
     </html>'; 

$dom = new DOMDocument(); 
$dom->loadHTML($html); 

$xpath = new DOMXPath($dom); 
$tags = $xpath->query('//div[@class="my"]'); 
foreach ($tags as $tag) { 
    $node_value = trim($tag->nodeValue); 
    echo $node_value."<br/>"; 
}
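
The question asks for the values in a PHP array rather than echoed output. A minimal sketch of that variation follows, assuming the same markup as above; the helper name `collectDivTexts` is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): returns the trimmed
// text of every <div> whose class attribute is exactly $class.
// Note: $class is pasted into the XPath string, so it must not contain quotes.
function collectDivTexts(string $html, string $class): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $texts = [];
    foreach ($xpath->query('//div[@class="' . $class . '"]') as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

$html = '<html><body>
  <div class="my"> One </div>
  <div class="my"> Two </div>
  <div class="my"> Three </div>
  <div class="other"> NO </div>
</body></html>';

$result = collectDivTexts($html, 'my');
print_r($result); // an array holding One, Two, Three
```

Returning an array keeps the extraction separate from the output, so the caller decides whether to echo, store or process the values further.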

2014-02-28 11:02:39

Thank you very much. One question, though: why is XPath better? – user3271403

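XPath searches your HTML directly with the given condition, just as we searched for divs with class "my". No other div ends up in the result. The filtering is done by XPath itself, so we do not need to do it in our own PHP code. –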
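
One caveat worth illustrating: `@class="my"` compares the whole attribute value, so a div with `class="my highlighted"` would not match. The sketch below contrasts the exact comparison with a common token-based XPath 1.0 idiom; the sample markup is made up for the illustration and is not from the question.

```php
<?php
// Made-up sample markup for this illustration.
$html = '<div class="my"> One </div>
<div class="my highlighted"> Two </div>
<div class="mystery"> NO </div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact comparison: matches only when the whole attribute equals "my".
$exact = $xpath->query('//div[@class="my"]');

// Token comparison: pad the class list with spaces, then look for " my ".
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " my ")]'
);

$exactCount = $exact->length;
$tokenCount = $token->length;
echo $exactCount, ' ', $tokenCount, "\n"; // prints "1 2"
```

The token form still excludes `class="mystery"`, because the padded string never contains `" my "` as a separate word.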

@SatishSharma Thank you, dear. Just one small question: if I want to scrape an external HTML page, how do I put that page's address into the code? – user3271403
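
One possible way to do this, sketched under assumptions: `DOMDocument::loadHTMLFile()` accepts a file path, and also an http(s) URL when `allow_url_fopen` is enabled in the PHP configuration. The helper name `collectDivTextsFrom` is hypothetical, and the demo reads a temporary local file so that it runs without network access.

```php
<?php
// Hypothetical helper: $source may be a local path or, if allow_url_fopen
// is enabled, a URL. Returns the trimmed text of each matching <div>.
// Note: $class is pasted into the XPath string, so it must not contain quotes.
function collectDivTextsFrom(string $source, string $class): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $loaded = $dom->loadHTMLFile($source);
    libxml_clear_errors();
    if ($loaded === false) {
        return []; // the page could not be read or parsed
    }

    $xpath = new DOMXPath($dom);
    $texts = [];
    foreach ($xpath->query('//div[@class="' . $class . '"]') as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Demo with a temporary local file standing in for the external page.
$path = tempnam(sys_get_temp_dir(), 'page');
file_put_contents(
    $path,
    '<html><body><div class="my"> One </div><div class="other"> NO </div></body></html>'
);
$texts = collectDivTextsFrom($path, 'my');
print_r($texts);
unlink($path);
```

For a real page, the first argument would be the page address instead of the temporary path. If `allow_url_fopen` is disabled, the HTML would have to be downloaded by other means (for example with cURL) and passed to `loadHTML()` as a string.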

You should make use of the DOMDocument class:

<?php 

$html='<html> 
<body> 
<div class="my"> One </div> 
<div class="my"> Two </div> 
<div class="my"> Three </div> 
<div class="other"> NO </div> 
<div class="other2"> NO </div> 
</body> 
</html>'; 
$dom = new DOMDocument; 
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('div') as $tag) { 
    if ($tag->getAttribute('class') === 'my') { 
     echo $tag->nodeValue; // to get the content in between of tags... 
    } 
}

OUTPUT :

One Two Three

2014-02-28 10:58:38

Thanks, my brother :X – user3271403

A simple solution using simple_html_dom...

<?php /* crawlUrlElement.php */ 
/** 
* Created by PhpStorm. 
* User: [email protected] 
* Date: 15/03/2017 
* Time: 15:01 
*/ 
require("simple_html_dom.php"); 

function crawlUrlElement($url, $search){ 

    $crawlOptions = array(
     CURLOPT_RETURNTRANSFER => true,    // return web page 
     CURLOPT_HEADER   => false,   // don't return headers 
     CURLOPT_FOLLOWLOCATION => true,    // follow redirects 
     CURLOPT_ENCODING  => "",    // handle all encodings 
     CURLOPT_USERAGENT  => "samplebot",  // who am i 
     CURLOPT_AUTOREFERER => true,    // set referer on redirect 
     CURLOPT_CONNECTTIMEOUT => 120,    // timeout on connect 
     CURLOPT_TIMEOUT  => 120,    // timeout on response 
     CURLOPT_MAXREDIRS  => 5,    // stop after 5 redirects 
    ); 

    //-- Curl Start -- 
    $curlObject = curl_init($url); 
    curl_setopt_array($curlObject,$crawlOptions); 
    $webPageContent = curl_exec($curlObject); 
    $errorNumber = curl_errno($curlObject); 
    curl_close($curlObject); 
    //-- Curl End -- 

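    // Stop if the download failed 
    if ($errorNumber !== 0 || $webPageContent === false) { 
     return null; 
    } 

    // Build the DOM from the downloaded HTML string; this relies on the 
    // modified file_get_html() described below 
    $html = file_get_html($webPageContent); 
    // Return the first element that matches $search, as a string 
    foreach($html->find($search) as $element){ 
     return (string)$element; 
    } 
    return null; 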
} 

// var_dump(crawlUrlElement('http://www.google.com','body')); 
var_dump(crawlUrlElement('http://www.google.com','#hplogo')); 

?>

You also need to make a small change in 'simple_html_dom.php'...

Rename the parameter $url to $contents.

Comment out line 76.

function file_get_html($contents, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT) 
{ 
    // We DO force the tags to be terminated. 
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText); 
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done. 
    // $contents = file_get_contents($url, $use_include_path, $context, $offset); 
}

2017-03-15 15:14:49

Thanks @Donald Duck –
