PHP crawler (scraping a single website)

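I am working on a crawler project and need some help; this is my first project. The task is to fetch data from 'http://justdial.com'. For example, I want to get the city name (Bangalore), the category (hotels), the hotel name, the address and the phone number.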

I have written code that extracts the content of a tag by its "id"; for example, this is how I get the address:

<?php 

$url = "http://www.justdial.com/Bangalore/hotels";
$original_file = file_get_contents($url);
$stripped_file = strip_tags($original_file, "<div>"); // keep only the <div> tags

// Capture the contents of each <div class="logoDesc">
$newlines = '#<div class="logoDesc">(.*?)</div>#si';

preg_match_all($newlines, $stripped_file, $matches);


// DEBUGGING

    // $matches[0] now contains the complete <div> elements; ex: <div class="logoDesc">text</div>
    // $matches[1] now contains only the text inside each <div>; ex: text

header("Content-type: text/plain"); // Set the content type to plain text so the print below is easy to read!
$path = $matches;

print_r($path); // View the array to see if it worked
?>

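Now the problem is that I want to separate the content from the tags and store it in a database, and then export it from the database to an Excel sheet. Please help me.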


2012-10-03 user1716393

Do you mean 'strip_tags()'? –

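What does $path contain? Please show us a dump. Have you tried any database code? Does the data have to go database -> Excel, or could the Excel sheet be generated at the same time? Does it have to be xls, or would csv be enough? – LeonardChallis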

Do you mean [MySQL](http://php.net/mysqli) and [fputcsv](http://php.net/fputcsv)? – Touki

You should not use regular expressions to parse HTML. You should use something like DomDocument. A small example of it in use:

<?php 
    $str = '<h1>T1</h1>Lorem ipsum.<h1>T2</h1>The quick red fox...<h1>T3</h1>... jumps over the lazy brown FROG'; 
    $DOM = new DOMDocument; 
    $DOM->loadHTML($str); 

    //get all H1 
    $items = $DOM->getElementsByTagName('h1'); 

    //display all H1 text 
    for ($i = 0; $i < $items->length; $i++) 
     echo $items->item($i)->nodeValue . "<br/>"; 
?>
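
Applied to the markup from the question, the same approach might look like the sketch below. It assumes the addresses sit in `<div class="logoDesc">` elements, as the question's regex implies. The `extractByClass` helper and the sample HTML string are hypothetical, made up for illustration; a real page would be loaded with `file_get_contents` first.

```php
<?php
// Hypothetical helper: returns the trimmed text of every <div> whose
// class attribute contains the given class name.
function extractByClass(string $html, string $class): array
{
    $dom = new DOMDocument;
    // Real pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match the class name as a whole word inside the class attribute.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $results = [];
    foreach ($xpath->query($query) as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Made-up sample standing in for the downloaded page.
$html = '<div class="logoDesc">12 MG Road, Bangalore</div>'
      . '<div class="other">ignored</div>'
      . '<div class="logoDesc">45 Brigade Road, Bangalore</div>';

$addresses = extractByClass($html, 'logoDesc');
print_r($addresses);
```

Unlike the regex, the XPath query does not depend on the exact attribute order or quoting in the tag, and it still matches when the element carries additional classes.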


2012-10-03 08:12:47

I have used an HTML parser to parse the content in PHP. This is the code. – user1716393

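Hello @wayne, I have included an HTML parser to parse the content in PHP. I don't want to use a database; I want to use Notepad. As the script runs against the 'justdial.com' page, the data must be stored in Notepad, and then moved from Notepad into an Excel sheet. – user1716393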
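
One way to get a file that opens in both Notepad and Excel might be plain CSV, written with `fputcsv` as suggested in the comments above. The following is only a minimal sketch, assuming the scraped values are already collected in an array. The `writeCsv` helper, the column names, the file name and the sample rows are all hypothetical.

```php
<?php
// Hypothetical helper: writes a header row followed by data rows to a CSV file.
function writeCsv(string $path, array $header, array $rows): void
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    // Separator, enclosure and escape are passed explicitly so the call
    // behaves the same way across PHP versions.
    fputcsv($handle, $header, ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($handle, $row, ',', '"', '\\');
    }
    fclose($handle);
}

// Made-up rows standing in for the scraped data.
$rows = [
    ['Bangalore', 'Hotels', 'Example Hotel', '12 MG Road', '080-0000000'],
    ['Bangalore', 'Hotels', 'Sample Inn', '45 Brigade Road', '080-1111111'],
];

$file = sys_get_temp_dir() . '/hotels.csv';
writeCsv($file, ['City', 'Category', 'Name', 'Address', 'Phone'], $rows);

echo file_get_contents($file);
```

The resulting file is ordinary text, so it can be inspected in Notepad and opened directly in Excel without going through a database.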
