2015-07-10 101 views
2

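I'm trying to scrape prices from THIS page, and I want to use preg_match to extract the price from this element: <span class="price"><b>519,00&nbsp;€</b></span>. What is the correct preg_match pattern to scan these prices?

This is my extraction script: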

<?php 
echo "funziona!"; 

    if(!$fp = fopen("https://www.google.it/webhp?sourceid=chrome-instant&ion=1&espv=2&es_th=1&ie=UTF-8#tbs=vw:l,mr:1&tbm=shop&q=samsung+galaxy+note+4&tbas=0" ,"r")) { 
     return false; 
    } //our fopen is right, so let's go 
    $content = ""; 

    while(!feof($fp)) { //while it is not the last line, we will add the current line to our $content 
     $content .= fgets($fp, 1024); 
    } 
    fclose($fp); //we are done here, don't need the main source anymore 
?> 

<?php 
//our fopen, fgets here 

//our magic regex here 
preg_match_all('/<span class=\"price">(.*?)<\/span>/s',$content, $prices); //THIS IS PREG_MATCH 
    echo $prices[0][0]."<br />"; 
?> 

I have never used preg_match, and I'm struggling to adapt this script.
Thanks.

+0

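What happens with the current code? You don't need to escape the double quotes (`"`). You also want the first index, not the zero index, of $prices. – chris85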
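To illustrate the two points in this comment, here is a minimal sketch run against a hard-coded string rather than the live page. The `$content` sample is made up from the markup quoted in the question, and the `strip_tags` / `html_entity_decode` cleanup is an addition that is not in the original script.

```php
<?php
// Made-up sample, modelled on the element quoted in the question.
$content = '<span class="price"><b>519,00&nbsp;€</b></span>'
         . '<span class="price"><b>515,00&nbsp;€</b></span>';

// No backslash is needed before a double quote inside a single-quoted pattern.
preg_match_all('/<span class="price">(.*?)<\/span>/s', $content, $prices);

// $prices[0] holds the full matches (including the <span> tags);
// $prices[1] holds the first capture group, i.e. the inner HTML.
foreach ($prices[1] as $inner) {
    // Strip the <b> tags and decode &nbsp; to get plain text.
    echo html_entity_decode(strip_tags($inner), ENT_QUOTES, 'UTF-8'), "\n";
}
```

This only shows how the match array is laid out; whether the fetched page actually contains this markup is a separate question.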

+0

This should print the prices from the web page, but there is an error. The full code is in this guide: http://www.1stwebdesigner.com/php-crawler-tutorial/ – leofabri

+2

There is no "correct" preg. regexes + html = bad idea. Use a DOM parser. –

Answers

1

Take a look at this:

<?php 
function getUrl($Url,$Options = array(),&$optOut = array()) 
{ 

    $CURL_DEFAULT_SETTINGS = array 
    (
     CURLOPT_FOLLOWLOCATION => true, 
     CURLOPT_AUTOREFERER => true, 
     CURLOPT_RETURNTRANSFER => true, 
     CURLOPT_CONNECTTIMEOUT => 10, 
     CURLOPT_MAXREDIRS => 10, 
     CURLOPT_TIMEOUT => 10, 
     CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8' 
    ); 

    if (!($ch = curl_init($Url))) 
     throw new Exception("Couldn't initialize cURL library",100); 

    if (is_array($CURL_DEFAULT_SETTINGS) && count($CURL_DEFAULT_SETTINGS) > 0) 
     curl_setopt_array($ch,$CURL_DEFAULT_SETTINGS); 

    if (is_array($Options) && count($Options) > 0) 
    { 
     foreach ($Options as $k => $v) 
     { 
      curl_setopt($ch,$k,$v); 
     } 
    } 

    $Data = curl_exec($ch); 
    $Error = curl_error($ch); 

    $optOut['CURLINFO_HEADER_OUT'] = curl_getinfo($ch, CURLINFO_HEADER_OUT); 

    curl_close($ch); 

    if (!$Data) 
    { 
     if ($Error) 
      throw new Exception($Error); 

     return false; 
    } 

    return $Data; 
} 

function getPriceFor($query) { 
    $data = getUrl('https://www.google.it/search?tbs=vw:l,mr:1&tbm=shop&q='.rawurlencode($query).'&tbas=0&bav=on.2,or.&cad=b&fp=6a24b60e09fe0b18&biw=1196&bih=703&dpr=2&ion=1&espv=2&tch=1&ech=1&psi=byWgVee9A4TNeIXRgLAK.1436558704099.3'); 
    $data = '['.preg_replace('/\/\*""\*\//msi',',',preg_replace('/\/\*""\*\/[\s]*$/msi','',$data)).']'; 
    $data = json_decode($data,true); 
    preg_match_all('/<div[\s]+class="_OA"><div><b>([^<]+)[\s]*<\/b><\/div><div>([^<]+)<\/div><\/div>/msi',$data[3]['d'],$res); 

    $re = array(); 

    foreach ($res[1] as $k=>$r) 
     $re[] = array('price'=>$r,'from'=>$res[2][$k]); 

    return $re; 
} 

print_r(getPriceFor('samsung galaxy note 4')); 

That should display something like this:

Array 
(
    [0] => Array 
     (
      [price] => 515,00 € 
      [from] => phoneshopping.it 
     ) 

    [1] => Array 
     (
      [price] => 519,00 € 
      [from] => Smartyrama 
     ) 

    [2] => Array 
     (
      [price] => 519,00 € 
      [from] => Smartyrama 
     ) 

    [3] => Array 
     (
      [price] => 519,00 € 
      [from] => Smartyrama 
     ) 

    [4] => Array 
     (
      [price] => 690,45 € 
      [from] => Amazon.it - Seller 
     ) 

    [5] => Array 
     (
      [price] => 673,99 € 
      [from] => da 2 negozi 
     ) 

    [6] => Array 
     (
      [price] => 345,00 € 
      [from] => da 2 negozi 
     ) 

    [7] => Array 
     (
      [price] => 342,00 € 
      [from] => Amazon.it - Seller 
     ) 

    [8] => Array 
     (
      [price] => 699,99 € 
      [from] => ePRICE.it 
     ) 

    [9] => Array 
     (
      [price] => 730,00 € 
      [from] => in oltre 5 negozi 
     ) 

    [10] => Array 
     (
      [price] => 20,00 € 
      [from] => Amazon.it - Seller 
     ) 

    [11] => Array 
     (
      [price] => 208,99 € 
      [from] => eGlobal Central Italia 
     ) 

    [12] => Array 
     (
      [price] => 711,00 € 
      [from] => in oltre 5 negozi 
     ) 

    [13] => Array 
     (
      [price] => 322,99 € 
      [from] => eGlobal Central Italia 
     ) 

    [14] => Array 
     (
      [price] => 40,09 € 
      [from] => da 4 negozi 
     ) 

    [15] => Array 
     (
      [price] => 15,99 € 
      [from] => acadattatore.com 
     ) 

    [16] => Array 
     (
      [price] => 339,99 € 
      [from] => ePRICE.it 
     ) 

    [17] => Array 
     (
      [price] => 412,90 € 
      [from] => da 3 negozi 
     ) 

    [18] => Array 
     (
      [price] => 343,33 € 
      [from] => Amazon.it - Seller 
     ) 

    [19] => Array 
     (
      [price] => 629,00 € 
      [from] => BestPriceStore 
     ) 

) 
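The least obvious step in `getPriceFor` is the pair of `preg_replace` calls before `json_decode`. Judging from the code, the response is assumed to be a series of JSON objects separated by the marker `/*""*/`, which the code joins into one JSON array. The sketch below reproduces that step on made-up data; `decodeChunkedJson` and the sample string are hypothetical, and the real response format may differ or may have changed.

```php
<?php
// Hypothetical helper: turn a response made of JSON objects separated by
// the marker /*""*/ into one PHP array.
function decodeChunkedJson($raw)
{
    // Drop the trailing marker (and any whitespace after it).
    $raw = preg_replace('/\/\*""\*\/\s*$/', '', $raw);
    // Turn the remaining markers into commas and wrap in [] to form a JSON array.
    $json = '[' . str_replace('/*""*/', ',', $raw) . ']';
    return json_decode($json, true);
}

// Made-up sample data in the assumed format.
$raw = '{"e":"abc"}/*""*/{"d":"<div class=\"_OA\">515,00</div>"}/*""*/';
$chunks = decodeChunkedJson($raw);
echo $chunks[1]['d'], "\n";
```

This approach breaks if the marker ever appears inside a JSON string value, so it is only suitable for a quick script.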
+0

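Thank you and chris, I really appreciate your support. tin, when I try your code, I get this error: Fatal error: Uncaught exception 'Exception' with message 'SSL certificate problem: unable to get local issuer certificate' in C:\xampp\htdocs\index.php:41 Stack trace: #0 C:\xampp\htdocs\index.php(50): getUrl('https://www.goo...') #1 C:\xampp\htdocs\index.php(63): getPriceFor('samsung galaxy ...') #2 {main} thrown in C:\xampp\htdocs\index.php on line 41 – leofabri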

+1

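I see you are using Windows. You have to set up an SSL certificate for curl, or use file_get_contents instead of the getUrl function I call. I can give you further instructions in a few hours. – tin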
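One way to set up the certificate for curl, sketched here under assumptions: download a CA bundle file (for example the `cacert.pem` published by the curl project) and pass its path through the `$Options` parameter of `getUrl`. The helper name `caBundleOptions` and the file path are placeholders, not part of the answer.

```php
<?php
// Hypothetical helper: build cURL options that point at a CA bundle file.
function caBundleOptions($caFile)
{
    if (!is_readable($caFile)) {
        throw new Exception("CA bundle not readable: $caFile");
    }
    return array(
        CURLOPT_SSL_VERIFYPEER => true,   // keep certificate verification on
        CURLOPT_CAINFO         => $caFile // file holding the trusted CA certificates
    );
}

// Usage with the getUrl() function from the answer (the path is a placeholder):
// $html = getUrl($url, caBundleOptions('C:\xampp\cacert.pem'));
```

Alternatively, the `curl.cainfo` directive in php.ini can point at the same file, so no code change is needed.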

+0

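Oh, your program works fine on Ubuntu. Yes, I initially ran your code on a XAMPP machine, but since I want to use it on an Ubuntu-based machine, I don't need to modify the code for Windows. I really appreciate your support; you have been clear and direct. – leofabri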

1

You should use a parser, not regular expressions, for this task. Here is an example of how it can be done with simple html dom parser.

include_once 'simple_html_dom.php'; 
$html = file_get_html('http://www.example.com'); 
foreach($html->find('span') as $element) { 
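    // strpos() returns 0 when the class starts with "price", so compare with false explicitly
    if(strpos($element->class, 'price') !== false){ 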
     echo $element->innertext . "\n"; 
    } 
} 

This is also a fairly loose check, so you may get more results than you want. It only checks that the span's class contains the word price.
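If the loose check turns out to match too much, a stricter variant is possible. The sketch below swaps in PHP's built-in DOM extension (`DOMDocument` and `DOMXPath`) instead of simple html dom, and matches only spans whose class list contains the exact token `price`. The function name `extractPrices`, the sample markup, and the assumption that the page is UTF-8 are all illustrative.

```php
<?php
// Sketch: extract prices with PHP's built-in DOM extension instead of a regex.
function extractPrices($html)
{
    $dom = new DOMDocument();
    // Collect warnings about imperfect real-world markup instead of printing them.
    libxml_use_internal_errors(true);
    // The meta prefix tells libxml the input is UTF-8 (an assumption about the page).
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match <span> elements whose class list contains the exact token "price",
    // so that a class such as "pricebox" is not matched.
    $query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' price ')]";

    $prices = array();
    foreach ($xpath->query($query) as $node) {
        // Replace the non-breaking space (U+00A0 in UTF-8) with a regular space.
        $prices[] = trim(str_replace("\xC2\xA0", ' ', $node->textContent));
    }
    return $prices;
}

// Made-up sample based on the element quoted in the question.
$sample = '<div><span class="price"><b>519,00&nbsp;€</b></span>'
        . '<span class="pricebox">not a price</span>'
        . '<span class="big price"><b>20,00&nbsp;€</b></span></div>';

print_r(extractPrices($sample));
```

The XPath expression is the usual idiom for matching one class token; it is more verbose than a substring test but skips lookalike class names.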

http://simplehtmldom.sourceforge.net/manual.htm#section_quickstart

For other approaches, see How do you parse and process HTML/XML in PHP?