What is the most efficient way to scrape this data thousands of times?

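Using PHP's DOMDocument->loadHTML(), what is the best way to grab the following data (the 4.0m after the </b> tag)? I'm guessing some kind of CSS-style selector? And what is the most efficient way to do this thousands of times?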

(LINE 240, always 240) <b>Current Price:</b> 4.0m

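I've been looking around the documentation, but to be honest this is all completely foreign to me! Also, how would I go about getting this data for thousands of pages, from URLs such as: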

http://site.com/q=item/viewitem.php?obj=11928

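The minimum and maximum values of obj=# are known (that is how many pages I need to scrape). I'd like to fetch all of this data incrementally and output the name, description and price (I'm not too worried about the percentage rise/drop so far), so that I can take the data and display it on my own website.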

Here is the main block of code I'm interested in:

<div class="subsectionHeader"> 
<h2> 
Item Name 
</h2> 
</div> 
<div id="item_additional" class="inner_brown_box"> 
Description of item goes here. 
<br> 
<br> 
<b>Current Price:</b> 4.0m 
<br><br> 
<b>Change in Price:</b><br> 
<span> 
<b>30 Days:</b> <span class="rise">+2.5%</span> 
</span> 
<span class="spaced_span"> 
<b>90 Days:</b> <span class="drop">-30.4%</span> 
</span> 
<span class="spaced-span"> 
<b>180 Days:</b> <span class="drop">-33.3%</span> 
</span> 
<br class="clear"> 
</div> </div> <div class="brown_box main_page"> 
<div class="subsectionHeader">

If anyone could provide any bare-bones hints on how to go about this, it would be greatly appreciated!

Source

2011-03-14 Noir

Isn't there an RSS feed you could access instead? Scraping is almost universally considered bad form. – 2011-03-14 23:57:11

Possible duplicate of [What's the most efficient way to scrape -> store -> display this information?](http://stackoverflow.com/questions/5305436/whats-the-most-efficient-way-to-scrape-store-display-this-information) – 2011-03-14 23:57:52

You could use the Simple HTML DOM Parser - http://simplehtmldom.sourceforge.net/

Extract the content with:

echo file_get_html('http://www.google.com/')->plaintext;

Then find the 4.0m using PHP's string functions.
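A rough sketch of that string-function step, in case it helps. It does not call Simple HTML DOM at all: strip_tags() is used here only as a stand-in for the ->plaintext output, and extractAfterLabel() is a made-up helper name, so treat both as assumptions rather than the library's API.

```php
<?php
// Stand-in for the ->plaintext output: tags stripped from part of the
// question's sample markup (assumption: plaintext looks roughly like this).
$html = "<b>Current Price:</b> 4.0m \n<br><br>\n<b>Change in Price:</b><br>";
$text = strip_tags($html);

// extractAfterLabel() is a hypothetical helper, not a Simple HTML DOM function.
// It returns the first whitespace-delimited token that follows $label.
function extractAfterLabel($text, $label)
{
    $pos = strpos($text, $label);
    if ($pos === false) {
        return null;                        // label missing from the page
    }
    // Drop everything up to and including the label, then leading spaces.
    $rest = ltrim(substr($text, $pos + strlen($label)));
    // Length of the run of characters before the next whitespace.
    $len = strcspn($rest, " \t\r\n");
    return substr($rest, 0, $len);
}

echo extractAfterLabel($text, 'Current Price:'), "\n";
```

This only works while the label text stays exactly "Current Price:"; a missing label gives null instead of a wrong value.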

Source

2011-03-14 23:58:32 Kit

DOM parsing is the most reliable way to do this.
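For the DOM route, here is a minimal sketch using DOMDocument together with DOMXPath, run against a trimmed copy of the markup from the question. parseItem() is a hypothetical helper name, and the XPath expressions assume the page keeps the class and id attributes shown in the question.

```php
<?php
// Sample markup copied (and shortened) from the question; a real run would
// fetch the page instead.
$html = <<<'HTML'
<div class="subsectionHeader">
<h2>
Item Name
</h2>
</div>
<div id="item_additional" class="inner_brown_box">
Description of item goes here.
<br>
<br>
<b>Current Price:</b> 4.0m
<br><br>
<b>Change in Price:</b><br>
</div>
HTML;

// parseItem() is a hypothetical helper name, not part of any library.
function parseItem($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // scraped HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Name: the <h2> inside the first subsectionHeader div.
    $name = $xpath->evaluate('string(//div[@class="subsectionHeader"]/h2)');
    // Description: first text node directly inside #item_additional.
    $description = $xpath->evaluate(
        'string(//div[@id="item_additional"]/text()[1])'
    );
    // Price: the text node right after the <b>Current Price:</b> label.
    $price = $xpath->evaluate(
        'string(//div[@id="item_additional"]'
        . '/b[normalize-space()="Current Price:"]'
        . '/following-sibling::text()[1])'
    );

    return array(
        'name'        => trim($name),
        'description' => trim($description),
        'price'       => trim($price),
    );
}

print_r(parseItem($html));
```

Anchoring on the label text rather than on "line 240" means the price is still found if lines are added or removed above it.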

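If you want the fastest way, and you know the structure of the HTML is consistent, it will probably be quicker to search for offsets with strpos. It is likely to break if the page structure changes, though. Something like this: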

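$needles = array(
    'name' => "<div class=\"subsectionHeader\">\n<h2>\n",
    'description' => "<div id=\"item_additional\" class=\"inner_brown_box\">\n",
    'price' => "<b>Current Price:</b> ",
);
$buffer = file_get_contents("http://site.com/q=item/viewitem.php?obj=1234");
$result = array();
foreach ($needles as $key => $needle) {
    $index1 = strpos($buffer, $needle);
    if ($index1 === false) {
        continue; // needle not found; the page layout may have changed
    }
    // The value starts after the needle and runs to the end of that line.
    $start = $index1 + strlen($needle);
    $index2 = strpos($buffer, "\n", $start);
    if ($index2 === false) {
        $index2 = strlen($buffer);
    }
    $result[$key] = trim(substr($buffer, $start, $index2 - $start));
}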

You need to get the needles exactly right, including any trailing whitespace.
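To cover the "thousands of pages" part of the question, the same kind of extraction can be wrapped in a loop over the known obj range. This is only a sketch: scrapeRange() and its parameters are invented names, the fetcher and parser are passed in as callables so the loop can be tried offline, and the delay is an assumption about being polite to the remote server.

```php
<?php
// scrapeRange() is a hypothetical helper for illustration.
// $fetch: callable taking a URL, returning HTML or false on failure.
// $parse: callable taking HTML, returning an array of extracted fields.
function scrapeRange($minId, $maxId, $fetch, $parse, $delayMicroseconds = 0)
{
    $items = array();
    for ($id = $minId; $id <= $maxId; $id++) {
        $url = 'http://site.com/q=item/viewitem.php?obj=' . $id;
        $html = call_user_func($fetch, $url);
        if ($html === false) {
            continue;                       // skip pages that fail to load
        }
        $items[$id] = call_user_func($parse, $html);
        if ($delayMicroseconds > 0) {
            usleep($delayMicroseconds);     // pause between requests
        }
    }
    return $items;
}

// Offline demonstration: a fake fetcher that returns the same snippet
// for every URL, so no network access is needed.
$fakeFetch = function ($url) {
    return "<b>Current Price:</b> 4.0m \n";
};

// A small strpos-based parser for the price only.
$parsePrice = function ($html) {
    $needle = '<b>Current Price:</b> ';
    $start = strpos($html, $needle);
    if ($start === false) {
        return array('price' => null);
    }
    $start += strlen($needle);
    $end = strpos($html, "\n", $start);
    if ($end === false) {
        $end = strlen($html);
    }
    return array('price' => trim(substr($html, $start, $end - $start)));
};

print_r(scrapeRange(11928, 11930, $fakeFetch, $parsePrice));
```

For a real run, 'file_get_contents' could be passed as $fetch with a non-zero delay. It would probably make sense to save the results somewhere and display them from there, rather than scraping again on every page view.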

Source

2011-03-15 00:09:38

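Parsing HTML with regular expressions is usually a bad idea, but in your case it may, in my opinion, be the right and simple approach. It is fast, and probably more flexible than chunking and plain-text matching.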

Try this example with the HTML source given above:

//checked with php 5.3.3 
if (preg_match('#<h2>(?P<itemName>[^>]+)</h2>.*?<div[^>]+id=([\'"])item_additional(\2)[^>]*>\s*(?P<description>[^<]+).*?<b>\s*Current\s+Price\s?:?</b>\s*(?P<price>[^<]+)#six',$src, $matches)) 
{ 
    print_r($matches); 
}

Regular expressions may look too complicated, but with the documentation and nice tools such as RegexBuddy or Expresso, anyone can write simple ones ;)

Source

2011-03-15 03:17:49 pietrovich
