2012-12-17 39 views
2

Extracting HTML fields via XPath

I have this query, which extracts the posts that have been "liked" more than 5 times.

//div[@class="pin"] 
[.//span[@class = "LikesCount"] 
[substring-before(normalize-space(text())," ") > 5]] 

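I want to extract and store additional information such as the title, image URL, number of likes, number of repins, ...

How should I extract them?

  • Multiple XPath queries?
  • Digging into the nodes of the resulting posts while iterating with PHP and PHP functions?
  • ...

Sample markup follows: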

<div class="pin"> 

<p class="description">gorgeous couch <a href="#">#modern</a></p> 

[...] 

<div class="PinHolder"> 
<a href="/pin/56787645270909880/" class="PinImage ImgLink"> 
    <img src="http://media-cache-ec3.pinterest.com/upload/56787645270909880_d7AaHYHA_b.jpg" 
     alt="Krizia" 
     data-componenttype="MODAL_PIN" 
     class="PinImageImg" 
     style="height: 288px;"> 
</a> 
</div> 

<p class="stats colorless"> 
    <span class="LikesCount"> 
     22 likes 
    </span> 
    <span class="RepinsCount"> 
     6 repins 
    </span> 
</p> 

[...] 

</div> 
+0

What does the 'best' way mean to you? – hek2mgl

+0

Once I have found the right posts, I don't know which is the best way to extract, store, and organize all this information –

Answers

2

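Since you are already using XPath in your code, I would suggest extracting that information with XPath as well. Here is an example of how to extract the description.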

<?php 

// will store the posts as assoc arrays 
$mostLikedPostsArr = array(); 

// call your fictional load function 
$doc = load_html('whatever'); 

// create an XPath selector 
$selector = new DOMXPath($doc); 

// this is your query from above (with the outer predicate closed) 
$query = '//div[@class="pin"][.//span[@class = "LikesCount"][substring-before(normalize-space(text())," ") > 5]]'; 

// getting the most liked posts 
$mostLikedPosts = $selector->query($query); 

// now iterate through the post nodes 
foreach($mostLikedPosts as $post) { 

    // assoc array for a post 
    $postArr = array(); 

    // you can do 'relative' queries once having a reference to $post 
    // note $post as the second parameter to $selector->query() 

    // let's extract the description, for example 
    $result = $selector->query('p[@class = "description"]', $post); 
    // just using nodeValue might be ok for text only nodes. 
    // to properly flatten the <a> tags inside the descriptions 
    // it will take further attention. 
    $postArr['description'] = $result->item(0)->nodeValue; 

    // ... 

    $mostLikedPostsArr []= $postArr; 
} 
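
The remaining fields can be pulled out in the same way. The following is only a sketch, assuming PHP's bundled DOM extension and the sample markup from the question; the inline `$html` string and the `countFrom()` helper are made up for illustration and are not part of the original answer. It uses `DOMXPath::evaluate()` with XPath string functions, so each field comes back as a scalar instead of a node list.

```php
<?php
// Sample markup modeled on the question (assumption: a real page would be loaded instead).
$html = <<<'HTML'
<div class="pin">
  <p class="description">gorgeous couch <a href="#">#modern</a></p>
  <div class="PinHolder">
    <a href="/pin/56787645270909880/" class="PinImage ImgLink">
      <img src="http://media-cache-ec3.pinterest.com/upload/56787645270909880_d7AaHYHA_b.jpg"
           alt="Krizia" class="PinImageImg">
    </a>
  </div>
  <p class="stats colorless">
    <span class="LikesCount"> 22 likes </span>
    <span class="RepinsCount"> 6 repins </span>
  </p>
</div>
<div class="pin">
  <p class="description">plain chair</p>
  <p class="stats colorless"><span class="LikesCount"> 3 likes </span></p>
</div>
HTML;

// Hypothetical helper: reads the leading number from text such as "22 likes".
function countFrom(DOMXPath $xpath, DOMNode $context, $class)
{
    $expr = 'number(substring-before(normalize-space(.//span[@class="' . $class . '"]), " "))';
    $n = $xpath->evaluate($expr, $context);
    // NAN means the span was missing or did not start with a number
    return is_nan($n) ? null : (int) $n;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The query from the question, with the outer predicate closed.
$query = '//div[@class="pin"][.//span[@class="LikesCount"]'
       . '[substring-before(normalize-space(text()), " ") > 5]]';

$posts = array();
foreach ($xpath->query($query) as $pin) {
    // Passing $pin as the context node makes each expression relative to that pin.
    $posts[] = array(
        'description' => $xpath->evaluate('normalize-space(.//p[@class="description"])', $pin),
        'image'       => $xpath->evaluate('string(.//img[@class="PinImageImg"]/@src)', $pin),
        'likes'       => countFrom($xpath, $pin, 'LikesCount'),
        'repins'      => countFrom($xpath, $pin, 'RepinsCount'),
    );
}

print_r($posts);
```

Because `string()` and `normalize-space()` yield an empty string when nothing matches, a missing field does not raise an error here; whether an empty string or `null` is the better placeholder depends on how the data is stored afterwards.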
+0

Looks cool! But I don't understand how you extract the description this way: where is $node defined? And why isn't the query result assigned? –

+1

Sorry, I hadn't tested the code. Of course $node wasn't defined. ;) Updated it. – hek2mgl

+1

Note that you should not use $result->item(0) without checking $result->length. I just wanted to keep it as simple as possible. – hek2mgl
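
One possible way to add the length check mentioned in the comment above is a small wrapper around the relative query. This is a sketch only: `firstNodeText()` and its `$default` parameter are hypothetical names, and the one-line HTML string is a stand-in for the real page.

```php
<?php
/**
 * Hypothetical helper: returns the trimmed text of the first node matched
 * by a relative XPath query, or $default when nothing matches.
 */
function firstNodeText(DOMXPath $xpath, $query, DOMNode $context, $default = null)
{
    $result = $xpath->query($query, $context);
    // query() returns false for a malformed expression and an empty list for no match
    if ($result === false || $result->length === 0) {
        return $default;
    }
    return trim($result->item(0)->nodeValue);
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="pin"><p class="description"> gorgeous couch </p></div>');
$xpath = new DOMXPath($doc);

$description = null;
$missing = null;
foreach ($xpath->query('//div[@class="pin"]') as $pin) {
    // present in the markup, so the node text is returned
    $description = firstNodeText($xpath, 'p[@class="description"]', $pin);
    // absent from the markup, so the fallback value is returned
    $missing = firstNodeText($xpath, 'p[@class="stats"]', $pin, 'n/a');
}
```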