Extracting HTML fields via XPath

I have this query, which extracts posts that have been "liked" more than 5 times.

//div[@class="pin"] 
[.//span[@class = "LikesCount"] 
[substring-before(normalize-space(text())," ") > 5]]

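I want to extract and store additional information such as the title, image URL, number of likes, number of repins, ...

How can I extract it?

Multiple XPath queries?
Digging into the nodes of the resulting posts while iterating over them with PHP and PHP functions?
...

A markup example follows: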

<div class="pin"> 

<p class="description">gorgeous couch <a href="#">#modern</a></p> 

[...] 

<div class="PinHolder"> 
<a href="/pin/56787645270909880/" class="PinImage ImgLink"> 
    <img src="http://media-cache-ec3.pinterest.com/upload/56787645270909880_d7AaHYHA_b.jpg" 
     alt="Krizia" 
     data-componenttype="MODAL_PIN" 
     class="PinImageImg" 
     style="height: 288px;"> 
</a> 
</div> 

<p class="stats colorless"> 
    <span class="LikesCount"> 
     22 likes 
    </span> 
    <span class="RepinsCount"> 
     6 repins 
    </span> 
</p> 

[...] 

</div>

Source

2012-12-17 Andrea Puiatti

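What does the 'best' way mean to you? – hek2mgl

Once I have found the right posts, I don't know which is the best way to extract, store, and organize all this information. –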

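Since you are already using XPath in your code, I suggest extracting this information with XPath as well. Here is an example of how to extract the description.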

<?php 

// will store the posts as assoc arrays 
$mostLikedPostsArr = array(); 

// call your fictional load function 
$doc = load_html('whatever'); 

// create an XPath selector
$selector = new DOMXPath($doc); 

// this is your query from above (with the closing bracket of the outer predicate added)
$query = '//div[@class="pin"][.//span[@class = "LikesCount"][substring-before(normalize-space(text())," ") > 5]]';

// getting the most liked posts 
$mostLikedPosts = $selector->query($query); 

// now iterate through the post nodes 
foreach($mostLikedPosts as $post) { 

    // assoc array for a post 
    $postArr = array(); 

    // you can do 'relative' queries once having a reference to $post 
    // note $post as the second parameter to $selector->query() 

    // let's extract the description, for example
    $result = $selector->query('p[@class = "description"]', $post); 
    // just using nodeValue might be ok for text only nodes. 
    // to properly flatten the <a> tags inside the descriptions 
    // it will take further attention. 
    $postArr['description'] = $result->item(0)->nodeValue; 

    // ... 

    $mostLikedPostsArr []= $postArr; 
}
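
The same relative-query pattern extends to the other fields asked about. What follows is a minimal sketch, not tested against the live site: it assumes the markup matches the sample in the question (with the elided parts left out), loads it from a string via DOMDocument::loadHTML, and uses a hypothetical helper firstValue() that is not part of the answer above.

```php
<?php

// Sample markup taken from the question; the parts elided with [...] are omitted.
$html = <<<'HTML'
<div class="pin">
<p class="description">gorgeous couch <a href="#">#modern</a></p>
<div class="PinHolder">
<a href="/pin/56787645270909880/" class="PinImage ImgLink">
    <img src="http://media-cache-ec3.pinterest.com/upload/56787645270909880_d7AaHYHA_b.jpg"
     alt="Krizia" class="PinImageImg">
</a>
</div>
<p class="stats colorless">
    <span class="LikesCount"> 22 likes </span>
    <span class="RepinsCount"> 6 repins </span>
</p>
</div>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: evaluates $expr relative to $context as an XPath string.
// string() yields '' when nothing matches, so no item(0) access is needed.
function firstValue(DOMXPath $xpath, string $expr, DOMNode $context): string
{
    return trim($xpath->evaluate("string($expr)", $context));
}

// Posts with more than 5 likes (note the closing bracket of the outer predicate).
$query = '//div[@class="pin"][.//span[@class="LikesCount"]'
       . '[substring-before(normalize-space(text()), " ") > 5]]';

$posts = array();
foreach ($xpath->query($query) as $pin) {
    $posts[] = array(
        'description' => firstValue($xpath, 'p[@class="description"]', $pin),
        'url'         => firstValue($xpath, './/a[contains(@class, "PinImage")]/@href', $pin),
        'image'       => firstValue($xpath, './/img[@class="PinImageImg"]/@src', $pin),
        // Take the number in front of the first space, e.g. "22 likes" gives 22.
        'likes'       => (int) firstValue($xpath, 'substring-before(normalize-space(.//span[@class="LikesCount"]), " ")', $pin),
        'repins'      => (int) firstValue($xpath, 'substring-before(normalize-space(.//span[@class="RepinsCount"]), " ")', $pin),
    );
}

print_r($posts);
```

Wrapping each expression in string() flattens the nested <a> inside the description and returns an empty string for a missing field, which may or may not be the behaviour you want when a field is absent.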

Source

2012-12-17 14:18:56 hek2mgl

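Looks cool! But I don't understand how you extract the description this way: where is $node defined? And why isn't the query result assigned to anything? –

Sorry, I hadn't tested the code. Of course $node wasn't defined. ;) Updated it. – hek2mgl

Note that you should not use $result->item(0) without checking $result->length. I just wanted to keep it as simple as possible. – hek2mgl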
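
A guarded lookup along the lines of that last comment could look like the sketch below. The helper name firstNodeValue() and its $default parameter are assumptions made for illustration; they are not part of the original answer.

```php
<?php

// Hypothetical helper: returns the trimmed text of the first node matching a
// relative query, or $default when the query fails or matches nothing.
function firstNodeValue(DOMXPath $selector, string $query, DOMNode $context, $default = null)
{
    $result = $selector->query($query, $context);

    // query() returns false for a malformed expression; length is 0 for no match.
    if ($result === false || $result->length === 0) {
        return $default;
    }

    return trim($result->item(0)->nodeValue);
}

// Small demonstration on an inline fragment.
$doc = new DOMDocument();
$doc->loadHTML('<div class="pin"><p class="description">gorgeous couch</p></div>');
$selector = new DOMXPath($doc);
$post = $selector->query('//div[@class="pin"]')->item(0);

// Present field: the text of the matching node.
$description = firstNodeValue($selector, 'p[@class="description"]', $post);

// Missing field: falls back to the supplied default instead of failing on null.
$missing = firstNodeValue($selector, 'p[@class="no-such-class"]', $post, '');

var_dump($description, $missing);
```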
