Extracting text from leaf nodes using DomDocument in PHP

I use PHP to retrieve various web pages and load them into a DomDocument, but I am having trouble extracting text from the leaf nodes only.

For example, suppose I have the following:

<html> 
    <body> 
     <div class="this_is_our_div_of_interest"> 
      <div> 
       <div> 
        <p>Some text</p> 
        <div>Some <a href='#'>more</a> text</div> 
        <p>And <span><strong>another</strong></span> paragraph</p> 
       </div> 
       <p>Yay<p> 
      </div> 
      <div> 
       <h4>abcd</ph4> 
       xyz 
      <div> 
     </div> 
     <div class="we_do_not_want_those_divs"> 
      <p>This text is not important to us</p> 
     </div> 
    </body> 
</html>

As you can see, this is messy input, but the expected echoed output is:

Some text 
Some more text 
And another paragraph 
Yay 
abcd 
xyz

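Notes on the output:

I only retrieve output from a specific tag (in our example, this_is_our_div_of_interest).
The tree shown above is not a fixed format, because it comes from web pages whose content I do not control. However, I only want the content of tags such as div and p that appear to be leaf nodes.
Some tags need to be ignored, for example a, span, and strong (others may be added to the list).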

UPDATE: I use XPath to get to the class. For example, the following line of code returns all descendants as separate nodes:

$nodes = $xpath->query("//div[@class='this_is_our_div_of_interest']/descendant::*");
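
To make the behaviour of that query concrete, here is a minimal sketch. The sample markup is a shortened, hypothetical stand-in for the fetched page, and `libxml_use_internal_errors()` is only there to keep parser warnings about messy HTML quiet. It shows that `descendant::*` returns every element, including inline ones such as `a`, which is why it does not isolate leaf text on its own.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = <<<'HTML'
<html><body>
<div class="this_is_our_div_of_interest">
  <div><p>Some text</p><div>Some <a href="#">more</a> text</div></div>
</div>
<div class="we_do_not_want_those_divs"><p>Not important</p></div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // suppress warnings about imperfect HTML
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//div[@class='this_is_our_div_of_interest']/descendant::*");

// Collect the element names in document order.
$names = [];
foreach ($nodes as $node) {
    $names[] = $node->nodeName;
}
echo implode(', ', $names), "\n";
```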

2013-07-14 Greeso

You can do this:

$dom = new DOMDocument();
$dom->loadHTMLFile('file.html');
// Look the element up by its id attribute
$id = $dom->getElementById('youNeedAnIdForThis');

Now you can access $id.

Unfortunately there is no built-in getElementsByClassName, but I found one at http://pastebin.com/4qYMEGqV. Your code would then look like:

$dom = new DOMDocument();
$dom->loadHTMLFile('file.html');
// getElementsByClassName() is the function from the pastebin link, not a built-in
$class = getElementsByClassName($dom, 'this_is_our_div_of_interest');

$class[0] should now hold what you are looking for.
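
If the pastebin function is not available, a class lookup can also be sketched with DOMXPath. The helper below is hypothetical (the name `findByClass` and its behaviour are assumptions, and it is not the pastebin code). It matches the class name as a whole word inside the `class` attribute and assumes the name contains no quote characters.

```php
<?php
/**
 * Hypothetical helper (not the pastebin function): returns the elements
 * whose class attribute contains $className as a whole word.
 */
function findByClass(DOMDocument $dom, string $className): array
{
    $xpath = new DOMXPath($dom);
    $expr = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $className   // assumed to contain no quote characters
    );
    return iterator_to_array($xpath->query($expr));
}

libxml_use_internal_errors(true);   // suppress warnings about imperfect HTML
$dom = new DOMDocument();
$dom->loadHTML(
    '<div class="a this_is_our_div_of_interest"><p>Some text</p></div>'
    . '<div class="other"><p>Skip</p></div>'
);
libxml_clear_errors();

$class = findByClass($dom, 'this_is_our_div_of_interest');
echo trim($class[0]->textContent), "\n";
```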

Then you should probably use strip_tags() if you just want the text.
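
As a rough illustration of the strip_tags() idea, the sketch below serialises one node back to HTML with `DOMDocument::saveHTML()` and strips the markup. The sample markup is hypothetical. Note that this flattens everything into one string, so it does not by itself separate the text of different blocks onto different lines.

```php
<?php
libxml_use_internal_errors(true);   // suppress warnings about imperfect HTML
$dom = new DOMDocument();
$dom->loadHTML('<div class="x"><p>Some <a href="#">more</a> text</p></div>');
libxml_clear_errors();

// Take the first div, turn it back into an HTML string, then drop the tags.
$node = $dom->getElementsByTagName('div')->item(0);
$text = trim(strip_tags($dom->saveHTML($node)));

echo $text, "\n";
```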

Maybe also take a look at DOMNode: http://www.php.net/manual/en/class.domnode.php#domnode.props.childnodes
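
One possible way to use `childNodes` for this, offered as a sketch rather than a definitive solution: walk the tree recursively, keep the text of inline tags (here assumed to be `a`, `span`, and `strong`) on the current line, and start a new line at every other element. The function names `collectLines` and `flushLine`, the `INLINE_TAGS` list, and the cleaned-up sample markup are all assumptions made for illustration.

```php
<?php
// Tags treated as inline: their text stays on the line of the enclosing block.
const INLINE_TAGS = ['a', 'span', 'strong'];

// Append the pending text as one line (if non-empty) and reset the buffer.
function flushLine(array &$lines, string &$current): void
{
    $text = trim(preg_replace('/\s+/', ' ', $current));
    if ($text !== '') {
        $lines[] = $text;
    }
    $current = '';
}

// Walk the children of $node, starting a new line at every non-inline element.
function collectLines(DOMNode $node, array &$lines, string &$current): void
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $current .= $child->nodeValue;
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            if (in_array($child->nodeName, INLINE_TAGS, true)) {
                collectLines($child, $lines, $current);
            } else {
                flushLine($lines, $current);
                collectLines($child, $lines, $current);
                flushLine($lines, $current);
            }
        }
    }
}

// Hypothetical, tidied-up sample markup.
$html = <<<'HTML'
<div class="this_is_our_div_of_interest">
<div><p>Some text</p><div>Some <a href="#">more</a> text</div>
<p>And <span><strong>another</strong></span> paragraph</p></div>
<h4>abcd</h4>
xyz
</div>
HTML;

libxml_use_internal_errors(true);   // suppress warnings about imperfect HTML
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$root = $xpath->query("//div[@class='this_is_our_div_of_interest']")->item(0);

$lines = [];
$current = '';
collectLines($root, $lines, $current);
flushLine($lines, $current);        // emit any trailing text such as "xyz"

echo implode("\n", $lines), "\n";
```

Keeping the inline tags in a single list makes it easy to add further tags to ignore later.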

2013-07-14 23:21:56 PHPglue

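Well, thanks for your answer. I know how to retrieve by class, and I will update the question. The main problem is how to traverse the leaf nodes, and which XPath variable I should use! – Greeso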
