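Parsing HTML with DOMDocument (instead of regular expressions)

I want to learn how to parse HTML with DOMDocument.

I have only done some simple work so far. I already like Gordon's answer to "scrap data using regex and simplehtmldom", and I based my code on his work.

I find the documentation on PHP.net not very good: the information is limited, there are hardly any examples, and most of the details are about parsing XML.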

<?php 
$dom = new DOMDocument; 
libxml_use_internal_errors(true); 
$dom->loadHTMLFile('http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html'); 
libxml_clear_errors(); 

$recipe = array(); 
$xpath = new DOMXPath($dom); 
$contentDiv = $dom->getElementById('page'); // would have preferred getContentbyClass('content') (unique) in this case. 

# title 
print_r($xpath->evaluate('string(div/div/div/div/div/h1)', $contentDiv)); 

# content (this is not working) 
#print_r($xpath->evaluate('string(div/div/div/div['content'])', $contentDiv)); // if only this worked 
print_r($xpath->evaluate('string(div/div/div/div)', $contentDiv)); 
?>

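For testing purposes, I am trying to get the title (the h1 tag) and the content (as HTML) of a news article on nu.nl.

As you can see, I can get the title, although I am not happy with the string I evaluate: it only works because that happens to be the only h1 tag at that div level.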


2011-09-06 Dennis

Why don't you search for 'h1' in the XPath string?

Here is how you can do this with DOM and XPath:

$dom = new DOMDocument; 
libxml_use_internal_errors(true); 
$dom->loadHTMLFile('http://www.nu.nl/…'); 
libxml_clear_errors(); 

$xpath = new DOMXPath($dom); 
echo $xpath->evaluate('string(id("leadarticle")/div/h1)'); 
echo $dom->saveHTML(
    $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0) 
);

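The XPath string(id("leadarticle")/div/h1) returns the textContent of the h1 that is a child of a div that is a child of the element with the ID leadarticle.

The XPath id("leadarticle")/div[@class="content"] returns the div with the class attribute content that is a child of the element with the ID leadarticle.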
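One caveat with [@class="content"]: it compares the whole attribute value, so it will not match a div whose class list has more than one entry. The following is a minimal sketch of a token-based match, assuming made-up markup (not the real nu.nl page) and using //*[@id="leadarticle"] in place of id() purely for illustration:

```php
<?php
// Hypothetical markup standing in for the article page (an assumption).
$html = '<div id="leadarticle">'
      . '<div class="header"><h1>Title</h1></div>'
      . '<div class="content wide"><p>Body</p></div>'
      . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison: no match, because the attribute value is "content wide".
$exact = $xpath->query('//*[@id="leadarticle"]/div[@class="content"]');

// Token match: pad the class list with spaces and look for " content ".
$token = $xpath->query(
    '//*[@id="leadarticle"]'
    . '/div[contains(concat(" ", normalize-space(@class), " "), " content ")]'
);

echo $exact->length, "\n"; // 0
echo $token->length, "\n"; // 1
```

If the class attribute is known to hold exactly one value, the shorter [@class="content"] form is enough.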

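Because you want the outerHTML of the content div, you have to fetch the entire node and not just its content, hence there is no string() function in the XPath. Passing a node to the DOMDocument::saveHTML() method (which is only possible as of 5.3.6) will then serialize that node back to HTML.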
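The difference between the two kinds of expression can be seen side by side on a small inline document. This is only a sketch: the markup is invented, and //*[@id="leadarticle"] is used instead of id() as an illustrative choice.

```php
<?php
// Hypothetical markup (an assumption), loaded from a string instead of a URL.
$html = '<div id="leadarticle">'
      . '<div class="content"><p>Hello <b>world</b></p></div>'
      . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$expr  = '//*[@id="leadarticle"]/div[@class="content"]';

// string() collapses the node to its text content; the markup is lost.
$text = $xpath->evaluate("string($expr)");

// Without string(), evaluate() returns a DOMNodeList. Take the first node
// and let saveHTML() serialize it with its tags (needs PHP >= 5.3.6).
$node  = $xpath->evaluate($expr)->item(0);
$outer = $dom->saveHTML($node);

echo $text, "\n";  // plain text only
echo $outer, "\n"; // the div including its child tags
```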


2011-09-06 19:53:03 Gordon

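You shouldn't bother with the raw DOMDocument interface. Use one of the jQuery-style classes for extraction instead. See: How to parse HTML with PHP?

QueryPath seems to work fine if you use more specific selectors: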

include "qp.phar"; 
$qp = htmlqp("http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html"); 

print $qp->find(".header h1")->text(); 
print $qp->top()->find(".article .content")->xhtml();

You might, however, need to strip out the intermixed JavaScript first (->find("script")->remove()).
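For comparison, the same cleanup can be done with plain DOMDocument and DOMXPath. A minimal sketch, assuming invented markup and a hypothetical track() script body:

```php
<?php
// Hypothetical markup with an inline script mixed into the content (an assumption).
$html = '<div class="content"><p>Text</p><script>track();</script></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// The list returned by query() is a static snapshot, so nodes can be
// removed from the document while iterating over it.
foreach ($xpath->query('//script') as $script) {
    $script->parentNode->removeChild($script);
}

$content = $xpath->query('//div[@class="content"]')->item(0);
$cleaned = $dom->saveHTML($content);

echo $cleaned, "\n"; // the content div without the script element
```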


2011-09-06 19:27:34 mario

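I would +1 this, but I **strongly disagree** with "you shouldn't bother with the raw DOMDocument interface". One shouldn't bother with SimpleHtmlDOM, but DOMDocument is a fine language-agnostic interface and PHP extension that technically has everything the OP needs. These third-party libraries just add convenience. – Gordon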
