A PHP HTML parser that lets me select by class and get parent nodes

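I'm scraping a website with PHP and need to be able to get a node based on its CSS class. Specifically, I need a ul tag that has no id attribute but does have a CSS class. From that list I then need only the li tags that contain a specific anchor tag, not all of the li tags.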

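I've already looked at DOMDocument and Zend_Dom; neither seems to cover both requirements: class selection and DOM traversal (specifically, ascending to parent nodes).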


2011-12-21 user594044

DOMDocument can do this; you should include some sample HTML – ajreal 2011-12-21 02:55:56

You can use QueryPath; something like this might work:

htmlqp($html)->find("ul.class")->not("#id") 
      ->find('li a[href*="specific"]')->parent() 
// then foreach over it or use ->writeHTML() for extraction

See http://api.querypath.org/docs/class_query_path.html for the API.

(Traversal is quite easy if you don't use the cumbersome DOMDocument.)


2011-12-21 03:04:33 mario

I've had good luck with: http://simplehtmldom.sourceforge.net/


2011-12-21 03:23:06 Thor

You can do this with DOMDocument and DOMXPath. Selecting by class in XPath is a pain, but it can be done.

Here is some sample HTML (and it's completely valid!):

$html = <<<EOT 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> 
<title>Document Title</title> 
<ul id="myid"><li>myid-listitem1</ul> 
<ul class="foo 
theclass 
"><li>list2-item1<li>list2-item2</ul> 
<ul id="myid2" class="foo&#xD;theclass bar"><li>list3-item1<li>list3-item2</ul> 
EOT;

$doc = new DOMDocument(); 
$doc->loadHTML($html); 
$xp = new DOMXPath($doc); 
$nodes = $xp->query("/html/body//ul[not(@id) and contains(concat(' ',normalize-space(@class),' '), ' theclass ')]"); 

var_dump($nodes->length);
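
The reason for the long `concat(...)`/`normalize-space(...)` predicate is that a plain `contains(@class, ...)` is a substring test, not a class-token test. Here is a minimal sketch of the difference. `classPredicate` is a hypothetical helper name introduced only for illustration (it is not part of DOMXPath), and it assumes the class name contains no quote characters.

```php
<?php
// Hypothetical helper (not part of DOMXPath): builds an XPath 1.0 predicate
// that matches $class as a whole whitespace-separated token.
// Assumption: $class contains no quote characters.
function classPredicate($class)
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

$html = '<ul class="theclassic"><li>a</li></ul>'
      . '<ul class="foo theclass"><li>b</li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Naive substring test: also matches the unrelated class "theclassic".
$naive = $xp->query("//ul[contains(@class, 'theclass')]");

// Token test: matches only lists whose class list contains "theclass".
$exact = $xp->query('//ul[' . classPredicate('theclass') . ']');

var_dump($naive->length);
var_dump($exact->length);
```

Padding both the attribute value and the token with spaces is what makes the match token-exact, and `normalize-space` first collapses tabs, newlines and carriage returns into single spaces.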

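If you are using PHP 5.3, you can simplify this somewhat by registering PHP functions for use inside XPath expressions. (Note that, starting in PHP 5.1, you could already register functions for use in XPath expressions through XSLTProcessor, but not directly with DOMXPath.)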

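// Returns true only if the node-set is non-empty and every node's value 
// contains $token as a whitespace-separated token. 
function hasToken($nodearray, $token) { 
    if (!$nodearray) { 
        // An empty node-set (e.g. no class attribute) must not match. 
        return false; 
    } 
    foreach ($nodearray as $node) { 
        if ($node->nodeValue === null or !hasTokenS($node->nodeValue, $token)) { 
            return false; 
        } 
    } 
    return true; 
    // I could even return nodes or document fragments if I wanted! 
} 
// String variant: splits $str on whitespace and looks for $token. 
function hasTokenS($str, $token) { 
    $str = trim($str, "\r\n\t "); 
    $tokens = preg_split('/[\r\n\t ]+/', $str); 
    return in_array($token, $tokens); 
} 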

$xp->registerNamespace('php', 'http://php.net/xpath'); 
$xp->registerPhpFunctions(array('hasToken', 'hasTokenS')); 

// These two are equivalent: 
$nodes1 = $xp->query("/html/body//ul[not(@id) and php:function('hasToken', @class, 'theclass')]"); 
$nodes2 = $xp->query("/html/body//ul[not(@id) and php:functionString('hasTokenS', @class, 'theclass')]"); 

var_dump($nodes1->length); 
var_dump($nodes1->item(0)); 
var_dump($nodes2->length); 
var_dump($nodes2->item(0));
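
Whichever query style matches the ul, the remaining steps from the question (keeping only the li elements that contain a particular anchor, and moving up to a parent) can be sketched roughly as follows. The sample markup and the `specific` href fragment are assumptions made up for this illustration, so adjust them to the real page.

```php
<?php
// Assumed sample markup; the real page will differ.
$html = <<<'EOT'
<ul class="foo theclass">
  <li><a href="/specific/page">wanted</a></li>
  <li><a href="/other">skipped</a></li>
  <li>no link</li>
</ul>
EOT;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Step 1: the ul that has the class but no id attribute.
$uls = $xp->query("//ul[not(@id) and contains(concat(' ', normalize-space(@class), ' '), ' theclass ')]");

$found = array();
foreach ($uls as $ul) {
    // Step 2: relative query; the second argument makes $ul the context node.
    foreach ($xp->query("li/a[contains(@href, 'specific')]", $ul) as $a) {
        // Step 3: ascend from the anchor to its parent li.
        $li = $a->parentNode;
        $found[] = trim($li->textContent);
    }
}

var_dump($found);
```

`parentNode` is available on every DOMNode, so ascending works the same way from any matched node; in an XPath expression the equivalent step is `..` or the `parent::` axis.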

If DOMDocument doesn't parse your HTML very well, you can use the html5lib parser instead, which returns a DOMDocument:

require_once('lib/HTML5/Parser.php'); // or where-ever you put it 
$dom = HTML5_Parser::parse($html); 
// $dom is a plain DOMDocument object, created according to html5 parsing rules


2011-12-21 04:06:04
