2015-10-27

Scraping a table with XPath

I have a table in an HTML page that looks like this (pastebin url).

I want to grab the contents of that table. My code for this is:

$html = htmlspecialchars("https://localhost/table.php"); 

$doc = new \DOMDocument(); 

if($doc->loadHTML($html)) 
{ 
    $result = new \DOMDocument(); 
    $result->formatOutput = true; 
    $table = $result->appendChild($result->createElement("table")); 
    $thead = $table->appendChild($result->createElement("thead")); 
    $tbody = $table->appendChild($result->createElement("tbody")); 

    $xpath = new \DOMXPath($doc); 

    $newRow = $thead->appendChild($result->createElement("tr")); 

    foreach($xpath->query("//table[@id='kurstabell']/thead/tr/th[position()>1]") as $header) 
    { 
     $newRow->appendChild($result->createElement("th", trim($header->nodeValue))); 
    } 

    foreach($xpath->query("//table[@id='kurstabell']/tbody/tr") as $row) 
    { 
     $newRow = $tbody->appendChild($result->createElement("tr")); 

     foreach($xpath->query("./td[position()>1]", $row) as $cell) 
     { 
      $newRow->appendChild($result->createElement("td", trim($cell->nodeValue))); 
     } 
    } 

    echo $result->saveXML($result->documentElement); 
} 

print_r($result); 

(I'm using htmlspecialchars because libxml_use_internal_errors(true); produced the error Europe/Berlin] PHP Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: and I saw elsewhere that using htmlspecialchars was the way to fix it.)

The current result of this snippet is:

DOMDocument Object ([doctype] => [implementation] => (object value omitted) [documentElement] => (object value omitted) [actualEncoding] => [encoding] => [xmlEncoding] => [standalone] => 1 [xmlStandalone] => 1 [version] => 1.0 [xmlVersion] => 1.0 [strictErrorChecking] => 1 [documentURI] => [config] => [formatOutput] => 1 [validateOnParse] => [resolveExternals] => [preserveWhiteSpace] => 1 [recover] => [substituteEntities] => [nodeName] => #document [nodeValue] => [nodeType] => 9 [parentNode] => [childNodes] => (object value omitted) [firstChild] => (object value omitted) [lastChild] => (object value omitted) [previousSibling] => [attributes] => [ownerDocument] => [namespaceURI] => [prefix] => [localName] => [baseURI] => [textContent] =>) 

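php_error.log doesn't show any errors.

The expected result is the same table, echoed as HTML, but with all the "unnecessary" markup removed.

My question: what is wrong with this code?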

Answer


The problem is with the first line:

$html = htmlspecialchars("https://localhost/table.php"); 

It should simply be:

$html = file_get_contents("https://localhost/table.php"); 

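htmlspecialchars() never fetches the page; it only escapes the string it is given. When loadHTML() parses escaped markup, it returns a single text node containing all the HTML tags as literal text rather than the expected DOM.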
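
To see the difference without touching the network, here is a minimal sketch that parses the same markup twice, once raw and once passed through htmlspecialchars(). The inline $markup string and the countCells() helper are hypothetical stand-ins, not part of the question's code; only the table id kurstabell is taken from it. The sketch also assumes that libxml_use_internal_errors(true) is an acceptable way to keep entity warnings such as the htmlParseEntityRef one out of the output.

```php
<?php
// Hypothetical inline markup standing in for the page at https://localhost/table.php.
// The bare "&" is deliberate: it triggers an htmlParseEntityRef parse warning.
$markup = '<table id="kurstabell"><thead><tr><th>#</th><th>Name</th></tr></thead>'
        . '<tbody><tr><td>1</td><td>A & B</td></tr></tbody></table>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

/** Count the header and data cells found under the kurstabell table. */
function countCells(string $html): int
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    return $xpath->query("//table[@id='kurstabell']//tr/*")->length;
}

$rawCells     = countCells($markup);                   // parsed as real elements
$escapedCells = countCells(htmlspecialchars($markup)); // parsed as plain text

// Discard the collected warnings once parsing is done.
libxml_clear_errors();

echo "raw: $rawCells, escaped: $escapedCells\n";
```

The raw markup should yield table cells, while the escaped copy should yield none, because every tag has been turned into literal text. In the real script the string would come from file_get_contents() as shown above, and the XPath queries from the question can then run unchanged.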