How do I correctly grab certain nodes from an HTML string?

I am trying to grab some nodes out of a given HTML string:

$html = <<<'HTML' 
<h1>Details au&szlig;en</h1> 
<h1>Schreibmappe DIN A4</h1> 
<hr> 
<p>Die Au&szlig;enseite [...]</p> 
<p class="own-branding">[...]</p> 
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p> 
HTML;

I need the first h1 and the last img node in the string.

For this I used DOMDocument, because with preg_match_all or something similar we might miss something.

Full code:

$html = <<<'HTML' 
<h1>Details au&szlig;en</h1> 
<h1>Schreibmappe DIN A4</h1> 
<hr> 
<p>Die Au&szlig;enseite [...]</p> 
<p class="own-branding">[...]</p> 
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p> 
HTML; 

$dom = new \DOMDocument(); 
// since the libxml was designed for ISO-8859-1, this is a backwards hack 
// @see https://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258 
$dom->loadHTML(iconv('UTF-8', 'ISO-8859-1', $html), 
    \LIBXML_HTML_NOIMPLIED 
); 
$h1List = $dom->getElementsByTagName('h1'); 
$h1 = $h1List->item(0); 
$imgList = $dom->getElementsByTagName('img'); 
$img = $imgList->item($imgList->length - 1); 

$data = array(
    'tabTitle' => trim($dom->saveHTML($h1)), 
    'tabImg' => trim($dom->saveHTML($img)) 
); 


// remove both wrappers if empty 
$imgWrapper = $img->parentNode; 
$imgWrapper->removeChild($img); 

if (!$imgWrapper->hasChildNodes()) { 
    $imgWrapper->parentNode->removeChild($imgWrapper); 
} 

$h1Wrapper = $h1->parentNode; 
$h1Wrapper->removeChild($h1); 

if (!$h1Wrapper->hasChildNodes()) { 
    $h1Wrapper->parentNode->removeChild($h1Wrapper); 
} 

$data['content'] = $dom->saveHTML(); 

var_dump($data);

Expected output:

array(
    'tabTitle' => '<h1>Details außen</h1>', 
    'tabImg' => '<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path=\'media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg\'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">', 
    'content' => ' 
<h1>Schreibmappe DIN A4</h1> 
<hr> 
<p>Die Au&szlig;enseite [...]</p> 
<p class="own-branding">[...]</p> 
<p> 
' 
);

But I get the following output instead:

array(3) { 
    'tabTitle' => 
    string(501) "<h1>Details außen<h1>Schreibmappe DIN A4</h1> 
<hr> 
<p>Die Außenseite [...]</p> 
<p class="own-branding">[...]</p> 
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p> 
</h1>" 
    'tabImg' => 
    string(373) "<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">" 
    'content' => 
    string(108) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> 

" 
}

What is wrong here? I am using PHP 5.6. If the problem is related to the PHP version, I could switch to PHP 7.

Source

2017-05-24 alpham8

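You should not have multiple h1 elements in your HTML – lloiacono

I have never heard of this rule, and in my opinion it makes no sense. Think of a website with an index: the first-level headings are the main points, and you refer to them with h2 and so on. Anyway, I googled the topic. Basically, yes, we should not. But it is not a functional break. – alpham8

This should get you started. First I query the DOMDocument with XPath, then I use saveXML to print the nodes.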

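// $html is the heredoc string from the question.
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$nodes = array();
$nodes[] = $xpath->query('//h1')->item(0);   // first h1
$imgs = $xpath->query('//img');
$nodes[] = $imgs->item($imgs->length - 1);   // last img, as the question asks

foreach ($nodes as $node) {
    // saveXML() returns UTF-8; utf8_decode() converts it to ISO-8859-1 for display
    echo utf8_decode($dom->saveXML($node)) . PHP_EOL;
}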

This is the output for your example:

<h1>Details außen</h1> 
<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"/>

You can then format it into the desired output.

Source
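To go one step further, here is a rough sketch that builds all three array entries from the question. It rests on two assumptions that match the output shown in the question but that I have not checked against every libxml version: with LIBXML_HTML_NOIMPLIED and several top-level elements, the first h1 seems to be treated as the document root, so everything else ends up nested inside it; and saveHTML() percent-encodes the src attribute, while saveXML() leaves it alone. The sketch therefore drops the flag and serializes the children of the implied body instead. The markup is a shortened copy of yours; the img id and file name are placeholders.

```php
<?php
// Shortened copy of the question's markup; the img id and file name are placeholders.
$html = <<<'HTML'
<h1>Details au&szlig;en</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="demo-image" src="{media path='media/image/demo.jpg'}" alt="demo" width="274" height="339"></p>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$dom = new DOMDocument();
// No LIBXML_HTML_NOIMPLIED: libxml adds <html><body>, so every top-level
// element of the fragment stays a sibling inside <body>.
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$h1  = $xpath->query('(//h1)[1]')->item(0);       // first h1 in document order
$img = $xpath->query('(//img)[last()]')->item(0); // last img in document order

$data = array(
    // saveXML() is used here so that the src attribute is not percent-encoded.
    'tabTitle' => trim($dom->saveXML($h1)),
    'tabImg'   => trim($dom->saveXML($img)),
);

// Detach both nodes and drop a wrapper element that ends up empty.
foreach (array($img, $h1) as $node) {
    $wrapper = $node->parentNode;
    $wrapper->removeChild($node);
    if (!$wrapper->hasChildNodes() && $wrapper->parentNode !== null) {
        $wrapper->parentNode->removeChild($wrapper);
    }
}

// Serialize only the children of <body>; this skips the doctype and the implied wrappers.
$body = $dom->getElementsByTagName('body')->item(0);
$content = '';
foreach ($body->childNodes as $child) {
    $content .= $dom->saveHTML($child);
}
$data['content'] = trim($content);

var_dump($data);
```

Note that saveXML() writes the image as a self-closing tag (`<img ... />`), as in the output above. If you need the HTML form, you would have to post-process that string yourself.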

2017-05-24 14:23:30 lloiacono
