How do I correctly extract certain nodes from an HTML string?
I am trying to extract some nodes from a given HTML string:
$html = <<<'HTML'
<h1>Details außen</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;
I need the first h1 and the last img node in the string.
I used DOMDocument for this, because preg_match_all or something similar might miss some cases.
Full code:
$html = <<<'HTML'
<h1>Details außen</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;
$dom = new \DOMDocument();
// since the libxml was designed for ISO-8859-1, this is a backwards hack
// @see https://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258
$dom->loadHTML(iconv('UTF-8', 'ISO-8859-1', $html),
\LIBXML_HTML_NOIMPLIED
);
$h1List = $dom->getElementsByTagName('h1');
$h1 = $h1List->item(0);
$imgList = $dom->getElementsByTagName('img');
$img = $imgList->item($imgList->length - 1);
$data = array(
'tabTitle' => trim($dom->saveHTML($h1)),
'tabImg' => trim($dom->saveHTML($img))
);
// remove both wrappers if empty
$imgWrapper = $img->parentNode;
$imgWrapper->removeChild($img);
if (!$imgWrapper->hasChildNodes()) {
$imgWrapper->parentNode->removeChild($imgWrapper);
}
$h1Wrapper = $h1->parentNode;
$h1Wrapper->removeChild($h1);
if (!$h1Wrapper->hasChildNodes()) {
$h1Wrapper->parentNode->removeChild($h1Wrapper);
}
$data['content'] = $dom->saveHTML();
var_dump($data);
Expected output:
array(
'tabTitle' => '<h1>Details außen</h1>',
'tabImg' => '<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path=\'media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg\'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">',
'content' => '
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p>
'
);
But I got the following output instead:
array(3) {
'tabTitle' =>
string(501) "<h1>Details außen<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
</h1>"
'tabImg' =>
string(373) "<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">"
'content' =>
string(108) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
"
}
What is wrong here? I am using PHP 5.6. If the problem is related to the PHP version, I can switch to PHP 7.
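One possible explanation for the output above, offered as a guess rather than a confirmed diagnosis: with LIBXML_HTML_NOIMPLIED the parser adds no html/body wrapper, so the first h1 seems to end up as the document root and everything after it is nested inside it. Removing that h1 then empties the document, which would leave only the doctype in content. Below is a minimal sketch of a workaround, assuming it is acceptable to wrap the fragment in a temporary div and serialize that div's children afterwards. The shortened HTML sample and the file name a.jpg are illustrative stand-ins, not the original markup. LIBXML_HTML_NODEFDTD is added here to suppress the doctype. The iconv call is kept from the question and assumes the input only contains characters that exist in ISO-8859-1.

```php
<?php
// Shortened stand-in for the fragment in the question (hypothetical sample data).
$html = <<<'HTML'
<h1>Details außen</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p><img src="a.jpg" alt="a"></p>
HTML;

$dom = new \DOMDocument();
// Wrap the fragment in one container element so the parser has exactly
// one root element to work with under LIBXML_HTML_NOIMPLIED.
$wrapped = '<div>' . iconv('UTF-8', 'ISO-8859-1', $html) . '</div>';
$dom->loadHTML($wrapped, \LIBXML_HTML_NOIMPLIED | \LIBXML_HTML_NODEFDTD);

$root = $dom->documentElement; // the temporary wrapper <div>

$h1List = $dom->getElementsByTagName('h1');
$h1 = $h1List->item(0);
$imgList = $dom->getElementsByTagName('img');
$img = $imgList->item($imgList->length - 1);

$data = array(
    'tabTitle' => trim($dom->saveHTML($h1)),
    'tabImg'   => trim($dom->saveHTML($img)),
);

// Detach the image; drop its parent if that parent is now empty
// and is not the temporary wrapper itself.
$imgWrapper = $img->parentNode;
$imgWrapper->removeChild($img);
if (!$imgWrapper->hasChildNodes() && !$imgWrapper->isSameNode($root)) {
    $imgWrapper->parentNode->removeChild($imgWrapper);
}
// Detach the first heading from its parent.
$h1->parentNode->removeChild($h1);

// Serialize the wrapper's children one by one, so the wrapper
// itself never appears in the result.
$content = '';
foreach ($root->childNodes as $child) {
    $content .= $dom->saveHTML($child);
}
$data['content'] = trim($content);

var_dump($data);
```

The URL-escaping of the src attribute seen in the actual output (%7Bmedia%20path=...) is a separate matter that this sketch does not address.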
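You should not have multiple h1 elements in HTML – lloiacono
I have never heard of this rule, and in my view it makes no sense. Think of a website with an index: the top-level heading is the main point, and h2 and so on refer directly to it. Anyway, I googled the topic. Basically, yes, we should not. But it does not break functionality. – alpham8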