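I am assuming your HTML is intentionally malformed.
There are many options, including XPath and numerous libraries. Regex is not a good idea. I find DOMDocument fast and relatively simple.
Call getElementsByTagName, then iterate over the returned elements to get the innerHTML of each.
Example: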
<?php
// Concatenate the serialized child nodes of $node (its "innerHTML").
function get_inner_html($node) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML($child);
    }
    return $innerHTML;
}

$str = <<<'EOD'
<div>This is a test.</div>
<div>This <b>another</b> a test.</div>
<div/>
<div>This is last a test.</div>
EOD;

$doc = new DOMDocument();
$doc->loadHTML($str);
$ellies = $doc->getElementsByTagName('div');
$array = [];
foreach ($ellies as $one_el) {
    // Skip divs whose inner HTML is empty.
    if ($ih = get_inner_html($one_el))
        $array[] = $ih;
}
?>
<pre>
<?php print_r($array); ?>
</pre>
// Output
// Note that there would be a 4th array element
// without the `if ($ih = get_inner_html($one_el))` check:
Array
(
    [0] => This is a test.
    [1] => This <b>another</b> a test.
    [2] => This is last a test.
)
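Since the input is assumed to be malformed, loadHTML may emit warnings while parsing it. The following is a minimal sketch of one way to keep those warnings out of the output, using PHP's libxml_use_internal_errors and libxml_clear_errors. The helper name load_html_quietly and the sample string are assumptions made for illustration; they are not part of the example above.

```php
<?php
// Hypothetical helper: parse possibly malformed HTML without printing
// libxml warnings. The name is an assumption for this sketch.
function load_html_quietly(string $html): DOMDocument {
    $doc = new DOMDocument();
    // Buffer libxml errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    // Discard the buffered errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Sample input with an unclosed <b> tag (assumed for illustration).
$doc = load_html_quietly('<div>This is a test.</div><div>Unclosed <b>bold</div>');
foreach ($doc->getElementsByTagName('div') as $div) {
    echo $div->textContent, "\n";
}
```

The rest of the approach stays the same: the returned document can be passed to getElementsByTagName exactly as in the example above.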
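Note:
The above works fine as long as you do not have nested DIVs. If you do have nesting, you must exclude the nested child elements when looping over the child nodes to build the innerHTML.
For example, suppose you have this HTML: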
<div>One
<div>Two</div>
<div>Three</div>
<div/>
<div>Four
<div>Five</div>
</div>
Here is how to handle the HTML above and get an array of the numbers in order:
Handling nesting
<?php
// Like get_inner_html(), but skips child elements whose tag name is $exclude.
function get_inner_html_unnested($node, $exclude) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        // Text nodes have no tagName, so they are always kept.
        if (!property_exists($child, 'tagName') || ($child->tagName != $exclude))
            $innerHTML .= trim($child->ownerDocument->saveXML($child));
    }
    return $innerHTML;
}

$str = <<<'EOD'
<div>One
<div>Two</div>
<div>Three</div>
<div/>
<div>Four
<div>Five</div>
</div>
EOD;

$doc = new DOMDocument();
$doc->loadHTML($str);
$ellies = $doc->getElementsByTagName('div');
$array = [];
foreach ($ellies as $one_el) {
    // Skip divs that have no content of their own.
    if ($ih = get_inner_html_unnested($one_el, 'div'))
        $array[] = $ih;
}
?>
<pre>
<?php print_r($array); ?>
</pre>
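XPath was mentioned as an option at the top, so here is a rough sketch of the same nested-exclusion idea done with DOMXPath instead of checking tagName by hand. The function name inner_html_excluding is hypothetical, the sample input is a well-formed variant chosen for this sketch, and $tag is assumed to be a plain tag name because it is concatenated directly into the XPath expression.

```php
<?php
// Hypothetical helper (an assumption for this sketch): collect each
// element's own content, skipping direct children with the same tag name.
function inner_html_excluding(string $html, string $tag): array {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $result = [];
    foreach ($xpath->query('//' . $tag) as $element) {
        $inner = '';
        // node() matches text and element children; the predicate
        // drops child elements named $tag.
        $children = $xpath->query('node()[not(self::' . $tag . ')]', $element);
        foreach ($children as $child) {
            $inner .= trim($doc->saveHTML($child));
        }
        if ($inner !== '') {
            $result[] = $inner;
        }
    }
    return $result;
}

// Well-formed sample input, assumed for illustration.
$html = "<div>One\n<div>Two</div>\n<div>Three</div>\n<div>Four\n<div>Five</div>\n</div>\n</div>";
print_r(inner_html_excluding($html, 'div'));
```

For this sample input the result should be the five strings One through Five in document order. The predicate keeps the filtering inside the query, which may be easier to extend (for example, to exclude several tag names) than the tagName comparison.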
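I don't know whether you want a piece of code or a method so you can write this yourself (possibly using regex?), but if you just want to get the job done, you might want to look at http://simplehtmldom.sourceforge.net/. It may be overkill to use a big library for one string; otoh, you might need to parse more later? – Nanne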