How can I split a string by tags in PHP, as when parsing HTML into a DOM tree?

Here is the string:

<div>This is a test.</div> 
<div>This <b>another</b> a test.</div> 
<div/> 
<div>This is last a test.</div>

I want the string above split into an array like this:

{"This is a test.", "This <b>another</b> a test.", "", "This is last a test."}

Any ideas on how to do this in PHP? Thanks.


2011-07-28 Tattat

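I don't know whether you want a piece of code or a method to write this yourself (possibly using a regular expression?), but if you just want to get the job done, you might want to look at http://simplehtmldom.sourceforge.net/. It may be overkill to use a large library for a single string, but OTOH, you might need to parse more later? – Nanne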

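I'm assuming your HTML is intentionally malformed.

There are many options, including XPath and numerous libraries. Regex is not a good idea. I find DOMDocument fast and relatively simple.

Call getElementsByTagName, then iterate over the returned elements to get the innerHTML of each.

Example: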

<?php
// Concatenate the serialized child nodes of $node (its "innerHTML").
function get_inner_html($node) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML($child);
    }

    return $innerHTML;
}

$str = <<<'EOD'
<div>This is a test.</div>
<div>This <b>another</b> a test.</div>
<div/>
<div>This is last a test.</div>
EOD;

$doc = new DOMDocument();
$doc->loadHTML($str);
$ellies = $doc->getElementsByTagName('div');
$array = []; // initialize so print_r() works even when nothing matches
foreach ($ellies as $one_el) {
    // Skip elements whose inner HTML is empty (such as <div/>)
    if ($ih = get_inner_html($one_el)) {
        $array[] = $ih;
    }
}
?>
<pre>
<?php print_r($array); ?>
</pre>

// Output
// Note that there would be a 4th array element
// without the `if ($ih = get_inner_html($one_el))` check:
Array 
(
    [0] => This is a test. 
    [1] => This <b>another</b> a test. 
    [2] => This is last a test. 
)
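
The question asks for the empty string to be kept, while the `if` check above drops it. The following is a minimal sketch of a variant that keeps empty elements. The helper names `keep_inner_html` and `split_divs_keep_empty` are hypothetical, and the sample input uses `<div></div>` instead of `<div/>` on the assumption that the handling of a self-closing `<div/>` may vary with the libxml version.

```php
<?php
// Sketch only: keep_inner_html() and split_divs_keep_empty() are
// hypothetical helper names, not part of the original answer.

// Concatenate the serialized child nodes of $node.
function keep_inner_html(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveXML($child);
    }
    return $html;
}

// Return the inner HTML of every <div>, including empty ones.
function split_divs_keep_empty(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $parts = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        $parts[] = keep_inner_html($div); // no emptiness check here
    }
    return $parts;
}

// Assumption: the empty element is written as <div></div>, not <div/>.
$str = <<<'EOD'
<div>This is a test.</div>
<div>This <b>another</b> a test.</div>
<div></div>
<div>This is last a test.</div>
EOD;

$parts = split_divs_keep_empty($str);
print_r($parts);
```

With this input the array should have four entries, the third being an empty string, which matches the shape requested in the question.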

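Note:

The above works fine as long as you have no nested DIVs. If you do have nesting, you must exclude the nested child elements when looping over the children to build the innerHTML.

For example, suppose you have HTML like this: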

<div>One 
    <div>Two</div> 
    <div>Three</div> 
<div/> 
<div>Four 
    <div>Five</div> 
</div>

Here is how to deal with the HTML above and get an array of the numbers in order:

Dealing with nesting

<?php
// Like get_inner_html(), but skips direct child elements whose tag is $exclude.
function get_inner_html_unnested($node, $exclude) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        // Text nodes are always kept; element children named $exclude are skipped.
        if (!($child instanceof DOMElement) || ($child->tagName != $exclude)) {
            $innerHTML .= trim($child->ownerDocument->saveXML($child));
        }
    }
    return $innerHTML;
}

$str = <<<'EOD'
<div>One
    <div>Two</div>
    <div>Three</div>
<div/>
<div>Four
    <div>Five</div>
</div>
EOD;

$doc = new DOMDocument();
$doc->loadHTML($str);
$ellies = $doc->getElementsByTagName('div');
$array = [];
foreach ($ellies as $one_el) {
    if ($ih = get_inner_html_unnested($one_el, 'div')) {
        $array[] = $ih;
    }
}
?>
<pre>
<?php print_r($array); ?>
</pre>
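
Since XPath is mentioned as another option, here is a minimal sketch of the same nested-div handling done with DOMXPath. The function name `own_html_per_div` is hypothetical, and the sample input is a well-formed version of the HTML above (the stray `<div/>` is assumed to be a closing `</div>`).

```php
<?php
// Sketch only: own_html_per_div() is a hypothetical helper name.

// For every <div>, collect its own content, leaving out nested <div> children.
function own_html_per_div(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//div') as $div) {
        $own = '';
        // Direct child nodes of this div that are not <div> elements.
        foreach ($xpath->query('node()[not(self::div)]', $div) as $child) {
            // Each child is trimmed separately, as in the loop-based version.
            $own .= trim($doc->saveXML($child));
        }
        if ($own !== '') {
            $result[] = $own;
        }
    }
    return $result;
}

// Assumption: well-formed input, with </div> where the original has <div/>.
$str = <<<'EOD'
<div>One
    <div>Two</div>
    <div>Three</div>
</div>
<div>Four
    <div>Five</div>
</div>
EOD;

$numbers = own_html_per_div($str);
print_r($numbers);
```

The `//div` query returns the elements in document order, so the entries should come out as One, Two, Three, Four, Five.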


2011-07-28 06:18:00

Wow, nice! – Tattat

This make_array function should do the trick for you:

function make_array($string) 
{ 
    $regexp = "(\s*</?div/?>\s*)+"; 
    $string = preg_replace("@^$regexp@", "", $string); 
    $string = preg_replace("@$regexp$@", "", $string); 
    return preg_split("@$regexp@", $string); 
}

When passed the string you gave as an example, it outputs the following array:

Array 
(
    [0] => "This is a test." 
    [1] => "This <b>another</b> a test." 
    [2] => "This is last a test." 
)
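
The comments on this answer warn against the regex approach. A minimal sketch of one way it can go wrong, assuming `make_array` wraps `$regexp` in `@` delimiters for all three calls: the pattern only recognizes bare `<div>`, `</div>` and `<div/>` tags, so a `<div>` that carries an attribute is not treated as a delimiter. The sample input and the variable name `$with_attribute` are illustrative.

```php
<?php
// Sketch only: the patterns are assumed to be built from $regexp
// wrapped in "@" delimiters; the input below is an illustrative case.

function make_array(string $string): array
{
    $regexp = '(\s*</?div/?>\s*)+';
    // Strip a leading run of div tags, then a trailing run.
    $string = preg_replace('@^' . $regexp . '@', '', $string);
    $string = preg_replace('@' . $regexp . '$@', '', $string);
    // Split on the remaining runs of div tags.
    return preg_split('@' . $regexp . '@', $string);
}

// The first <div> has a class attribute, which the pattern does not match.
$with_attribute = make_array('<div class="a">x</div><div>y</div>');
print_r($with_attribute);
```

Here the first entry comes back as `<div class="a">x`, with the opening tag still attached, whereas a DOM-based approach would return just `x`.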


2011-07-28 06:20:50

yes 1

no

yes 2

''? ==> http://codepad.viper-7.com/wZycaN –

As mentioned in the accepted answer: [Regex is not a good idea](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) – nietonfir
