PHP: DOMDocument: removing unwanted text from a nested element
I have the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<header level="2">My Header</header>
<ul>
<li>Bulleted style text
<ul>
<li>
<paragraph>1.Sub Bulleted style text</paragraph>
</li>
</ul>
</li>
</ul>
<ul>
<li>Bulleted style text <strong>bold</strong>
<ul>
<li>
<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
</li>
</ul>
</li>
</ul>
I need to remove the numbers that precede the sub-bulleted text, i.e. the 1. and 2. in the example above.
Here is my code so far:
<?php
class MyDocumentImporter
{
    const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';
    protected $xml_string = '<some_tag><header level="2">My Header</header><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul><ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';
    protected $dom;

    public function processListsText($loop = null){
        $this->dom = new DomDocument('1.0', 'UTF-8');
        $this->dom->loadXML($this->xml_string);
        if(!$loop){
            //get all the li tags
            $li_set = $this->dom->getElementsByTagName('li');
        }
        else{
            $li_set = $loop;
        }
        foreach($li_set as $li){
            //check for child nodes
            if(! $li->hasChildNodes()){
                continue;
            }
            foreach($li->childNodes as $child){
                if($child->hasChildNodes()){
                    //this li has children, maybe a <strong> tag
                    $this->processListsText($child->childNodes);
                }
                if(! ($child instanceof DOMElement)){
                    continue;
                }
                if(($child->localName != 'paragraph') || ($child instanceof DOMText)){
                    continue;
                }
                if(preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0){
                    continue;
                }
                $clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent);
                //set node to empty
                $child->nodeValue = '';
                //add updated content to node
                $child->appendChild($child->ownerDocument->createTextNode($clean_content));
                //$xml_output = $child->parentNode->ownerDocument->saveXML($child);
                //var_dump($xml_output);
            }
        }
    }
}

$importer = new MyDocumentImporter();
$importer->processListsText();
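The problem, as far as I can see, is that $child->textContent returns the plain text content of the node and strips out any child tags. So:
<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
becomes
<paragraph>Sub Bulleted bold</paragraph>
The <strong> tag is gone.
I'm a bit stumped... Can anyone see a way to remove the unwanted characters while keeping the inner child <strong> tag?
The tag may not always be <strong>; it could also be a hyperlink <a href="#"> or <emphasize>.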
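One possible direction, offered only as a sketch: instead of reading textContent from the whole <paragraph> and rebuilding the node, edit just the leading text node, so that sibling elements such as <strong>, <a> or <emphasize> are never touched. The helper name stripLeadingBulletNumber is hypothetical and not part of the class above, the XML is a shortened copy of the inline string from the question, and the sketch assumes the number always sits in the first direct text child of <paragraph>.

```php
<?php
// Hypothetical helper (not from the original code): strips a leading "N."
// from the first text node of a <paragraph>, leaving child elements alone.
function stripLeadingBulletNumber(DOMElement $paragraph): void
{
    $first = $paragraph->firstChild;

    // Assumption: the unwanted number lives in the first direct text child.
    if ($first instanceof DOMText) {
        // Same pattern as AWKWARD_BULLET_REGEX in the question.
        $first->nodeValue = preg_replace('/(^[\s]?[\d]+[\.]{1})/i', '', $first->nodeValue);
    }
}

// Shortened copy of the question's inline XML string.
$xml = '<some_tag><ul><li>Bulleted style text <strong>bold</strong><ul><li>'
     . '<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>'
     . '</li></ul></li></ul></some_tag>';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML($xml);

// Only text nodes are modified, so iterating the live node list is safe here.
foreach ($dom->getElementsByTagName('paragraph') as $paragraph) {
    stripLeadingBulletNumber($paragraph);
}

$result = $dom->saveXML($dom->getElementsByTagName('paragraph')->item(0));
echo $result, "\n";
// prints: <paragraph>Sub Bulleted <strong>bold</strong></paragraph>
```

Because only the text node's value changes, whatever inline element follows it is preserved regardless of its tag name. If the number could be wrapped inside an inline element, this sketch would not find it.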
This doesn't even parse correctly as XML. –
@Jack: His formatted example doesn't, but his inline code example does. – Wrikken
You can use '\.' instead of '[\.]{1}', by the way. – Wrikken
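The simplification suggested in the last comment can be checked with a quick sketch. The variable names and sample strings here are made up for illustration; the first pattern is the one from the question, and the second applies the comment's '\.' suggestion along with the equivalent shorthand for the other single-item character classes.

```php
<?php
$original   = '/(^[\s]?[\d]+[\.]{1})/i'; // pattern from the question
$simplified = '/^\s?\d+\./';             // same match, less noise

// Illustrative inputs only.
$samples = [
    '1.Sub Bulleted style text',
    ' 2.Sub Bulleted ',
    '12.Twelve',
    'No number here',
];

$mismatches = 0;
foreach ($samples as $sample) {
    $a = preg_replace($original, '', $sample);
    $b = preg_replace($simplified, '', $sample);
    if ($a !== $b) {
        $mismatches++;
    }
    echo var_export($sample, true), ' => ', var_export($b, true), "\n";
}

echo "mismatches: $mismatches\n";
```

The capture group, the /i flag and the {1} quantifier have no effect on what is matched here, which is why the two patterns behave the same on these inputs.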