PHP: DOMDocument: removing unwanted text from nested elements

I have the following XML document:

<?xml version="1.0" encoding="UTF-8"?> 
<header level="2">My Header</header> 
<ul> 
    <li>Bulleted style text 
     <ul> 
      <li> 
       <paragraph>1.Sub Bulleted style text</paragraph> 
      </li> 
     </ul> 
    </li> 
</ul> 
<ul> 
    <li>Bulleted style text <strong>bold</strong> 
     <ul> 
      <li> 
       <paragraph>2.Sub Bulleted <strong>bold</strong></paragraph> 
      </li> 
     </ul> 
    </li> 
</ul>

I need to remove the numbers that precede the sub-bullet text, i.e. the "1." and "2." in the example above.

This is my code so far:

<?php 
class MyDocumentImporter 
{ 
    const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i'; 

    protected $xml_string = '<some_tag><header level="2">My Header</header><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul><ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>'; 

    protected $dom; 

    public function processListsText($loop = null){ 

     $this->dom = new DomDocument('1.0', 'UTF-8'); 

     $this->dom->loadXML($this->xml_string); 

     if(!$loop){ 
      //get all the li tags 
      $li_set = $this->dom->getElementsByTagName('li'); 
     } 
     else{ 
      $li_set = $loop; 
     } 

     foreach($li_set as $li){ 

      //check for child nodes 
      if(! $li->hasChildNodes()){ 
       continue; 
      } 

      foreach($li->childNodes as $child){ 
       if($child->hasChildNodes()){ 
        //this li has children, maybe a <strong> tag 
        $this->processListsText($child->childNodes); 
       } 
       if(! ($child instanceof DOMElement)){ 
        continue; 
       } 
       if(($child->localName != 'paragraph') || ($child instanceof DOMText)){ 
        continue; 
       } 
       if(preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0){ 
        continue; 
       } 

       $clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent); 

       //set node to empty 
       $child->nodeValue = ''; 

       //add updated content to node 
       $child->appendChild($child->ownerDocument->createTextNode($clean_content)); 

       //$xml_output = $child->parentNode->ownerDocument->saveXML($child); 
       //var_dump($xml_output); 

      } 
     } 
    } 
} 

$importer = new MyDocumentImporter(); 
$importer->processListsText();

The problem I can see is that $child->textContent returns the plain text content of the node and strips out any child tags. So:

<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>

becomes

<paragraph>Sub Bulleted bold</paragraph>

The <strong> tag is gone.
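
The effect can be reproduced in isolation. The following is a minimal sketch, assuming a single standalone <paragraph> element as input; the variable names and the simplified pattern are illustrative and not taken from the class above:

```php
<?php
// Assumed input: one paragraph from the example, parsed on its own.
$dom = new DOMDocument();
$dom->loadXML('<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>');
$para = $dom->documentElement;

// Before: the paragraph has two children, a text node and a <strong> element.
$childrenBefore = $para->childNodes->length;

// textContent concatenates the text of all descendants; markup is not included.
$flat = $para->textContent;

// Assigning nodeValue on an element replaces all of its children
// with a single text node, so the <strong> element is lost.
$para->nodeValue = preg_replace('/^\s?\d+\./', '', $flat);

$after = $dom->saveXML($para);
echo $after, "\n";
```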

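I'm a bit stumped... can anyone see a way to strip the unwanted characters while preserving the "inner child" <strong> tags?

The tag may not always be <strong>; it could also be a hyperlink <a href="#"> or <emphasize>.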

2013-05-21 gArn

This doesn't even parse as XML properly. –

@Jack: his formatted example doesn't, but his inline code example does. – Wrikken

You could use '\.' instead of '[\.]{1}', by the way. – Wrikken

Assuming your XML actually parses, you can use XPath to make your queries a lot easier:

$xp = new DOMXPath($this->dom); 

foreach ($xp->query('//li/paragraph') as $para) { 
     $para->firstChild->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $para->firstChild->nodeValue); 
}

This performs the text replacement on the first text node only, rather than on the whole tag content.
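
As a self-contained sketch of this approach: the $xml sample, the <root> wrapper and the instanceof DOMText guard are assumptions added for illustration, and the pattern escapes the dot so that only a literal period after the number is stripped.

```php
<?php
// Assumed sample input, wrapped in a single root element so that it parses.
$xml = '<root><ul><li>Bulleted style text <strong>bold</strong><ul><li>'
     . '<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>'
     . '</li></ul></li></ul></root>';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML($xml);

$xp = new DOMXPath($dom);

foreach ($xp->query('//li/paragraph') as $para) {
    $first = $para->firstChild;
    // Guard (an assumption): skip paragraphs that do not start with a text node.
    if (!($first instanceof DOMText)) {
        continue;
    }
    // Strip a leading "N." prefix from the first text node only;
    // sibling elements such as <strong> are left untouched.
    $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
}

$result = $dom->saveXML($dom->getElementsByTagName('paragraph')->item(0));
echo $result, "\n";
```

Without the guard, a paragraph that begins with an element rather than text would have that element's content overwritten, which is the same problem as in the question.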

2013-05-21 17:16:28

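Thanks for your help. The code I posted is only one part of a multi-faceted document-processing class. Hopefully, with the XPath tip, I'll be able to clean up a lot of the code! – gArn

You are resetting the paragraph's entire content, but all you want is to change the first text node (remember that text nodes are nodes too). You probably want to look at the XPath //li/paragraph/text()[position()=1], and process or replace that DOMText node rather than the whole paragraph content.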

$d = new DOMDocument(); 
$d->loadXML($xml); 
$p = new DOMXPath($d); 
foreach($p->query('//li/paragraph/text()[position()=1]') as $text){ 
     $text->parentNode->replaceChild(new DOMText(preg_replace(self::AWKWARD_BULLET_REGEX, '', $text->textContent)), $text); 
}
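
A standalone sketch of the same idea follows. The function name stripBulletNumbers, the BULLET_PREFIX constant (a simplified stand-in for AWKWARD_BULLET_REGEX) and the sample input are hypothetical, and it uses $d->createTextNode() in place of new DOMText so that the replacement node belongs to the document from the start:

```php
<?php
// Hypothetical constant: a simplified equivalent of AWKWARD_BULLET_REGEX.
const BULLET_PREFIX = '/^\s?\d+\./';

// Hypothetical helper: strips a leading "N." from each list paragraph.
function stripBulletNumbers(string $xml): string
{
    $d = new DOMDocument();
    $d->loadXML($xml);
    $p = new DOMXPath($d);

    // text()[1] selects the first text-node child of each paragraph.
    foreach ($p->query('//li/paragraph/text()[1]') as $text) {
        $clean = preg_replace(BULLET_PREFIX, '', $text->textContent);
        // Swap only that text node; element siblings stay in place.
        $text->parentNode->replaceChild($d->createTextNode($clean), $text);
    }

    // Serialize the root element without the XML declaration.
    return $d->saveXML($d->documentElement);
}

$out = stripBulletNumbers(
    '<ul><li><paragraph>1.See <a href="#">link</a> and '
    . '<emphasize>this</emphasize></paragraph></li></ul>'
);
echo $out, "\n";
```

Because only the first text node is replaced, this works the same whether the inline child is <strong>, <a href="#"> or <emphasize>.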

2013-05-21 17:07:17 Wrikken
