2014-05-21

Find text between 2 tags with malformed XML and including nested tags

Is there any simple way to find the text between 2 tags in malformed XML, ignoring nesting?

Given this content:

<div> 
    Some content 1 
    </ 
    <some:tag> 
     Section 1 
    </some:tag> 
    <b>Some content 2 
    <some:tag> 
     Section 2 
     <some:tag> 
      Section 3 
     </some:tag> 
    </some:tag> 
    Some content 3 
    </p> 
</div> 

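Note: this is deliberately malformed. I can't/don't want to use a proper HTML/XML parser, because my content is not well-formed, and in some cases is not even XML. Likewise, I can't/don't want to tidy it, because it is not always HTML/XML.

So I need to find the text between <some:tag> and </some:tag>, including any nested tags.

The content above would result in: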

array (size=2) 

    0 => string '<some:tag> 
      Section 1 
     </some:tag>' (length=52) 

    1 => string '<some:tag> 
      Section 2 
      <some:tag> 
       Section 3 
      </some:tag> 
     </some:tag>' (length=125) 

Obligatory "what have you tried":

I've tried using strpos/substr to pull out the matches, but I'm getting a bit lost in the logic:

function findSomeTag($str) { 
    $result = []; 
    $startTag = "<some:tag>"; 
    $endTag = "</some:tag>"; 
    $offset = 0; 
    $start = strpos($str, $startTag, $offset); 
    while ($start !== false) { 
     $nextStart = strpos($str, $startTag, $start + 1); 
     $nextEnd = strpos($str, $endTag, $start + 1); 
     if ($nextStart === false || $nextEnd < $nextStart) { 
      $result[] = substr($str, $start, $nextEnd - $start + strlen($endTag)); 
     } 
     $start = $nextStart; 
    } 
    return $result; 
} 

(Note: the function above does not work at all, and may loop forever.)

Answers

1

Unlike my other answer, this version reads tags nested within tags:

$text = " 
<div> 
    Some content 1 
    </ 
    <some:tag> 
     Section 1 
    </some:tag> 
    <b>Some content 2 
    <some:tag> 
     Section 2 
     <some:tag> 
      Section 3 
     </some:tag> 
    </some:tag> 
    Some content 3 
    </p> 
</div> 
"; 

$parser = new Parser(new TextReader($text)); 
$found = $parser->findTags("<some:tag>", "</some:tag>"); 

class TextReader { 
    private $idx = 0; 
    private $reading; 
    private $lastIdx; 

    public function __construct($reading) { 
     $this->reading = $reading; 
     $this->lastIdx = strlen($reading) - 1; 
    } 

    public function hasMore() { 
     // lastIdx is the index of the final character, so it must be included
     return $this->idx <= $this->lastIdx; 
    } 

    public function nextChar() { 
     if(!$this->hasMore()) return null; 

     return $this->reading[$this->idx++]; 
    } 

    public function rewind($howFar) { 
     $this->idx -= $howFar; 
     if($this->idx < 0) $this->idx = 0; 
    } 
} 


class Parser { 
    private $TextReader; 

    public function __construct($TextReader) { 
     $this->TextReader = $TextReader; 
    } 

    public function findTags($startTagName, $endTagName) { 
     $found = array(); 

     while(($next = $this->findNextTag($startTagName, $endTagName)) != null) { 
      $found[] = $next; 
     } 

     return $found; 
    } 

    public function findNextTag($startTagName, $endTagName) { 
     // find the start of our first tag; the reader is rewound to its beginning 
     $junk = $this->readForTag($startTagName); 
     if($junk === null) return null; // didn't find another tag 

     $nests = 0; // number of start tags that are currently open 

     $startLength = strlen($startTagName); 
     $endLength = strlen($endTagName); 

     $readSoFar = ""; 

     while($this->TextReader->hasMore()) { 
      // read first, then inspect the tail of what we have read 
      $readSoFar .= $this->TextReader->nextChar(); 

      // found a start tag 
      if(substr($readSoFar, -$startLength) === $startTagName) $nests++; 

      // found an end tag 
      if(substr($readSoFar, -$endLength) === $endTagName) { 
       $nests--; 

       // we found as many ends as starts, so the outermost tag is closed 
       if($nests === 0) return $readSoFar; 
      } 
     } 

     // ran out of text before the outermost tag was closed 
     return null; 
    } 

    /* 
    * read the Text Reader until you find a certain tag, and 
    * return what you read before finding the tag, including the tag itself 
    * 
    * Text Reader will be rewound to the beginning of the tag 
    */ 
    private function readForTag($tagName) { 
     $readSoFar = ""; 

     $tagNameLength = strlen($tagName); 

     while($this->TextReader->hasMore()) { 
      // read first, so a tag that ends on the last character is still seen 
      $readSoFar .= $this->TextReader->nextChar(); 

      // if the last few characters read are the tag 
      if(substr($readSoFar, -$tagNameLength) === $tagName) { 
       // rewind 
       $this->TextReader->rewind($tagNameLength); 

       // return what we've read 
       return $readSoFar; 
      } 
     } 

     return null; 
    } 
} 

Seems to work, although I ended up with this: http://stackoverflow.com/a/23796360/268074 – Petah

1

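To include nested tags, you can count the number of currently open tags.

So, while $nextEnd > $nextStart, increment $counter, and only add a new result when $nextEnd < $nextStart && $counter == 1 (you have exactly one open tag). If $nextEnd < $nextStart && $counter > 1, decrement $counter.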
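The counting idea can be sketched roughly as follows. This is only an illustration under some assumptions, not the answerer's code: the function name extractOutermostTags and the variables $depth and $blockStart are made up here, stray closing tags are simply skipped, and a block that is never closed is dropped.

```php
<?php
// Hypothetical helper (not from the thread): collect every outermost
// $startTag ... $endTag block, nested tags included, by counting open tags.
function extractOutermostTags(string $str, string $startTag, string $endTag): array
{
    $result = [];
    $depth = 0;       // number of currently open tags
    $blockStart = 0;  // offset where the current outermost tag opened
    $offset = 0;      // where the next search begins

    while (true) {
        $nextStart = strpos($str, $startTag, $offset);
        $nextEnd = strpos($str, $endTag, $offset);

        if ($nextEnd === false) {
            break; // no closing tag left, so no further block can be completed
        }

        if ($nextStart !== false && $nextStart < $nextEnd) {
            // An opening tag comes first.
            if ($depth === 0) {
                $blockStart = $nextStart; // remember where the outermost tag began
            }
            $depth++;
            $offset = $nextStart + strlen($startTag);
        } else {
            // A closing tag comes first.
            if ($depth > 0) {
                $depth--;
                if ($depth === 0) {
                    // The outermost tag just closed: emit the whole block.
                    $blockEnd = $nextEnd + strlen($endTag);
                    $result[] = substr($str, $blockStart, $blockEnd - $blockStart);
                }
            }
            // With $depth already 0 this is a stray closing tag and is skipped.
            $offset = $nextEnd + strlen($endTag);
        }
    }

    return $result;
}
```

Calling extractOutermostTags($content, "<some:tag>", "</some:tag>") on the content from the question should give the two blocks the question asks for. Searching always resumes just past the tag that was found, so the loop cannot get stuck on the same position.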

0

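I think the easiest way to do any parsing is to use something like a state machine. Basically, you define a set of states and the conditions under which you leave those states and enter others.

Let's assume you have the text in some kind of text reader that can give you the next character and move a pointer forward, and can also rewind the pointer by a given number of characters.

Then you could create a state machine kind of like this (it turns out to be a simple state machine, with only one state that basically loops back into itself):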

class StateMachine { 
    private $TextReader; 

    public function __construct($TextReader) { 
     $this->TextReader = $TextReader; 
    } 

    public function getTagContents($startTagName, $endTagName) { 
     $tagsFound = array(); 

     // read until we get to the start of a tag 
     while($this->stateReadForTag($startTagName) != null) { 
      // now read until we find the end 
      $contents = $this->stateReadForTag($endTagName); 

      // didn't find the end 
      if($contents == null) break; 

      $tagsFound[] = $contents; 
     } 

     return $tagsFound; 
    } 

    /* 
    * read the Text Reader until you find a certain tag, and 
    * return what you read before finding the tag, including the tag itself 
    * 
    * Text Reader will be rewound to the beginning of the tag 
    */ 
    private function stateReadForTag($tagName) { 
     $readSoFar = ""; 

     $tagNameLength = strlen($tagName); 

     while($this->TextReader->hasMore()) { 
      // read first, so a tag that ends on the last character is still seen 
      $readSoFar .= $this->TextReader->nextChar(); 

      // if the last few characters read are the tag 
      if(substr($readSoFar, -$tagNameLength) === $tagName) { 
       // rewind 
       $this->TextReader->rewind($tagNameLength); 

       // return what we've read 
       return $readSoFar; 
      } 
     } 

     return null; 
    } 
} 

Then call it like this:

$found = $myStateMachine->getTagContents("<some:tag>", "</some:tag>"); 

The TextReader would look something like this:

class TextReader { 
    private $idx = 0; 
    private $reading; 
    private $lastIdx; 

    public function __construct($reading) { 
     $this->reading = $reading; 
     $this->lastIdx = strlen($reading) - 1; 
    } 

    public function hasMore() { 
     // lastIdx is the index of the final character, so it must be included
     return $this->idx <= $this->lastIdx; 
    } 

    public function nextChar() { 
     if(!$this->hasMore()) return null; 

     return $this->reading[$this->idx++]; 
    } 

    public function rewind($howFar) { 
     $this->idx -= $howFar; 
     if($this->idx < 0) $this->idx = 0; 
    } 
} 

Then you would call your state machine like this:

$myStateMachine = new StateMachine(new TextReader($myXmlFileContents)); 
$found = $myStateMachine->getTagContents("<some:tag>", "</some:tag>"); 

Seems to almost work, but it doesn't return the content after the end of the tag: http://viper-7.com/yR7xND – Petah

@Petah It will keep reading until it finds the end of the tag you're looking for. Oh, I see, you want to allow other '' inside your original tag. – Cully

I'll add something to do that. – Cully

0

Ended up with this:

class TagExtractor { 

    public $content; 
    public $tag; 

    public function getTagContent() { 
     $result = []; 
     $startTag = "<{$this->getTag()}>"; 
     $endTag = "</{$this->getTag()}>"; 
     $content = $this->getContent(); 
     $offset = strpos($content, $startTag); 
     while ($offset !== false) { 
      $end = $this->findEnd($content, $offset, $startTag, $endTag); 
      $result[] = substr($content, $offset, $end - $offset); 
      $offset = strpos($content, $startTag, $end); 
     } 
     return $result; 
    } 

    public function findEnd($content, $offset, $startTag, $endTag, $counter = 1) { 
     $offset++; 
     $nextStart = strpos($content, $startTag, $offset); 
     $nextEnd = strpos($content, $endTag, $offset); 
     if ($nextEnd === false) { 
      $counter = 0; 
     } elseif ($nextStart < $nextEnd && $nextStart !== false) { 
      $counter++; 
      $offset = $nextStart; 
     } elseif ($nextEnd < $nextStart || ($nextStart === false && $nextEnd !== false)) { 
      $counter--; 
      $offset = $nextEnd; 
     } 
     if ($counter === 0) { 
      return $offset + strlen($endTag); 
     } 
     return $this->findEnd($content, $offset, $startTag, $endTag, $counter); 
    } 

    // <editor-fold defaultstate="collapsed" desc="Getters and setters"> 
    public function getContent() { 
     return $this->content; 
    } 

    public function setContent($content) { 
     $this->content = $content; 
     return $this; 
    } 

    public function getTag() { 
     return $this->tag; 
    } 

    public function setTag($tag) { 
     $this->tag = $tag; 
     return $this; 
    } 
    // </editor-fold> 
} 