PHP DOMDocument: get text between two sets of tags

Is there a way to use XPath to parse the text between two SETS of tags? For example:

<div class="par"> 
    <p class="pp"> 
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo. 
    </p> 
</div> 
<div class="par"> 
    <p class="pp"> 
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
    </p> 
</div>

I want to parse this by grabbing the text between the sets of SPAN tags, to end up with an array like this:

array[0] = "Blah blah blah blah."; 
array[1] = "Yada yada yada yada."; 
array[2] = "Foo foo foo foo."; 
array[3] = "Hmm hmm hmm hmm.";

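Can I do this simply with DOMDocument? If not, what is the best way to achieve it? Note that there may be inline tags, such as <a>, in the middle of a sentence. For example: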

...<span class="dv">5 </span>Uhh uhh <a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>...

Source

2014-12-28 genechunlee

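It might be a good idea to tell us what "two sets of tags" means for your example. –

Between the sets of SPAN tags. But I do realize that the last piece of text I want isn't BETWEEN two sets, it just comes after the last span tag... – genechunlee

If the case is always this simple, I think you can use XPath to get the contents of the child DOMText nodes of 'p.pp'. – prodigitalson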

UPDATE

It seems you want a flat list, so I'm adding this specific example to avoid any confusion:

$html = '<div class="par"> 
    <p class="pp"> 
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo. 
    </p> 
</div> 
<div class="par"> 
    <p class="pp"> 
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
    </p> 
</div>'; 

// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// select THE TEXT NODES of all p elements with the class pp
// - note that this matches only when the attribute is exactly class="pp",
// not when "pp" is merely one entry in the class list; you may need to change this depending...
// post additional questions for specific xpath help
$found = $finder->query('//p[@class="pp"]/text()');

$nodes = array(); 
// simply transform the resulting DOMNodeList into an array 
// for easier consumption/manipulation 
foreach($found as $textNode) {
    // append to $nodes, the array that is printed below
    $nodes[] = $textNode->nodeValue;
}

print_r($nodes);

Produces:

Array 
(
    [0] => 

    [1] => Blah blah blah blah. 
    [2] => Yada 
    yada yada yada. 
    [3] => Foo foo foo foo. 

    [4] => 

    [5] => Hmm hmm hmm hmm. 

)
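The comment in the code above points out that `@class="pp"` only matches when the attribute is exactly `pp`. As a minimal sketch of one way around that, assuming elements may carry several classes, the common XPath 1.0 idiom below matches `pp` as a whole token in the class list. The markup, including `class="pp intro"` and `class="apple"`, is made up for illustration.

```php
<?php
// Hypothetical markup: the first <p> has an extra class, the third has a class
// that merely contains the substring "pp" and should not match.
$html = '<div><p class="pp intro"><span class="dv">1 </span>Blah.</p>'
      . '<p class="pp"><span class="dv">2 </span>Yada.</p>'
      . '<p class="apple"><span class="dv">3 </span>Nope.</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// Pad the normalized class list with spaces so that " pp " matches whole tokens only.
$expr = '//p[contains(concat(" ", normalize-space(@class), " "), " pp ")]/text()';

$texts = array();
foreach ($finder->query($expr) as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}

print_r($texts);
```

A plain `contains(@class, "pp")` would also match `apple`, which is why the padding with spaces is there.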

If the case is always this simple, I think you can use XPath to get the contents of the child DOMText nodes of p.pp.

$html = '<div class="par"> 
    <p class="pp"> 
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo. 
    </p> 
</div> 
<div class="par"> 
    <p class="pp"> 
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
    </p> 
</div>'; 

// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// select all p elements with the class pp - note that this matches only when the
// attribute is exactly class="pp", not when "pp" is merely one entry in the class list;
// you may need to change this depending...
// post additional questions for specific xpath help
$found = $finder->query('//p[@class="pp"]');

$nodes = array(); 

foreach($found as $p) { 
    // for each p element, pull its text nodes. 
    $textNodes = $finder->query('text()', $p); 
    $textStr = ''; 
    // loop over the textNodes and concat them into a single string 
    foreach ($textNodes as $n) { 
     $textStr .= $n->nodeValue; 
    } 
    // push the compiled string onto the array 
    $nodes[] = $textStr; 
} 

print_r($nodes);

This will produce a result like:

Array 
(
    [0] => 
    Blah blah blah blah. Yada 
    yada yada yada. Foo foo foo foo. 

    [1] => 
    Hmm hmm hmm hmm. 

)

If you really want each text node kept separate, you only need to change the loop:

foreach($found as $p) {
    // for each p element, pull its text nodes.
    $textNodes = $finder->query('text()', $p);
    $textArr = array();
    // loop over the textNodes and collect each one as its own array entry
    foreach ($textNodes as $n) {
        $textArr[] = $n->nodeValue;
    }
    // push the array of text nodes for this p onto the result
    $nodes[] = $textArr;
}

Which will give you:

Array 
(
    [0] => Array 
     (
      [0] => 

      [1] => Blah blah blah blah. 
      [2] => Yada 
    yada yada yada. 
      [3] => Foo foo foo foo. 

     ) 

    [1] => Array 
     (
      [0] => 

      [1] => Hmm hmm hmm hmm. 

     ) 

)

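As you can see, it has also grabbed the line breaks. If they are not wanted, you can easily filter them out with the array filtering method of your choice. Alternatively, you could look into the XPath and DOMDocument settings to adjust this; IIRC there are some settings that control how whitespace is interpreted (or not), which might let you avoid this, but that may not be desirable if you are doing other processing with the same DOMDocument instance.

Source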
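One possible way to do the filtering mentioned above, sketched with plain array functions. The `$nodes` array here is typed in by hand to mimic the flat text-node output shown earlier, and `$clean` is just an illustrative name.

```php
<?php
// Sample input mimicking the flat text-node list, including whitespace-only entries.
$nodes = array(
    "\n    ",
    "Blah blah blah blah. ",
    " Yada \n    yada yada yada. ",
    "Foo foo foo foo. \n    ",
    "\n    ",
    "Hmm hmm hmm hmm. \n    ",
);

// Collapse every run of whitespace into one space and trim both ends.
$clean = array_map(function ($text) {
    return trim(preg_replace('~\s+~u', ' ', $text));
}, $nodes);

// Drop the entries that are now empty, then reindex from 0.
$clean = array_values(array_filter($clean, function ($text) {
    return $text !== '';
}));

print_r($clean);
```

Normalizing before filtering means the whitespace-only nodes collapse to empty strings, so a single emptiness check is enough to remove them.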

2014-12-28 06:16:35 prodigitalson

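This is exactly what I wanted. Thank you so much!!! – genechunlee

It should also be noted that this isn't technically pulling anything "between two sets of tags"; it simply omits the contents of those tags, as long as they share the same parent. – prodigitalson

Yes, I can see that. Thanks! – genechunlee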
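To illustrate that caveat, here is a small sketch using the `<a>` case from the question. The markup is adapted from the question's example, and the sentence after span 6 is made up. Selecting the child text nodes of the paragraph skips everything nested inside other elements:

```php
<?php
// Markup adapted from the question's <a> example; "Ok ok." is invented filler.
$html = '<p class="pp"><span class="dv">5 </span>Uhh uhh '
      . '<a href="www.uhh.com">uhh</a> uhh. '
      . '<span class="dv">6 </span>Ok ok.</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

$parts = array();
foreach ($finder->query('//p[@class="pp"]/text()') as $textNode) {
    $parts[] = $textNode->nodeValue;
}

// Only direct child text nodes are selected, so the link text inside <a> is absent
// and the sentence around the link arrives as two separate pieces.
print_r($parts);
```

So with inline tags in the middle of a sentence, this approach both splits the sentence and loses the linked word.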

You want the first text node that is a following sibling of each span element:

//span/following-sibling::text()[1]

Translated 1:1 into PHP:

$doc = new DOMDocument(); 
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED); 
$xpath = new DOMXPath($doc); 

$expr = '//span/following-sibling::text()[1]'; 
$result = $xpath->evaluate($expr);

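Then you want to turn the resulting text nodes into an array of strings. While you're at it, I'd suggest running some whitespace normalization over them: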

$array = array_map(function(DOMText $text) { 
    return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue); 
}, iterator_to_array($result));

The result is then:

[ 
    "Blah blah blah blah.", 
    "Yada yada yada yada.", 
    "Foo foo foo foo.", 
    "Hmm hmm hmm hmm." 
]

Full code example:

<?php 
/** 
* http://stackoverflow.com/questions/27674012/php-domdocument-get-text-between-two-sets-of-tags 
*/ 

$buffer = <<<HTML
<div class="par"> 
    <p class="pp"> 
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo. 
    </p> 
</div> 
<div class="par"> 
    <p class="pp"> 
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
    </p> 
</div> 
HTML; 

$doc = new DOMDocument(); 
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED); 
$xpath = new DOMXPath($doc); 

$expr = '//span/following-sibling::text()[1]'; 
$result = $xpath->evaluate($expr); 

$array = array_map(function(DOMText $text) { 
    return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue); 
}, iterator_to_array($result)); 

echo json_encode($array, JSON_PRETTY_PRINT);
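The question also mentions inline tags such as `<a>` in the middle of a sentence. In that case `following-sibling::text()[1]` would only return the text up to the link. Below is a rough sketch of one alternative, not part of the answer above: it walks the siblings after each marker span and concatenates their `textContent` until the next marker. It assumes the markers are always `span` elements whose class is exactly `dv`. The markup is adapted from the question, the sentence after span 6 is made up, and `$sentences` is an illustrative name.

```php
<?php
// Markup adapted from the question's <a> example; "Ok ok." is invented filler.
$buffer = '<div class="par"><p class="pp">'
        . '<span class="dv">5 </span>Uhh uhh <a href="www.uhh.com">uhh</a> uhh. '
        . '<span class="dv">6 </span>Ok ok.'
        . '</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($buffer);
$xpath = new DOMXPath($doc);

$sentences = array();
foreach ($xpath->query('//span[@class="dv"]') as $span) {
    $text = '';
    // Walk the siblings after this marker until the next marker span.
    for ($node = $span->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node instanceof DOMElement
            && $node->tagName === 'span'
            && $node->getAttribute('class') === 'dv') {
            break;
        }
        // textContent includes the text of nested elements such as <a>.
        $text .= $node->textContent;
    }
    // Same whitespace normalization idea as above: collapse runs, trim the ends.
    $sentences[] = trim(preg_replace('~\s+~u', ' ', $text));
}

echo json_encode($sentences, JSON_PRETTY_PRINT);
```

The trade-off is that this does the grouping in PHP rather than in a single XPath expression, which keeps the expression simple at the cost of a manual loop.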

Source

2015-01-01 19:06:10 hakre
