2014-12-28
0

PHP DOMDocument: get text between two sets of tags. Is there a way to use XPath to parse the text between two SETS of tags? For example:

<div class="par"> 
    <p class="pp"> 
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo. 
    </p> 
</div> 
<div class="par"> 
    <p class="pp"> 
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
    </p> 
</div> 

I want to parse this by taking the text between the sets of SPAN tags, to get an array like the following:

array[0] = "Blah blah blah blah."; 
array[1] = "Yada yada yada yada."; 
array[2] = "Foo foo foo foo."; 
array[3] = "Hmm hmm hmm hmm."; 

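Can I do this simply with DOMDocument? If not, what is the best way to accomplish it? Note that other tags (such as the <a> below) may appear in the middle of a sentence. For example: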

...<span class="dv">5 </span>Uhh uhh <a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>... 
+0

It might be a good idea to tell us how "two sets of tags" applies to your example. –

+0

Between the sets of SPAN tags. But I do realize that the last piece of text I want is not BETWEEN two sets, it just comes after the last span tag... – genechunlee

+1

If the case is always this simple, I think you can use XPath to get the contents of the child DOMText nodes in 'p.pp'. – prodigitalson

Answers

3

UPDATE

It seems you did want a flat list, so I am adding this specific example to avoid any confusion:

$html = '<div class="par"> 
    <p class="pp"> 
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo. 
    </p> 
</div> 
<div class="par"> 
    <p class="pp"> 
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
    </p> 
</div>'; 

// loadHTML() is an instance method; calling it statically fails on current PHP versions
$dom = new DOMDocument(); 
$dom->loadHTML($html); 
$finder = new DOMXPath($dom); 
// select THE TEXT NODES of all p elements with the class pp 
// - note that this matches class="pp" exactly, 
// not "pp" appearing anywhere in the class list; you may need to change this up depending... 
// post additional questions for specific xpath help 
$found = $finder->query('//p[@class="pp"]/text()'); 

$nodes = array(); 
// simply transform the resulting DOMNodeList into an array 
// for easier consumption/manipulation 
foreach($found as $textNode) { 
    $nodes[] = $textNode->nodeValue; 
} 

print_r($nodes); 

Produces:

Array 
(
    [0] => 

    [1] => Blah blah blah blah. 
    [2] => Yada 
    yada yada yada. 
    [3] => Foo foo foo foo. 

    [4] => 

    [5] => Hmm hmm hmm hmm. 

) 
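The code comment above notes that `@class="pp"` only matches when the attribute is exactly "pp". As a rough sketch of one common workaround, the predicate below matches "pp" as a whole token anywhere in the class list. The sample markup and the `$matched` variable are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical markup: one multi-class match, one exact match, one near miss.
$html = '<p class="pp intro">One</p><p class="pp">Two</p><p class="app">Three</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// Pad the normalized class list with spaces so that " pp " only matches
// a whole class token, not a substring such as "app".
$expr = '//p[contains(concat(" ", normalize-space(@class), " "), " pp ")]';

$matched = array();
foreach ($finder->query($expr) as $p) {
    $matched[] = $p->textContent;
}

print_r($matched);
```

Appending `/text()` to the same expression should select the text nodes instead of the elements, as in the query above.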

If the case is always this simple, I think you can use XPath to get the contents of the child DOMText nodes in p.pp.

$html = '<div class="par"> 
    <p class="pp"> 
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo. 
    </p> 
</div> 
<div class="par"> 
    <p class="pp"> 
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
    </p> 
</div>'; 

// loadHTML() is an instance method; calling it statically fails on current PHP versions
$dom = new DOMDocument(); 
$dom->loadHTML($html); 
$finder = new DOMXPath($dom); 
// select all p elements with the class pp - note that this matches class="pp" exactly, 
// not "pp" appearing anywhere in the class list; you may need to change this up depending... 
// post additional questions for specific xpath help 
$found = $finder->query('//p[@class="pp"]'); 

$nodes = array(); 

foreach($found as $p) { 
    // for each p element, pull its text nodes. 
    $textNodes = $finder->query('text()', $p); 
    $textStr = ''; 
    // loop over the textNodes and concat them into a single string 
    foreach ($textNodes as $n) { 
     $textStr .= $n->nodeValue; 
    } 
    // push the compiled string onto the array 
    $nodes[] = $textStr; 
} 

print_r($nodes); 

This will produce a result like:

Array 
(
    [0] => 
    Blah blah blah blah. Yada 
    yada yada yada. Foo foo foo foo. 

    [1] => 
    Hmm hmm hmm hmm. 

) 

If you really want each text node kept separate, you only need to change the loop:

foreach($found as $p) { 
    // for each p element, pull its text nodes. 
    $textNodes = $finder->query('text()', $p); 
    $textArr = array(); 
    // loop over the textNodes and collect each one as its own array entry 
    foreach ($textNodes as $n) { 
     $textArr[] = $n->nodeValue; 
    } 
    // push this paragraph's array of text nodes onto the result 
    $nodes[] = $textArr; 
} 

which will give you:

Array 
(
    [0] => Array 
     (
      [0] => 

      [1] => Blah blah blah blah. 
      [2] => Yada 
    yada yada yada. 
      [3] => Foo foo foo foo. 

     ) 

    [1] => Array 
     (
      [0] => 

      [1] => Hmm hmm hmm hmm. 

     ) 

) 

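As you can see, it has also grabbed the newlines. If those are undesirable, you can easily filter them out with the array filtering method of your choice. Alternatively, you could look into the XPath and DOMDocument settings to adjust this. IIRC there are some settings that control how whitespace is interpreted (or not), which might let you avoid this, though that may matter if you are doing other processing on the same DOMDocument instance.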
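As one possible way to do that filtering, the sketch below skips whitespace-only text nodes inside the XPath query using the standard `normalize-space()` function, then collapses the remaining whitespace in PHP. The `$clean` variable name and the choice of `preg_replace` plus `trim` are illustrative assumptions, not part of the original answer.

```php
<?php
$html = '<div class="par">
    <p class="pp">
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
    </p>
</div>
<div class="par">
    <p class="pp">
    <span class="dv">4 </span>Hmm hmm hmm hmm.
    </p>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// The [normalize-space()] predicate drops text nodes that hold only whitespace.
$found = $finder->query('//p[@class="pp"]/text()[normalize-space()]');

$clean = array();
foreach ($found as $textNode) {
    // Collapse runs of whitespace (including newlines) and trim both ends.
    $clean[] = trim(preg_replace('~\s+~u', ' ', $textNode->nodeValue));
}

print_r($clean);
```

With the sample markup this should leave four entries, one per sentence, with no blank items in between.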

+0

This is exactly what I wanted. Thank you so much!!! – genechunlee

+1

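It should also be pointed out that this is not technically pulling anything "between two sets of tags"; it simply omits the contents of those tags, as long as they share the same parent. – prodigitalson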

+0

Yes, I can see that. Thanks! – genechunlee

1

You want the first text node that directly follows each span element as a sibling:

//span/following-sibling::text()[1] 

This translates 1:1 into PHP:

$doc = new DOMDocument(); 
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED); 
$xpath = new DOMXPath($doc); 

$expr = '//span/following-sibling::text()[1]'; 
$result = $xpath->evaluate($expr); 

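Then you want to turn the resulting text nodes into an array of strings. While you are at it, I would also run some whitespace normalization over them: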

$array = array_map(function(DOMText $text) { 
    return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue); 
}, iterator_to_array($result)); 

The result is then:

[ 
    "Blah blah blah blah.", 
    "Yada yada yada yada.", 
    "Foo foo foo foo.", 
    "Hmm hmm hmm hmm." 
] 

Full code example:

<?php 
/** 
* http://stackoverflow.com/questions/27674012/php-domdocument-get-text-between-two-sets-of-tags 
*/ 

$buffer = <<<HTML 
<div class="par"> 
    <p class="pp"> 
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo. 
    </p> 
</div> 
<div class="par"> 
    <p class="pp"> 
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
    </p> 
</div> 
HTML; 

$doc = new DOMDocument(); 
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED); 
$xpath = new DOMXPath($doc); 

$expr = '//span/following-sibling::text()[1]'; 
$result = $xpath->evaluate($expr); 

$array = array_map(function(DOMText $text) { 
    return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue); 
}, iterator_to_array($result)); 

echo json_encode($array, JSON_PRETTY_PRINT); 
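The question also mentions that other tags, such as a link, may sit in the middle of a sentence. In that case `following-sibling::text()[1]` would stop at the first inline tag. A minimal sketch of one way around this, assuming every sentence starts at a `span` with class "dv": walk the siblings after each marker span and concatenate their text until the next marker. The sample markup (including the "Next one." sentence) is hypothetical and only loosely based on the question's snippet.

```php
<?php
// Hypothetical input with an inline link inside the first sentence.
$buffer = '<p class="pp"><span class="dv">5 </span>Uhh uhh '
        . '<a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>Next one.</p>';

$doc = new DOMDocument();
$doc->loadHTML($buffer);
$xpath = new DOMXPath($doc);

$array = [];
foreach ($xpath->query('//span[@class="dv"]') as $span) {
    $text = '';
    // Walk the following siblings until the next marker span or the end of the parent.
    for ($node = $span->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node instanceof DOMElement
            && $node->nodeName === 'span'
            && $node->getAttribute('class') === 'dv') {
            break;
        }
        // textContent covers both plain text nodes and inline elements such as <a>.
        $text .= $node->textContent;
    }
    // Same idea as the whitespace normalization above: collapse runs, trim the ends.
    $array[] = trim(preg_replace('~\s+~u', ' ', $text));
}

echo json_encode($array, JSON_PRETTY_PRINT);
```

The trade-off is a PHP loop per marker instead of a single XPath expression, but text inside nested inline tags is kept.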