Parsing HTML from a defined start point to a defined end point?

I have some HTML:

<hr noshade> 
<p><a href="#1">Some text here</a></p> 
<p style="margin-top:0pt;margin-bottom:0pt;line-height:120%;"><span style="color:#000000;font-weight:bold;">This is some description</span></p> 
<hr noshade> <!-- so <hr noshade> is the delimiter for me --> 
<p><a href="#2">Some more text here</a></p> 
<p style="margin-top:0pt;margin-bottom:0pt;line-height:120%;"><span style="color:#000000;font-weight:bold;">This is description for some more text</span></p> 
<hr noshade>

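I am parsing this with Nokogiri, and I want to print the information for each group of tags, where the groups are separated by my own delimiter, <hr noshade>. So the first block should print the information from all the "p" tags between the first two hr noshade tags, and so on.

Source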

2013-09-24 Rohan Dalvi

Building on the accepted answer to "XPath select all elements between two specific elements", I only have a half-satisfactory solution.

You can use this XPath expression:

.//hr[1][@noshade] 
    /following-sibling::*[not(self::hr[@noshade])] 
         [count(preceding-sibling::hr[@noshade])=1]

for the first group, the elements between <hr noshade> 1 and 2.

Then,

.//hr[2][@noshade] 
    /following-sibling::*[not(self::hr[@noshade])] 
         [count(preceding-sibling::hr[@noshade])=2]

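for the elements between <hr noshade> 2 and 3, and so on.

What these expressions select:

all following siblings of an <hr noshade>, specified by its position N
that have exactly N preceding <hr noshade> siblings, i.e. that are located in the N-th group
and that are not <hr noshade> themselves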

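Since this selects several elements between two <hr noshade>, you will probably have to loop over the results and extract the data from each sibling element.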

Does anyone have a more general solution?

Source

2013-09-24 17:46:53

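Thanks for the reply. Yes, this makes sense to me. I am now trying to think of a more general solution, because the HTML file is generated automatically by software, so I don't know how many <hr noshade> tags it might generate.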

So, I tried this: path = '//hr[1][@noshade]/following-sibling::*[not(self::hr[@noshade])][count(preceding-sibling::hr[@noshade])=1]' followed by xpath = doc.xpath(path), but I get an error: unexpected ']' after 'equal' (Nokogiri::CSS::SyntaxError)

A CSS::SyntaxError? I did not test with Nokogiri, only with Python's lxml.html.
