Python XPath: get text only from the root element

I'm using XPath to scrape values from a web page, but I'm having trouble with one part of the code:

<div class="description"> 
    here's the page description 
    <span> some other text</span> 
    <span> another tag </span> 
</div>

I'm using this code to get the value from the element:

description = tree.xpath('//div[@class="description"]/text()')

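I'm able to find the correct div, but I only want the text "here's the page description", not the content of the inner span tags.

Does anyone know how I can get the text of the root node only, without the content of its child nodes?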

2016-05-21 Dennis

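That XPath expression should not include the contents of the spans, only the text nodes that are direct children of the div: ["\nhere's the page description\n", '\n', '\n'] – mata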

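The expression you are currently using actually matches only the top-level text child nodes. You can wrap it in normalize-space() to strip the extra newlines and spaces from the text: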

>>> from lxml.html import fromstring 
>>> data = """ 
... <div class="description"> 
... here's the page description 
... <span> some other text</span> 
... <span> another tag </span> 
... </div> 
... """ 
>>> root = fromstring(data) 
>>> root.xpath('normalize-space(//div[@class="description"]/text())') 
"here's the page description"
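
Note that normalize-space() applied to a node-set uses only the first node. If you would rather keep a list of the direct text nodes and just drop the whitespace-only ones, something like the following should work. This is a minimal sketch using the same sample markup; the variable names (direct_text, cleaned, non_blank) are only illustrative.

```python
from lxml.html import fromstring

data = """
<div class="description">
    here's the page description
    <span> some other text</span>
    <span> another tag </span>
</div>
"""

root = fromstring(data)

# text() selects only the text nodes that are direct children of the div,
# so the span contents are never part of the result.
direct_text = root.xpath('//div[@class="description"]/text()')

# Drop the whitespace-only nodes that sit between the child tags
# and trim the remaining ones in Python.
cleaned = [t.strip() for t in direct_text if t.strip()]

# The same filter expressed in XPath: keep only text nodes that are
# non-empty after whitespace normalization (the values are not trimmed).
non_blank = root.xpath('//div[@class="description"]/text()[normalize-space()]')

print(cleaned)
```

The Python filter trims each value as well, while the XPath predicate only removes the blank nodes, so pick whichever fits the rest of your code.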

To get the complete text of a node, including its child nodes, use the .text_content() method:

node = tree.xpath('//div[@class="description"]')[0] 
print(node.text_content())
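
For comparison, here is a small sketch of how .text_content() differs from the element's own text. It assumes a shortened version of the sample markup, and the names full_text and own_text are made up for the illustration. In lxml, the .text attribute holds only the text that comes before the first child element; any text after a child is stored on that child's .tail instead.

```python
from lxml.html import fromstring

# Shortened version of the sample markup (an assumption for this sketch).
html = (
    '<div class="description">here\'s the page description'
    '<span> some other text</span></div>'
)
node = fromstring(html)

# text_content() concatenates the text of the element and all descendants.
full_text = node.text_content()

# .text is only the text before the first child element, so the span
# content is not included.
own_text = node.text

print(full_text)
print(own_text)
```

So if the goal is just the div's leading text, node.text (possibly followed by .strip()) may be enough without any XPath at all.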

2016-05-21 20:51:38 alecxe

Thanks, but I think my question wasn't clear: I don't want the content of the child nodes, only that of the root node – Dennis

@Dennis My bad, but you should be fine with the expression you currently have; it only matches the top-level text nodes. – alecxe
