lxml.html: extract text by searching for a keyword

I have a piece of HTML like the one below:

<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>

I want to extract the string "The Keyword: The text" from this piece of HTML.

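I know I can use Chrome's Inspect or Firebug in Firefox to get the XPath of the HTML above, then call select(xpath).extract() and strip the HTML tags to get the string. However, the XPath is not consistent across pages, so this approach is not generic enough.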

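So I am thinking of the following approach: first, search for the keyword "The Keyword:" like this (the code uses scrapy's HtmlXPathSelector, because I don't know how to do the same with lxml.html):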

hxs = HtmlXPathSelector(response) 
hxs.select('//*[contains(text(), "The Keyword:")]')

When I pprint the result, I get this back:

>>> pprint(hxs.select('//*[contains(text(), "The Keyword:")]')) 
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>
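For reference, the same search can probably be written with lxml.html directly. This is a minimal sketch, assuming the page HTML is already available as a string; the `page` variable and its contents are made up for illustration and are not from the original post.

```python
from lxml import html

# Hypothetical stand-in for a downloaded page (e.g. response.body in scrapy)
page = (
    '<html><body><ul>'
    '<li><label>The Keyword:</label>'
    '<span><a href="../../..">The text</a></span></li>'
    '</ul></body></html>'
)

tree = html.fromstring(page)

# Same XPath as the HtmlXPathSelector call: any element whose first
# text node contains the keyword
matches = tree.xpath('//*[contains(text(), "The Keyword:")]')

for element in matches:
    print(element.tag, element.text)  # label The Keyword:
```

As with the scrapy selector, this only locates the label element; the wanted text still has to be reached from there.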

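My question is how to get the desired string: "The Keyword: The text". I am trying to work out how to determine the XPath; if the XPath is known, then of course I can get the desired string.

I am open to any solution other than lxml.html.

Thanks.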


2012-12-22 learnJQueryUI

Near duplicate of http://stackoverflow.com/questions/14004439/scrapy-htmlxpathselector-determine-xpath-by-searching-for-keyword. Should the two questions be merged? – Talvalin

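from lxml import html

# The sample fragment from the question
s = '<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>'

tree = html.fromstring(s)
# text_content() returns the text of the element and all its descendants, tags dropped
text = tree.text_content()
print(text)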


2012-12-22 16:53:04

The problem is that I have the full HTML; I don't have s. – learnJQueryUI
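With a full page, one option is to parse the whole document and call text_content() only on the li that holds the matching label, rather than on the root. A rough sketch, assuming the label always sits directly inside an li; the `page` string, including the extra "Other:" row, is invented for illustration.

```python
from lxml import html

# Hypothetical full page with more than one list item
page = """<html><body>
<ul>
<li><label>Other:</label><span>ignored</span></li>
<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>
</ul>
</body></html>"""

tree = html.fromstring(page)

# Select only the <li> elements that have a <label> child containing the keyword
items = tree.xpath('//li[label[contains(text(), "The Keyword:")]]')

# text_content() joins the text of the label and everything after it
result = [li.text_content() for li in items]
print(result)  # ['The Keyword:The text']
```

The label and the link text are joined without a space here, because the sample markup has no whitespace between the two elements.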

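You can tweak the XPath slightly to work with the current structure: get the label's parent, then look back down for the first a element, and take the text from that...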

>>> tree.xpath('//*[contains(text(), "The Keyword:")]/..//a/text()') 
['The text']

But that may not be flexible enough...
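If the value is not always wrapped in an a element, a slightly looser variant is to take the element that follows the label and read all of its text. This is only a sketch under that assumption; `text_after_keyword` is a hypothetical helper name and the sample `page` is made up, neither comes from the original answer.

```python
from lxml import html


def text_after_keyword(tree, keyword):
    """Return the text of the element right after the one containing keyword.

    Hypothetical helper: assumes the value lives in the label's next sibling.
    """
    # $kw is an XPath variable; lxml fills it from the keyword argument
    labels = tree.xpath('//*[contains(text(), $kw)]', kw=keyword)
    if not labels:
        return None
    sibling = labels[0].getnext()
    if sibling is None:
        return None
    return sibling.text_content().strip()


# Made-up sample: one value inside a link, one as plain text
page = (
    '<ul>'
    '<li><label>The Keyword:</label>'
    '<span><a href="../../..">The text</a></span></li>'
    '<li><label>Plain:</label><span>No link here</span></li>'
    '</ul>'
)
tree = html.fromstring(page)

print(text_after_keyword(tree, "The Keyword:"))  # The text
print(text_after_keyword(tree, "Plain:"))        # No link here
```

Passing the keyword as an XPath variable avoids building the expression by string formatting, which would break on keywords that contain quotes.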


2012-12-22 16:54:11
