Getting the HTML text between tags (Python, lxml, urllib, XPath)

I want to parse some HTML and retrieve the actual HTML between the tags, but instead my code gives me what I believe are the memory locations of the elements.

Here is my code so far:

import urllib.request, http.cookiejar 
from lxml import etree 
import io 
site = "http://somewebsite.com" 


cj = http.cookiejar.CookieJar() 
request = urllib.request.Request(site) 
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj)) 
request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0') 
html = etree.HTML(opener.open(request).read()) 

xpath = "//li[1]//cite[1]" 
filtered_html = html.xpath(xpath) 
print(filtered_html)

Here is a piece of the HTML:

<div class="f kv"> 
<cite> 
www. 
<b>hello</b> 
online.com/ 
</cite> 
<span class="vshid"> 
</div>

Currently my code returns:

[<Element cite at 0x36a65e8>, <Element cite at 0x36a6510>, <Element cite at 0x36a64c8>]

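How can I extract the actual HTML between the cite tags? If I add "/text()" to the end of my XPath it gets me closer, but it leaves out the content of the b tag. My end goal is for my code to give me "www.helloonline.com/".

Thanks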

2012-12-30 JSoothe

'html', or 'text'? Do you want `['www.', 'hello', 'online.com/']`? –

Well, I thought I would first have to get the HTML and then strip the tags, but what I really want is to combine your results into "www.helloonline.com/" – JSoothe

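Use .//text() to get all the text nodes under a given element. Note that xpath() returns a list, so pick one element first, and keep the leading dot so the search stays relative to that element instead of starting from the document root:

text = filtered_html[0].xpath('.//text()') 
print(''.join(t.strip() for t in text)) # prints "www.helloonline.com/"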

2012-12-30 18:25:13

Thanks, that fixed it. – JSoothe
