lxml - Difficulty parsing the Stack Exchange RSS feed

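I am having trouble parsing the Stack Exchange RSS feed in Python. When I try to get the summary nodes, an empty list is returned.

I have been trying to solve this, but I can't get my head around it.

Can anyone help? Thanks.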

In [30]: import lxml.etree, urllib2

In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'

In [32]: cooking_content = urllib2.urlopen(url_cooking) 

In [33]: cooking_parsed = lxml.etree.parse(cooking_content) 

In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary') 

In [35]: cooking_texts 
Out[35]: []

2012-02-23 MrCastro

Take a look at these two versions:

import lxml.html, lxml.etree 

url_cooking = 'http://cooking.stackexchange.com/feeds' 

#lxml.etree version 
data = lxml.etree.parse(url_cooking) 
summary_nodes = data.xpath('.//feed/entry/summary') 
print('Found ' + str(len(summary_nodes)) + ' summary nodes') 

#lxml.html version 
data = lxml.html.parse(url_cooking) 
summary_nodes = data.xpath('.//feed/entry/summary') 
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

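As you discovered, the lxml.etree version returns no nodes, but the lxml.html version works fine. The etree version fails because it expects namespaces, while the html version works because it ignores them. Partway down http://lxml.de/lxmlhtml.html, it says: "The HTML parser notably ignores namespaces and some other XMLisms."

Note that when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element in the namespace http://www.w3.org/2005/Atom. Here is a corrected version of the etree code.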

import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds' 

ns = 'http://www.w3.org/2005/Atom' 
ns_map = {'ns': ns} 

data = lxml.etree.parse(url_cooking) 
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map) 
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
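
To see the effect without fetching the live feed, here is a minimal offline sketch. The inline Atom document and the name ATOM_SAMPLE are made up for illustration and only mimic the structure of the real feed; the XPath expressions are the ones from the question and from the corrected code above.

```python
from lxml import etree

# Hypothetical, trimmed-down Atom document standing in for the real feed.
ATOM_SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <title>First question</title>
    <summary type="html">First summary</summary>
  </entry>
  <entry>
    <title>Second question</title>
    <summary type="html">Second summary</summary>
  </entry>
</feed>"""

root = etree.fromstring(ATOM_SAMPLE)

# Without a prefix, XPath looks for elements in no namespace and finds nothing.
plain = root.xpath('//feed/entry/summary')

# With a prefix bound to the Atom namespace, the same path matches.
ns_map = {'ns': 'http://www.w3.org/2005/Atom'}
namespaced = root.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)

print('Without namespaces: ' + str(len(plain)))
print('With namespaces: ' + str(len(namespaced)))
print([node.text for node in namespaced])
```

The prefix 'ns' is arbitrary; only the URI it is bound to has to match the one declared in the document.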

2012-02-23 08:38:44 gfortune

`data.xpath('//ns:feed/ns:entry/ns:summary', namespaces={'ns': 'http://www.w3.org/2005/Atom'})` – reclosedev 2012-02-23 08:46:31

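Gah, no wonder! Looks like the API renamed the 'namespaces' keyword at some point. Updated my example with working code. – gfortune 2012-02-23 09:05:55

Thank you very much for this. I will start checking the root before I begin parsing. – MrCastro 2012-02-23 09:31:31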

Try using BeautifulStoneSoup, imported from BeautifulSoup. It might do the trick.

2012-02-23 08:35:12 user850498

The problem is namespaces.

Run the following:

cooking_parsed.getroot().tag

and you will see that the root element's tag carries a namespace:

{http://www.w3.org/2005/Atom}feed

The same is true if you navigate to one of the feed entries.

This means the right XPath in lxml is:

print cooking_parsed.xpath(
    "//a:feed/a:entry", 
    namespaces={ 'a':'http://www.w3.org/2005/Atom' })
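
Rather than hard-coding the URI, the namespace can be read off the root element, as the tag check above suggests. The following is only a sketch under assumptions: the inline document and the name SAMPLE are hypothetical stand-ins for the feed, and it relies on lxml's etree.QName to split a tag of the form {uri}name into its parts.

```python
from lxml import etree

# Hypothetical stand-in for the feed; the real document has many more elements.
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><summary>Only summary</summary></entry>
</feed>"""

root = etree.fromstring(SAMPLE)

# QName splits a tag such as '{uri}feed' into namespace and local name.
# For a root element with no namespace, qname.namespace would be None.
qname = etree.QName(root)
print(qname.namespace)
print(qname.localname)

# Reuse the discovered namespace in XPath; 'a' is an arbitrary prefix.
namespaces = {'a': qname.namespace}
entries = root.xpath('//a:feed/a:entry', namespaces=namespaces)

# findall accepts the '{uri}name' form directly, so no prefix map is needed.
path = '{%s}entry/{%s}summary' % (qname.namespace, qname.namespace)
summaries = root.findall(path)
print([s.text for s in summaries])
```

Checking the root this way before writing any XPath makes it obvious whether prefixes are needed at all.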

2012-02-23 08:50:10

Somehow, I suspect this answer was easier for you than it was for me. ;) Sheepishly upvoted your answer, and feel free to point out any mistakes in mine. – gfortune 2012-02-23 09:14:37
