I want to parse this XML (http://www.reddit.com/r/videos/top/.rss) and am having trouble doing so. I'm trying to save the YouTube link from each item, but I'm running into trouble because of the 'channel' child node. How do I get down to that level so that I can then loop over the items? How do I parse an XML feed with Python?

import urllib2
from xml.etree import ElementTree as etree

#reddit parse
reddit_file = urllib2.urlopen('http://www.reddit.com/r/videos/top/.rss')
#convert to string: 
reddit_data = reddit_file.read() 
#close file because we dont need it anymore: 
reddit_file.close() 

#entire feed 
reddit_root = etree.fromstring(reddit_data) 
channel = reddit_root.findall('{http://purl.org/dc/elements/1.1/}channel') 
print channel 

reddit_feed=[] 
for entry in channel: 
    #get description, url, and thumbnail 
    desc = #not sure how to get this 

    reddit_feed.append([desc]) 

Answers

You can try findall('channel/item'):

import urllib2 
from xml.etree import ElementTree as etree 
#reddit parse 
reddit_file = urllib2.urlopen('http://www.reddit.com/r/videos/top/.rss') 
#convert to string: 
reddit_data = reddit_file.read() 
print reddit_data 
#close file because we dont need it anymore: 
reddit_file.close() 

#entire feed 
reddit_root = etree.fromstring(reddit_data) 
items = reddit_root.findall('channel/item')
print items

reddit_feed = []
for entry in items:
    #get description, url, and thumbnail 
    desc = entry.findtext('description') 
    reddit_feed.append([desc]) 
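
The question also asked for the url and the thumbnail, not just the description. Below is a minimal sketch extending the loop above; it assumes the thumbnail element lives in the standard Media RSS namespace (xmlns:media="http://search.yahoo.com/mrss/"), which is not confirmed by the code above, so adjust MEDIA_NS if the feed declares something else.

#sketch: also pull link and thumbnail for each item
#MEDIA_NS is an assumption (standard Media RSS namespace), adjust if the feed differs
MEDIA_NS = '{http://search.yahoo.com/mrss/}'

reddit_feed = []
for entry in reddit_root.findall('channel/item'):
    desc = entry.findtext('description')
    link = entry.findtext('link')
    thumb_el = entry.find(MEDIA_NS + 'thumbnail')
    thumb = thumb_el.get('url') if thumb_el is not None else None
    reddit_feed.append([desc, link, thumb])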

Here's a version I wrote for you using XPath expressions (tested successfully):

from lxml import etree 
import urllib2 

headers = { 'User-Agent' : 'Mozilla/5.0' } 
req = urllib2.Request('http://www.reddit.com/r/videos/top/.rss', None, headers) 
reddit_file = urllib2.urlopen(req).read() 

reddit = etree.fromstring(reddit_file) 

for item in reddit.xpath('/rss/channel/item'): 
    print "title =", item.xpath("./title/text()")[0] 
    print "description =", item.xpath("./description/text()")[0] 
    print "thumbnail =", item.xpath("./*[local-name()='thumbnail']/@url")[0] 
    print "link =", item.xpath("./link/text()")[0] 
    print "-" * 100