Python, BeautifulSoup or LXML - Parsing image URLs from HTML using CSS tags

I've searched high and low for a decent explanation of how BeautifulSoup or LXML work. Granted, their documentation is great, but for someone like me, a Python/programming novice, it's hard to decipher what I'm looking for.

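Anyway, as my first project, I'm using Python to parse an RSS feed for post links, which I've already accomplished with Feedparser. My plan is to then scrape each post's images. But for the life of me, I can't figure out how to get BeautifulSoup or LXML to do what I want! I've spent hours reading the documentation and Googling to no avail, so here I am. Below is one line from the Big Picture (the site I'm scraping).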

<div class="bpBoth"><a name="photo2"></a><img src="http://inapcache.boston.com/universal/site_graphics/blogs/bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" style="height:1393px;width:990px" /><br/><div onclick="this.style.display='none'" class="noimghide" style="margin-top:-1393px;height:1393px;width:990px"></div><div class="bpCaption"><div class="photoNum"><a href="#photo2">2</a></div>In this photo released by China's Xinhua news agency, spectators watch an apartment building on fire in the downtown area of Shanghai on Monday Nov. 15, 2010. (AP Photo/Xinhua) <a href="#photo2">#</a><div class="cf"></div></div></div>

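So, from my understanding of the documentation, I should be able to pass the following:

soup.find("a", { "class" : "bpImage" })

to find all instances with that CSS class. Well, it doesn't return anything. I'm sure I'm overlooking something trivial, so I greatly appreciate your patience.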

Thank you very much for your responses!

For future Googlers, I'll include my feedparser code:

#! /usr/bin/python 

# RSS Feed Parser for the Big Picture Blog 

# Import applicable libraries 

import feedparser 

#Import Feed for Parsing 
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index") 

# Print feed name 
print d['feed']['title'] 

# Determine number of posts and set range maximum 
posts = len(d['entries']) 

# Collect Post URLs 
pointer = 0 
while pointer < posts: 
    e = d.entries[pointer] 
    print e.link 
    pointer = pointer + 1


2010-11-23 tylerdavis

Using lxml, you might do something like this:

import feedparser 
import lxml.html as lh 
import urllib2 

#Import Feed for Parsing 
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index") 

# Print feed name 
print d['feed']['title'] 

# Determine number of posts and set range maximum 
posts = len(d['entries']) 

# Collect Post URLs 
for post in d['entries']: 
    link=post['link'] 
    print('Parsing {0}'.format(link)) 
    doc=lh.parse(urllib2.urlopen(link)) 
    imgs=doc.xpath('//img[@class="bpImage"]') 
    for img in imgs: 
        print(img.attrib['src'])


2010-11-23 17:22:38 unutbu

This is perfect. Thank you very much. – tylerdavis 2010-11-23 17:28:12

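The code you posted looks for all a elements with the bpImage class. But in your example the bpImage class is on the img element, not the a. You just need to do: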

soup.find("img", { "class" : "bpImage" })


2010-11-23 17:05:01

Haha. Of course. That returns the URL along with the tag, though. Is there any way to strip that down to just the URL? – tylerdavis 2010-11-23 17:10:17
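To get only the URL, one option is to read the tag's src attribute, since BeautifulSoup tags support dictionary-style access to their attributes. The following is a minimal sketch, not taken from the thread: it assumes the bs4 package (Beautiful Soup 4) with Python's built-in html.parser, and the html string is a shortened copy of the markup from the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup quoted in the question (assumption: the real
# page contains many such blocks, one per photo).
html = '''
<div class="bpBoth"><a name="photo2"></a>
<img src="http://inapcache.boston.com/universal/site_graphics/blogs/bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" style="height:1393px;width:990px" />
<div class="bpCaption"><a href="#photo2">#</a></div></div>
'''

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; indexing a tag like a dict reads
# one of its attributes, here the image URL stored in src.
image_urls = [img["src"] for img in soup.find_all("img", {"class": "bpImage"})]

for url in image_urls:
    print(url)
```

Using find_all rather than find collects every bpImage on the page instead of only the first match.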

Searching for tags with pyparsing is fairly intuitive:

from pyparsing import makeHTMLTags, withAttribute 

imgTag,notused = makeHTMLTags('img') 

# only retrieve <img> tags with class='bpImage' 
imgTag.setParseAction(withAttribute(**{'class':'bpImage'})) 

for img in imgTag.searchString(html): 
    print img.src


2010-11-23 18:54:34 PaulMcG
