Bulk-download text and images from a URL with Python / urllib / BeautifulSoup?

I have been browsing several posts here, but I can't get my head around batch-downloading images and text from a given URL with Python.

import urllib, urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
import os, sys

def getAllImages(url):
    query = urllib2.Request(url)
    user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
    query.add_header("User-Agent", user_agent)

    page = BeautifulSoup(urllib2.urlopen(query))
    for div in page.findAll("div", {"class": "thumbnail"}):
        print "found thumbnail"
        for img in div.findAll("img"):
            print "found image"
            src = img["src"]
            if src:
                src = absolutize(src, pageurl)
                f = open(src,'wb')
                f.write(urllib.urlopen(src).read())
                f.close()
        for h5 in div.findAll("h5"):
            print "found Headline"
            value = (h5.contents[0])
            print >> headlines.txt, value


def main():
    getAllImages("http://www.nytimes.com/")

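Above is the somewhat updated code. What happens is: nothing. The code does not find any div with the class thumbnail, so obviously none of the print statements produce output. So am I probably missing some pointers to the right divs containing the images and headlines?

Thanks a lot!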

2011-10-27 birgit

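You might get more detailed answers if you explain the specific problems you run into when trying to download the files. Have you read questions like http://stackoverflow.com/questions/3042757/downloading-a-picture-via-urllib-and-python, which include code for downloading images in their answers? – Martey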

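Your operating system does not know how to write to the file path you are passing in src. Make sure the name you use to save the file to disk is one the operating system can actually use: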

import os

src = "abc.com/alpha/beta/charlie.jpg"
with open(src, "wb") as f:  # IOError - cannot open file abc.com/alpha/beta/charlie.jpg
    pass

src = "alpha/beta/charlie.jpg"
os.makedirs(os.path.dirname(src))  # create alpha/beta/ before opening the file
with open(src, "wb") as f:
    pass  # Golden - write file here

Do that, and everything should start working.
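As a rough sketch of that idea, the image URL can be split so that only its host and path end up in the file name. This is written for Python 3 (the question uses Python 2's urllib2 and urlparse), and the helper names `url_to_relative_path` and `save_bytes` as well as the `downloads` directory are made up for illustration. The download itself is left out; `data` stands for whatever bytes were read from the URL.

```python
import os
from urllib.parse import urlsplit


def url_to_relative_path(url):
    """Turn an image URL into a relative file path without the scheme."""
    parts = urlsplit(url)
    # Keep the host plus the non-empty path segments; the "http://" part is dropped.
    segments = [parts.netloc] + [s for s in parts.path.split("/") if s]
    return os.path.join(*segments)


def save_bytes(data, url, root="downloads"):
    """Write data under root, creating parent directories as needed."""
    target = os.path.join(root, url_to_relative_path(url))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as f:
        f.write(data)
    return target
```

With the URL from the traceback in the comments, `url_to_relative_path` would give a path starting with `i1.nyt.com` rather than `http://`, which `open` can handle once the directories exist. The sketch does not guard against `..` segments in the URL path.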

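A few extra thoughts:

Make sure you normalize the save path (e.g. os.path.join(some_root_dir, relative_file_path)), since it depends on the src values; otherwise you will be writing images all over your hard drive.
Unless you are running tests of some sort, advertise that you are a bot in your user_agent string and honor robots.txt files (or alternatively, provide some kind of contact information so people can ask you to stop if they need to).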
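Both points can be sketched in Python 3 with the standard library only. The names `safe_join`, `allowed_by_robots`, and `USER_AGENT` are hypothetical, the contact address is a placeholder, and the robots.txt text is assumed to have been fetched already, so nothing here touches the network.

```python
import os
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity; replace the contact address with a real one.
USER_AGENT = "example-image-bot/0.1 (+mailto:you@example.com)"


def safe_join(root, relative_path):
    """Join relative_path onto root and refuse results that escape root."""
    root = os.path.abspath(root)
    target = os.path.abspath(os.path.join(root, relative_path))
    if os.path.commonpath([root, target]) != root:
        raise ValueError("path escapes download root: %r" % relative_path)
    return target


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Check url against the text of a robots.txt file that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

The same `USER_AGENT` string could be sent with the request in place of the browser string used in the question, so that site owners can tell who is crawling and how to reach them.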

2011-10-27 16:54:32

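Thank you very much for your quick reply. Unfortunately, after changing this line there is still no result. Running the code just results in nothing.... :( – birgit

Traceback (most recent call last):
  File "test.py", line 40, in <module>
    main()
  File "test.py", line 35, in main
    call = getAllImages("http://www.nytimes.com/")
  File "test.py", line 21, in getAllImages
    f = open(src,'wb')
IOError: [Errno 2] No such file or directory: u'http://i1.nyt.com/images/2011/10/27/us/cain1/cain1-thumbStandard.jpg'
..... Is this the point where the normalizing part comes into play!? – birgit

Did you actually strip off the http://? – joeforker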
