Images downloaded are blank images, instead of actual images

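For learning purposes, I am trying to download all the post images of a Buzzfeed article. The downloaded images are blank images instead of the actual images.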

Here is my code:

import lxml.html
import string
import random
import requests

url = 'http://www.buzzfeed.com/mjs538/messages-from-creationists-to-people-who-believe-in-evolutio?bftw'
headers = {
    'User-Agent': 'Mozilla/5.0',
    'From': '[email protected]'
}

page = requests.get(url)

tree = lxml.html.fromstring(page.content)

#print(soup.prettify()).encode('ascii', 'ignore')

images = tree.cssselect("div.sub_buzz_content img")

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for x in range(size))

for image in images:
    with open(id_generator() + '.jpg', 'wb') as handle:
        request = requests.get(image.attrib['src'], headers=headers, stream=True)

        for block in request.iter_content(1024):
            if not block:
                break
            handle.write(block)

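The images that get retrieved are all 110 bytes in size, and when I view them they are just empty images. What am I doing wrong in my code that causes this? I don't have to use requests if there is an easier way to do this.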

Source

2014-02-07 ComputerLocus

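Try adding a User-Agent to your request. Many web servers reject requests that have no user agent. When scraping, it is common to leave an email address in the user agent so that the server owner can contact you if they do not approve of your scraping. –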

@SteinarLima Still no luck after adding a user agent. I have updated the OP with the new code. I believe I implemented the user agent correctly? – ComputerLocus

Another note: you shouldn't save these images on your computer. They will make you look stupid. –
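The User-Agent suggestion from the comments can be sketched as follows. Per the follow-up comment, it did not by itself fix the blank images in this thread. This is a minimal Python 3 sketch using the standard library's urllib.request rather than requests, so it runs without third-party packages. The helper name build_request, the example.com URL, and the contact address are all hypothetical. No network call is made; it only shows how the header is attached.

```python
from urllib.request import Request


def build_request(url, contact):
    """Build a request that identifies the crawler and gives a contact point.

    `contact` is a hypothetical address or URL where the site owner can
    reach whoever runs the crawler.
    """
    headers = {
        "User-Agent": "Non-commercial crawler. Contact: {}".format(contact),
    }
    return Request(url, headers=headers)


# Hypothetical URL and contact address, used only for illustration.
req = build_request("http://example.com/image.jpg", "crawler-owner@example.com")

# urllib.request stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing the same headers to every request, including the first page fetch, keeps the crawler consistently identified.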

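If you take a close look at the source of the page you are trying to crawl, you will see that the image URLs are not given in the src attribute of the img tags, but in the rel:bf_image_src attribute.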

Changing image.attrib['src'] to image.attrib['rel:bf_image_src'] should fix your problem.
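To illustrate the attribute lookup described above, here is a minimal Python 3 sketch that prefers rel:bf_image_src and falls back to src when it is absent. It uses the standard library's html.parser instead of lxml so that it runs without extra packages. The class and function names, the sample markup, and the example.com URLs are all assumptions made for illustration, not taken from the real page. Unlike the original selector, it looks at every img tag rather than only those inside div.sub_buzz_content.

```python
from html.parser import HTMLParser


class ImageURLCollector(HTMLParser):
    """Collect image URLs, preferring rel:bf_image_src over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # The real URL is assumed to live in rel:bf_image_src;
        # src is only used when that attribute is missing.
        url = attributes.get("rel:bf_image_src") or attributes.get("src")
        if url:
            self.urls.append(url)


def extract_image_urls(html):
    collector = ImageURLCollector()
    collector.feed(html)
    return collector.urls


# Hypothetical markup imitating a placeholder in src and the real URL elsewhere.
SAMPLE = '''
<div class="sub_buzz_content">
  <img src="http://example.com/blank.gif" rel:bf_image_src="http://example.com/real-1.jpg">
  <img src="http://example.com/plain.jpg">
</div>
'''

print(extract_image_urls(SAMPLE))
# ['http://example.com/real-1.jpg', 'http://example.com/plain.jpg']
```

With lxml, the equivalent lookup would be image.attrib.get('rel:bf_image_src') or image.attrib.get('src').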

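I wasn't able to run your code (it claims cssselect is not installed), but the following code, which uses BeautifulSoup and urllib2, runs smoothly on my computer and downloads all 22 images.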

from itertools import count
from bs4 import BeautifulSoup
import urllib2
from time import sleep


url = 'http://www.buzzfeed.com/mjs538/messages-from-creationists-to-people-who-believe-in-evolutio?bftw'
headers = {
    'User-Agent': 'Non-commercical crawler, Steinar Lima. Contact: https://stackoverflow.com/questions/21616904/images-downloaded-are-blank-images-instead-of-actual-images'
}

r = urllib2.Request(url, headers=headers)
soup = BeautifulSoup(urllib2.urlopen(r))
c = count()  # running counter used for the output file names

for div in soup.find_all('div', id='buzz_sub_buzz'):
    for img in div.find_all('img'):
        # The real image URL is in rel:bf_image_src, not in src
        print img['rel:bf_image_src']
        # The images/ directory must already exist
        with open('images/{}.jpg'.format(next(c)), 'wb') as img_out:
            req = urllib2.Request(img['rel:bf_image_src'], headers=headers)
            img_out.write(urllib2.urlopen(req).read())
            sleep(5)  # pause between downloads to go easy on the server

Source

2014-02-07 01:13:27

I'd like to know what this notation means: 'images/{}.jpg' – ComputerLocus

@Fogest That is using [str.format](http://docs.python.org/2/library/stdtypes.html#str.format). I use 'c' as an ['itertools.count'](http://docs.python.org/2/library/itertools.html#itertools.count), and with 'images/{}.jpg'.format(next(c)) the file names start at '0.jpg' and count upward. –

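Ah okay, that makes sense. So count() is basically equivalent to setting an integer to 0 and then incrementing it each time the loop runs? If so, is there any advantage to using count()? Using incrementing numbers is probably better than using random strings like I did. – ComputerLocus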
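As a rough illustration of the question about count(), the Python 3 sketch below compares three ways of numbering files inside a nested loop like the one in the answer. The containers list is a hypothetical stand-in for the divs and images on the page; nothing is downloaded or written to disk.

```python
from itertools import count

# Hypothetical stand-in for the page: two containers holding image URLs.
containers = [["a.jpg", "b.jpg"], ["c.jpg"]]

# itertools.count: one iterator shared by both loops.
c = count()
names_count = []
for container in containers:
    for _url in container:
        names_count.append("images/{}.jpg".format(next(c)))

# Manual counter: same result, with explicit bookkeeping.
n = 0
names_manual = []
for container in containers:
    for _url in container:
        names_manual.append("images/{}.jpg".format(n))
        n += 1

# enumerate on the inner loop restarts at 0 for each container,
# so file names collide across containers.
names_enumerate = []
for container in containers:
    for i, _url in enumerate(container):
        names_enumerate.append("images/{}.jpg".format(i))

print(names_count)      # ['images/0.jpg', 'images/1.jpg', 'images/2.jpg']
print(names_enumerate)  # ['images/0.jpg', 'images/1.jpg', 'images/0.jpg']
```

So count() behaves like a manual counter here; its main convenience is that the numbering carries across nested loops without a separate variable to increment.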
