2014-01-22 40 views

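Downloading a webcomic saves blank files

I have a script for downloading the Questionable Content webcomic. It looks like it runs fine, but the files it downloads are blank and only a few KB in size.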

#import Web, Reg. Exp, and Operating System libraries 
import urllib, re, os 

#RegExp for the EndNum variable 
RegExp = re.compile('.*<img src="http://www.questionablecontent.net/comics.*') 

#Check the main QC page 
site = urllib.urlopen("http://questionablecontent.net/") 
contentLine = None 

#For each line in the homepage's source... 
for line in site.readlines():
    #Break when you find the variable information
    if RegExp.search(line):
        contentLine = line
        break

#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine: 
    contentLine = contentLine.split('/') 
    contentLine = contentLine[4].split('.') 
    EndNum = int(contentLine[0]) 
else: 
    EndNum = 2622 

#First and Last comics user wishes to download 
StartNum = 1 
#EndNum = 2622 

#Full path of destination folder needs to pre-exist 
destinationFolder = r"D:\Downloads\Comics\Questionable Content"

#XRange creates an iterator to go over the comics 
for i in xrange(StartNum, EndNum+1): 

    #IF you already have the comic, skip downloading it 
    if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue

    #Printing User-Friendly Messages 
    print "Comic %d Found. Downloading..." % i 

    source = "http://www.questionablecontent.net/comics/"+str(i)+".png" 

    #Save image from Questionable Content to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png")) 

#Graceful program termination (the count includes comics that were skipped)
print str(EndNum-StartNum+1) + " Comics Processed"

Why does it keep downloading blank files? Is there a fix?

Answer

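The problem here is that the server won't serve you the image if your user agent isn't set. Below is sample code for Python 2.7 that should give you an idea of how to make your script work.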

import urllib2
import time

first = 1
last = 2622

for i in range(first, last+1):
    time.sleep(5) # Be nice to the server! And avoid being blocked.
    for ext in ['png', 'gif']:
        try:
            req = urllib2.Request('http://www.questionablecontent.net/comics/{}.{}'.format(i, ext))
            # Without a user agent the server does not serve the image
            req.add_header('user-agent', 'Mozilla/5.0')
            data = urllib2.urlopen(req).read()
        except urllib2.HTTPError:
            # No image with this extension; try the next one
            continue
        # Make sure that the img dir exists! If not, the script will throw an
        # IOError. The file is opened only after a successful fetch, so a
        # failed request does not leave an empty file behind.
        with open('img/{}.{}'.format(i, ext), 'wb') as ifile:
            ifile.write(data)
        break
    else:
        print 'Could not find image {}'.format(i)
        continue
    print 'Downloaded image {}'.format(i)

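You might want to change the loop to something more like your own (checking whether the image has already been downloaded, etc.). The script will try to download all images from <first>.<ext> to <last>.<ext>, where <ext> is either gif or png.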
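
For anyone on Python 3, here is a rough sketch of how the "skip what you already have" check could be combined with the user-agent header. It is not the code from the question or from the script above: it uses `urllib.request` instead of `urllib2`, and the names `build_request`, `already_downloaded`, `looks_like_image`, `download_comic`, `download_range` and the `fetch` parameter are made up for illustration. The signature check is an extra assumption on my part: it simply refuses to save anything that does not start like a PNG or GIF file, which should catch the case where the server sends back something other than the image.

```python
import os
import time
import urllib.error
import urllib.request

# URL pattern taken from the scripts in this thread
BASE_URL = "http://www.questionablecontent.net/comics/{}.{}"
EXTENSIONS = ("png", "gif")


def build_request(number, ext):
    """Build a request for one comic that carries a User-Agent header."""
    req = urllib.request.Request(BASE_URL.format(number, ext))
    req.add_header("User-Agent", "Mozilla/5.0")
    return req


def already_downloaded(folder, number):
    """Return True if the comic is in folder under any known extension."""
    return any(
        os.path.exists(os.path.join(folder, "{}.{}".format(number, ext)))
        for ext in EXTENSIONS
    )


def looks_like_image(data):
    """Check for the PNG or GIF signature at the start of the data."""
    is_png = data.startswith(b"\x89PNG\r\n\x1a\n")
    is_gif = data[:6] in (b"GIF87a", b"GIF89a")
    return is_png or is_gif


def download_comic(folder, number, fetch=None):
    """Try each extension and save the first response that is an image.

    Returns the saved path, or None if nothing usable was found.
    fetch is a hypothetical hook (request -> bytes) so the function can
    be tried without touching the network.
    """
    if fetch is None:
        fetch = lambda req: urllib.request.urlopen(req).read()
    for ext in EXTENSIONS:
        try:
            data = fetch(build_request(number, ext))
        except urllib.error.HTTPError:
            continue  # no file with this extension
        if not looks_like_image(data):
            continue  # got a response, but not an image
        path = os.path.join(folder, "{}.{}".format(number, ext))
        with open(path, "wb") as image_file:
            image_file.write(data)
        return path
    return None


def download_range(folder, first, last, delay=5, fetch=None):
    """Download comics first..last into folder, skipping existing ones."""
    for number in range(first, last + 1):
        if already_downloaded(folder, number):
            print("Skipping comic {}".format(number))
            continue
        path = download_comic(folder, number, fetch)
        if path:
            print("Downloaded {}".format(path))
        else:
            print("Could not find image {}".format(number))
        time.sleep(delay)  # be nice to the server
```

Something like `download_range("img", 1, 2622)` would then do the whole run, provided the `img` folder already exists. The file is only opened after a successful fetch, so a failed request does not leave an empty file that would later be mistaken for a finished download.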

Thanks, I gave it a whirl and it seems to work fine. Thanks again! – Ultrin