2013-04-20

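Error when downloading images from Wikipedia with a Python script

I am trying to download all the images from a particular Wikipedia page. Here is the code snippet: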

from bs4 import BeautifulSoup as bs 
import urllib2 
import urlparse 
from urllib import urlretrieve 

site="http://en.wikipedia.org/wiki/Pune" 
hdr= {'User-Agent': 'Mozilla/5.0'} 
outpath="" 
req = urllib2.Request(site,headers=hdr) 
page = urllib2.urlopen(req) 
soup =bs(page) 
tag_image=soup.findAll("img") 
for image in tag_image: 
     print "Image: %(src)s" % image 
     urlretrieve(image["src"], "/home/mayank/Desktop/test") 

After running the program, I get the following error and stack trace:

Image: //upload.wikimedia.org/wikipedia/commons/thumb/0/04/Pune_Montage.JPG/250px-Pune_Montage.JPG 
Traceback (most recent call last): 
    File "download_images.py", line 15, in <module> 
    urlretrieve(image["src"], "/home/mayank/Desktop/test") 
    File "/usr/lib/python2.7/urllib.py", line 93, in urlretrieve 
    return _urlopener.retrieve(url, filename, reporthook, data) 
    File "/usr/lib/python2.7/urllib.py", line 239, in retrieve 
    fp = self.open(url, data) 
    File "/usr/lib/python2.7/urllib.py", line 207, in open 
    return getattr(self, name)(url) 
    File "/usr/lib/python2.7/urllib.py", line 460, in open_file 
    return self.open_ftp(url) 
    File "/usr/lib/python2.7/urllib.py", line 543, in open_ftp 
    ftpwrapper(user, passwd, host, port, dirs) 
    File "/usr/lib/python2.7/urllib.py", line 864, in __init__ 
    self.init() 
    File "/usr/lib/python2.7/urllib.py", line 870, in init 
    self.ftp.connect(self.host, self.port, self.timeout) 
    File "/usr/lib/python2.7/ftplib.py", line 132, in connect 
    self.sock = socket.create_connection((self.host, self.port), self.timeout) 
    File "/usr/lib/python2.7/socket.py", line 571, in create_connection 
    raise err 
IOError: [Errno ftp error] [Errno 111] Connection refused 

Can anyone help me understand what is causing this error?

Answer

1

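// is shorthand for "use the current protocol". It looks like Wikipedia uses this shorthand for its image URLs, so you have to specify HTTP explicitly instead of FTP (which Python is assuming for some reason):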

for image in tag_image: 
    src = 'http:' + image 
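
Judging from the traceback, the scheme-less URL goes through open_file and is then handed to open_ftp, which would explain the FTP connection attempt. As an alternative to prepending 'http:' by hand, here is a minimal sketch that resolves the src attribute against the page URL with urljoin. It assumes Python 3, where urljoin lives in urllib.parse (in the Python 2 code above it would come from the urlparse module). The names PAGE_URL and absolute_image_url are made up for illustration.

```python
from urllib.parse import urljoin

# Page URL taken from the question; it supplies the scheme for "//host/path" references.
PAGE_URL = "http://en.wikipedia.org/wiki/Pune"


def absolute_image_url(src, base=PAGE_URL):
    """Resolve an <img> src value (possibly protocol-relative) against the page URL.

    Hypothetical helper: urljoin copies the scheme from `base` when `src`
    starts with "//", and leaves already-absolute URLs unchanged.
    """
    return urljoin(base, src)


# The src value printed in the question's output.
src = "//upload.wikimedia.org/wikipedia/commons/thumb/0/04/Pune_Montage.JPG/250px-Pune_Montage.JPG"
print(absolute_image_url(src))
```

This also covers src values that are relative to the site root or already absolute, which plain string concatenation would not.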

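Thanks @Blender, this solved my problem. I just want to add one thing so that anyone who comes across this question is not misled: prepending 'http:' to image won't work as written in the answer, because image is the whole tag rather than its src attribute. Instead, I did this: urlretrieve('http:' + image["src"], outpath) – 2013-04-20 09:13:45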
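
One more note for anyone reusing the snippet: the loop in the question passes the same destination path to every urlretrieve call, so each image would overwrite the previous one. Below is a minimal sketch, assuming Python 3, of deriving a separate local file name for each image from its URL. The helper image_target and the out_dir argument are hypothetical names, not part of the original code, and the actual download call is left commented out.

```python
import os
from urllib.parse import unquote, urljoin, urlsplit


def image_target(src, page_url, out_dir):
    """Return (absolute_url, local_path) for one <img> src value.

    Hypothetical helper: the file name is the last path segment of the
    resolved URL, with a fallback name if that segment is empty.
    """
    url = urljoin(page_url, src)  # fills in the scheme for "//host/path"
    name = os.path.basename(unquote(urlsplit(url).path)) or "image"
    return url, os.path.join(out_dir, name)


url, path = image_target(
    "//upload.wikimedia.org/wikipedia/commons/thumb/0/04/Pune_Montage.JPG/250px-Pune_Montage.JPG",
    "http://en.wikipedia.org/wiki/Pune",
    "images",
)
print(url)
print(path)
# from urllib.request import urlretrieve
# urlretrieve(url, path)  # network call, intentionally not executed here
```

Two images that share the same last path segment would still collide, so a real script might need to add a counter or a hash to the name.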