
python urllib2 multiple downloads

How can I make the script below download several links at once instead of one at a time with urllib2?

Python:

from BeautifulSoup import BeautifulSoup 
import lxml.html as html 
import urlparse 
import os, sys 
import urllib2 
import re 

print ("downloading and parsing Bibles...") 
root = html.parse(open('links.html')) 
for link in root.findall('//a'): 
    url = link.get('href') 
    name = urlparse.urlparse(url).path.split('/')[-1] 
    dirname = urlparse.urlparse(url).path.split('.')[-1] 
    f = urllib2.urlopen(url) 
    s = f.read() 
    if not os.path.isdir(dirname): 
        os.mkdir(dirname) 
    soup = BeautifulSoup(s) 
    articleTag = soup.html.body.article 
    converted = str(articleTag) 
    full_path = os.path.join(dirname, name) 
    open(full_path, 'w').write(converted) 
    print(name) 
print("DOWNLOADS COMPLETE!") 

links.html

<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.5.nmv-fas">http://www.youversion.com/bible/gen.5.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.6.nmv-fas">http://www.youversion.com/bible/gen.6.nmv-fas</a> 

What have you tried? [Here's a starting point](http://docs.python.org/library/threading.html#thread-objects). And [a similar question](http://stackoverflow.com/questions/4131069/need-some-assistance-with-python-threading-queue). – AdamKG 2012-04-26 16:56:14


I realize you asked about urllib, but you might want to look at Scrapy. It is very mature and asynchronous, and it lets you make multiple requests with very little effort. – dm03514 2012-04-26 17:27:34
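
(For illustration, a minimal sketch of that Scrapy approach. It assumes a recent Scrapy release rather than the 2012-era API, and the spider name, selector, and file naming are made up for this example, not part of the comment. Scrapy schedules all of start_urls concurrently and calls parse() as each response arrives; run it with "scrapy runspider bibles_spider.py".)

import scrapy 

class BibleSpider(scrapy.Spider): 
    name = "bibles"  # illustrative spider name 
    start_urls = [ 
        "http://www.youversion.com/bible/gen.%d.nmv-fas" % i 
        for i in range(1, 7) 
    ] 

    def parse(self, response): 
        # called once per fetched page, as downloads complete concurrently 
        name = response.url.split('/')[-1] 
        article = response.css('article').get()  # the <article> markup, if present 
        if article: 
            with open(name, 'w') as f: 
                f.write(article) 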

Answer


Blainer, try threading.

Here's a good practical example:

http://www.ibm.com/developerworks/aix/library/au-threadingpython/

And the Python standard library reference as well:

http://docs.python.org/library/threading.html

If you look at the practical example, it actually includes a sample of exactly this: a threaded version of concurrent urllib2 downloads. I've gone ahead and taken you a few steps further into the process; to finish solving the problem, you will still have to do the part that further parses your HTML yourself:

#!/usr/bin/env python 

import Queue 
import threading 
import urllib2 
import time 
import htmllib, formatter 

class LinksExtractor(htmllib.HTMLParser): 
    # derive a new HTML parser 

    def __init__(self, formatter): 
        # class constructor; call the base class constructor 
        htmllib.HTMLParser.__init__(self, formatter) 
        # create an empty list for storing hyperlinks 
        self.links = [] 

    def start_a(self, attrs): 
        # override the handler for <a ...>...</a> tags 
        # and process the attributes 
        if len(attrs) > 0: 
            for attr in attrs: 
                if attr[0] == "href": 
                    # ignore all non-href attributes 
                    self.links.append(attr[1])  # save the link info in the list 

    def get_links(self): 
        # return the list of extracted links 
        return self.links 

format = formatter.NullFormatter() 
htmlparser = LinksExtractor(format) 

data = open("links.html") 
htmlparser.feed(data.read()) 
htmlparser.close() 

hosts = htmlparser.links 

queue = Queue.Queue() 

class ThreadUrl(threading.Thread): 
    """Threaded URL grab""" 
    def __init__(self, queue): 
        threading.Thread.__init__(self) 
        self.queue = queue 

    def run(self): 
        while True: 
            # grab a host from the queue 
            host = self.queue.get() 

            #################################### 
            ############FIX THIS PART########### 
            #VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV# 

            url = urllib2.urlopen(host) 
            morehtml = url.read()  # you're on your own with this part 

            # signal to the queue that the job is done 
            self.queue.task_done() 

start = time.time() 
def main(): 
    # spawn a pool of threads and pass them the queue instance 
    for i in range(5): 
        t = ThreadUrl(queue) 
        t.setDaemon(True) 
        t.start() 

    # populate the queue with data 
    for host in hosts: 
        queue.put(host) 

    # wait on the queue until everything has been processed 
    queue.join() 

main() 
print "Elapsed Time: %s" % (time.time() - start) 

I took a look at that, but my script still only grabs one url at a time from links.html... how can I make the variable "url" grab all of the links at once? – Blainer 2012-04-26 17:03:32


Updated the answer here – dc5553 2012-04-26 17:12:06


Parse the html higher up in the script and create a list, then release the threads when it's time to download. (I was going to say unleash hell, but you're downloading Bibles! haha) – dc5553 2012-04-26 17:18:39