2013-07-16 68 views
3

Newbie here. I wrote a simple script using urllib2 that walks through Billboard.com and scrapes the top song and artist for every week from 1958 to 2013. The problem is that it is very slow; it takes a couple of hours to finish. How can I scrape more efficiently with urllib2?

I would like to know where the bottleneck is, and whether there is a way to use urllib2 more efficiently or whether I need a more sophisticated tool.

import re
import urllib2

array = []
url = 'http://www.billboard.com/charts/1958-08-09/hot-100'
date = ""
while date != '2013-07-13':
    response = urllib2.urlopen(url)
    htmlText = response.read()
    # the chart date is embedded in the URL itself
    date = re.findall(r'\d\d\d\d-\d\d-\d\d', url)[0]
    # the number-one song is the first <h1> on the page
    song = re.findall(r'<h1>.*</h1>', htmlText)[0]
    song = song[4:-5]
    # the matching artist is in the second /artist... link
    artist = re.findall(r'/artist.*</a>', htmlText)[1]
    artist = re.findall(r'>.*<', artist)[0]
    artist = artist[1:-1]
    # follow the "Next" link to the following week's chart
    nextWeek = re.findall(r'href.*>Next', htmlText)[0]
    nextWeek = nextWeek[5:-5]
    array.append([date, song, artist])
    url = 'http://www.billboard.com' + nextWeek
print array
+0

Scrapy (https://scrapy.readthedocs.org/) would do a much better job; it is certainly the tool for this task. Let me know if you want to and I'll write a sample spider for you. – alecxe

+0

Improvements would include not using urllib2, not using regular expressions to parse HTML, and using multiple threads to perform your I/O. – roippi

+0

I sincerely doubt that 'urllib2' has anything to do with any efficiency problem. All it does is issue the request and pull down the response; 99.99% of that time is network time, and no other method will improve that. The problem is that (a) your parsing code may be slow, (b) you may be doing lots of repeated or unnecessary downloads, (c) you need to download in parallel (which you can do with 'urllib2'), (d) you need a faster network connection, or (e) billboard.com is throttling you. – abarnert

Answers

2

Here is a solution using Scrapy. Take a look at the overview and you will see that it is the tool designed for exactly this kind of task:

  • it is fast (based on Twisted)
  • easy to use and understand
  • built-in extraction mechanism based on XPath (you can use bs or lxml too, though)
  • built-in support for pipelining the extracted items into a database, XML, JSON, whatever
  • and many more features

Here is a working spider that extracts everything you asked for (it took 15 minutes on my rather old laptop):

import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        # generate one start URL per week, from 1958-08-09 up to 2013
        date = datetime.date(year=1958, month=8, day=9)

        self.start_urls = []
        while True:
            if date.year >= 2013:
                break

            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except IndexError:
                # skip chart entries that don't have the expected markup
                continue

            yield item

Save it into billboard.py and run it via scrapy runspider billboard.py -o output.json. Then, in output.json, you will see:

... 
{"date": "September 20, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"} 
{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"} 
{"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"} 
{"date": "September 20, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"} 
{"date": "September 20, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"} 
{"date": "September 20, 1958", "artist": "Poni-Tails", "song": "Born Too Late"} 
{"date": "September 20, 1958", "artist": "The Olympics", "song": "Western Movies"} 
{"date": "September 20, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"} 
{"date": "September 20, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"} 
{"date": "September 27, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"} 
{"date": "September 27, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"} 
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"} 
{"date": "September 27, 1958", "artist": "The Elegants", "song": "Little Star"} 
{"date": "September 27, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"} 
{"date": "September 27, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"} 
{"date": "September 27, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"} 
... 

Also, take a look at grequests as an alternative tool.
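For reference, here is a rough sketch of how the same weekly URLs could be fetched concurrently with grequests; parse_chart is a hypothetical placeholder for whatever HTML parsing you prefer, not part of the library:

import datetime
import grequests

BASE_URL = "http://www.billboard.com/charts/%s/hot-100"

# build one chart URL per week, 1958-08-09 through the end of 2012
urls = []
date = datetime.date(1958, 8, 9)
while date.year < 2013:
    urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
    date += datetime.timedelta(days=7)

# issue at most 10 requests at a time; grequests.map returns the
# responses in order (None for any request that failed)
pending = (grequests.get(url) for url in urls)
for response in grequests.map(pending, size=10):
    if response is not None and response.status_code == 200:
        parse_chart(response.url, response.text)  # hypothetical parser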

Hope that helps.

+0

@Ben321 consider accepting the answer if it deserves it, thanks. – alecxe

2

Your bottleneck is almost certainly fetching the data from the website. Each network request has latency, which blocks everything else from happening in the meantime. You should consider splitting the requests across multiple threads so that several requests can be in flight at once. Basically, your performance is I/O bound, not CPU bound.

Here is a simple solution built from scratch so you can see how a crawler typically works. In the long run, using something like Scrapy is probably best, but I find it always helps to start with something simple and explicit.

import threading
import Queue
import time
import datetime
import urllib2
import re

class Crawler(threading.Thread):
    def __init__(self, thread_id, queue):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.queue = queue

        # let's use threading events to tell the thread when to exit
        self.stop_request = threading.Event()

    # this is the function which will run when the thread is started
    def run(self):
        print 'Hello from thread %d! Starting crawling...' % self.thread_id

        while not self.stop_request.isSet():
            # main crawl loop

            try:
                # attempt to get a url target from the queue
                url = self.queue.get_nowait()
            except Queue.Empty:
                # if there's nothing on the queue, sleep and continue
                time.sleep(0.01)
                continue

            # we got a url, so let's scrape it!
            response = urllib2.urlopen(url)  # might want to consider adding a timeout here
            htmlText = response.read()

            # scraping with regex blows.
            # consider using xpath after parsing the html using lxml.html module
            song = re.findall('<h1>.*</h1>', htmlText)[0]
            song = song[4:-5]
            artist = re.findall('/artist.*</a>', htmlText)[1]
            artist = re.findall('>.*<', artist)[0]
            artist = artist[1:-1]

            print 'thread %d found artist: %s' % (self.thread_id, artist)

    # we're overriding the default join function for the thread so
    # that we can make sure it stops
    def join(self, timeout=None):
        self.stop_request.set()
        super(Crawler, self).join(timeout)

if __name__ == '__main__':
    # how many threads do you want? more is faster, but too many
    # might get your IP blocked or even bring down the site (DoS attack)
    n_threads = 10

    # use a standard queue object (thread-safe) for communication
    queue = Queue.Queue()

    # create our threads
    threads = []
    for i in range(n_threads):
        threads.append(Crawler(i, queue))

    # generate the urls and fill the queue
    url_template = 'http://www.billboard.com/charts/%s/hot-100'
    start_date = datetime.datetime(year=1958, month=8, day=9)
    end_date = datetime.datetime(year=1959, month=9, day=5)
    delta = datetime.timedelta(weeks=1)

    week = 0
    date = start_date + delta*week
    while date <= end_date:
        url = url_template % date.strftime('%Y-%m-%d')
        queue.put(url)
        week += 1
        date = start_date + delta*week

    # start crawling!
    for t in threads:
        t.start()

    # wait until the queue is empty
    while not queue.empty():
        time.sleep(0.01)

    # kill the threads
    for t in threads:
        t.join()
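As the comment inside run() suggests, a more robust alternative to the regexes is to parse the page with lxml.html and query it with XPath. Here is a minimal sketch of such a parser, reusing the XPath expressions from the Scrapy answer above and assuming the same page layout:

import lxml.html

def parse_chart(htmlText):
    # parse the document once, then pull fields out with XPath
    doc = lxml.html.fromstring(htmlText)
    date = doc.xpath('//span[@class="chart_date"]/text()')[0]

    entries = []
    for article in doc.xpath('//div[@class="listing chart_listing"]/article'):
        song = article.xpath('.//header/h1/text()')
        artist = article.xpath('.//header/p[@class="chart_info"]/a/text()')
        if song and artist:
            entries.append((date, song[0].strip(), artist[0].strip()))
    return entries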
+0

It might help to explain the difference between using concurrency to increase CPU performance (parallelism) and using concurrency to increase data throughput or responsiveness (the kind of concurrency you are doing here), so the OP gets a deeper understanding of why this works. – Wes

+0

Very helpful and well explained, Brendan. Thanks! –

+0

Excellent answer @BrendanWood. Queue-based concurrency is definitely the way to do this. With 50 concurrent threads (testing on my home computer/network, that's about the limit), it takes roughly 10 minutes. Awesome! – w00tw00t111

-1

Option 1: Use threads to make "simultaneous" requests to the server.

Option 2: Distribute the work across multiple machines; the best solution for that is to use Storm.
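For Option 1, here is a minimal sketch using a thread pool from multiprocessing.dummy (a thread-backed version of the multiprocessing Pool API); error handling is omitted and parse_chart is a hypothetical placeholder for your parsing code:

import urllib2
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but uses threads

def fetch(url):
    # each worker thread performs one blocking download
    return url, urllib2.urlopen(url, timeout=30).read()

urls = ['http://www.billboard.com/charts/1958-08-09/hot-100',
        'http://www.billboard.com/charts/1958-08-16/hot-100']  # ...generate the full weekly list

pool = Pool(10)  # up to 10 simultaneous requests
for url, htmlText in pool.map(fetch, urls):
    parse_chart(url, htmlText)  # hypothetical parser
pool.close()
pool.join()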

+0

What about async IO? – dpn