2013-07-20

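Scrapy script errors

Scraping newbie here. I'm using Scrapy to pull a large amount of data from a single site. When I run the script it works fine for a few minutes, but then it slows down, more or less stops, and keeps throwing the following two errors for the different URLs it is trying to scrape: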

2013-07-20 14:15:17-0700 [billboard_spider] DEBUG: Retrying <GET http://www.billboard.com/charts/1981-01-17/hot-100> (failed 1 times): Getting http://www.billboard.com/charts/1981-01-17/hot-100 took longer than 180 seconds. 

2013-07-20 14:16:56-0700 [billboard_spider] DEBUG: Crawled (502) <GET http://www.billboard.com/charts/1981-01-17/hot-100> (referer: None) 

The errors above pile up for different URLs, and I'm not sure what is causing them...

Here's the script:

import datetime

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self, *args, **kwargs):
        super(BillBoardSpider, self).__init__(*args, **kwargs)

        # Queue one chart URL per week, from 1975-12-27 through the end of 2012.
        date = datetime.date(year=1975, month=12, day=27)
        self.start_urls = []
        while date.year < 2013:
            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        item = BillBoardItem()
        item['date'] = date
        # Keep only the first chart entry that parses cleanly.
        for song in songs:
            try:
                track = song.select('.//header/h1/text()').extract()[0]
                item['song'] = track.rstrip()
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
                break
            except IndexError:
                # extract() returned nothing for this entry; try the next one.
                continue

        yield item

Could it be that you've been banned? http://doc.scrapy.org/en/0.16/topics/practices.html#bans – Tiago

Answer

The spider works for me and scrapes the data without any problems. So, as @Tiago suspected, you've been banned.

For the future, read how to avoid getting banned and adjust your Scrapy settings accordingly. I would start by increasing DOWNLOAD_DELAY and rotating your IP.
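As a rough sketch of what adjusting the settings could look like, the fragment below would go in the project's settings.py. The setting names are standard Scrapy settings, but every value is a guess meant as a starting point rather than something tuned for billboard.com, and the user agent string is a hypothetical placeholder. Rotating IPs needs a proxy pool and is not shown here.

```python
# settings.py (fragment): illustrative values, not tuned for any particular site

# Wait roughly 2 seconds between requests to the same site (assumed value).
DOWNLOAD_DELAY = 2.0

# Vary the actual wait between 0.5x and 1.5x DOWNLOAD_DELAY so requests look less regular.
RANDOMIZE_DOWNLOAD_DELAY = True

# Send only one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Give up on a slow page sooner than the 180 seconds seen in the log (assumed value).
DOWNLOAD_TIMEOUT = 60

# Retry failed requests (timeouts, 502 responses) a few times before dropping them.
RETRY_ENABLED = True
RETRY_TIMES = 3

# Hypothetical placeholder: a user agent that identifies the crawler and a contact page.
USER_AGENT = "billboard-research-bot (+http://example.com/contact)"
```

If the 502 responses and timeouts keep appearing with these values, raising DOWNLOAD_DELAY further is the cheapest thing to try next.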

Also, consider switching to a real automated browser, such as selenium.

Also, check whether you can get the data from the RSS XML feed instead: http://www.billboard.com/rss
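To see what a feed like that offers, something along these lines could be used. It is only a sketch: it assumes the feed is ordinary RSS 2.0 (rss > channel > item, with title, link and pubDate children), which has not been checked against the actual Billboard feed, and SAMPLE_FEED and parse_rss_items are made-up names for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in generic RSS 2.0 form; this is not real Billboard data.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example chart feed</title>
    <item>
      <title>1: Example Song, Example Artist</title>
      <link>http://example.com/charts/1</link>
      <pubDate>Sat, 20 Jul 2013 00:00:00 +0000</pubDate>
    </item>
    <item>
      <title>2: Another Song, Another Artist</title>
      <link>http://example.com/charts/2</link>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text):
    """Return one dict per <item>, with None for any missing child element."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.findall("./channel/item"):
        items.append({
            "title": node.findtext("title"),
            "link": node.findtext("link"),
            "pub_date": node.findtext("pubDate"),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["title"], entry["pub_date"])
```

A feed usually carries only recent entries, so it may cover the current charts but not the 1975 to 2012 archive the spider walks through.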

Hope that helps.

alecxe - thanks, helpful as always! –

You're welcome. Let me know if you need help switching to selenium, should you decide to go that way. – alecxe