2017-02-08

Scraping multiple pages with Scrapy

I'm trying to scrape the Billboard year-end Hot 100 for every year. I have it working one year at a time, but I want it to crawl through all the years and collect that data as well. Here is my current code:

from scrapy import Spider
from scrapy.selector import Selector
from Billboard.items import BillboardItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

URL = "http://www.billboard.com/archive/charts/%/hot-100"

class BillboardSpider(Spider):
    name = 'Billboard_spider'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = [URL % 1958]

    def _init_(self):
        self.page_number = 1958

    def parse(self, response):
        print self.page_number
        print "----------"

        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()

        for row in rows:
            IssueDate = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            Song = Selector(text=row).xpath('//td[2]/text()').extract()
            Artist = Selector(text=row).xpath('//td[3]/a/text()').extract()

            item = BillboardItem()
            item['IssueDate'] = IssueDate
            item['Song'] = Song
            item['Artist'] = Artist

            yield item
        self.page_number += 1
        yield Request(URL % self.page_number)

But I'm getting the error: `ValueError: unsupported format character '/' (0x2F) at index 41` on the line `start_urls = [URL % 1958]`.

Any ideas? I want the code to automatically change the original `URL` from 1958 to 1959, and so on year by year, until it stops finding a table, and then close.

Answer


The error you're getting is because you're not using the correct syntax for string formatting. You can take a look here to see how it works. The reason it doesn't work in your particular case is that your URL is missing an "s":

URL = "http://www.billboard.com/archive/charts/%/hot-100" 

should be:

URL = "http://www.billboard.com/archive/charts/%s/hot-100" 

In any case, it's better to use new-style string formatting:

URL = "http://www.billboard.com/archive/charts/{}/hot-100" 
start_urls = [URL.format(1958)] 
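As a quick illustration (plain Python, no Scrapy needed), the old-style `%` operator chokes on a bare `%` that isn't followed by a conversion character, while both `%s` and `str.format` work fine:

```python
# Old-style formatting: a bare "%" must be followed by a conversion
# character such as "s" or "d", so "%/" raises ValueError.
broken = "http://www.billboard.com/archive/charts/%/hot-100"
try:
    broken % 1958
except ValueError as exc:
    print(exc)  # unsupported format character '/' ... at index 41

# Fixed old-style template: "%s" is a valid placeholder.
fixed = "http://www.billboard.com/archive/charts/%s/hot-100"
print(fixed % 1958)  # http://www.billboard.com/archive/charts/1958/hot-100

# New-style formatting with str.format produces the same URL.
template = "http://www.billboard.com/archive/charts/{}/hot-100"
print(template.format(1958))
```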

Moving on, your code has some other problems:

def _init_(self): 
    self.page_number=1958 

If you want to use an init function it should be named `__init__` (two underscores on each side), and since you are extending `Spider`, you also need to accept `*args` and `**kwargs` so you can call the parent class's constructor:

def __init__(self, *args, **kwargs):
    super(BillboardSpider, self).__init__(*args, **kwargs)
    self.page_number = 1958

That said, it sounds like you might be better off not using `__init__` at all, and instead generating all the URLs up front with a list comprehension:

start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year) 
        for year in range(1958, 2017)] 

`start_urls` will then look like this:

['http://www.billboard.com/archive/charts/1958/hot-100', 
'http://www.billboard.com/archive/charts/1959/hot-100', 
'http://www.billboard.com/archive/charts/1960/hot-100', 
'http://www.billboard.com/archive/charts/1961/hot-100', 
... 
'http://www.billboard.com/archive/charts/2016/hot-100'] 
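The comprehension can be checked in plain Python; note that `range(1958, 2017)` stops at 2016, so use `range(1958, 2018)` if the 2017 chart should be included:

```python
# Build one archive URL per year, 1958 through 2016 inclusive.
start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
              for year in range(1958, 2017)]

print(len(start_urls))   # 59 years
print(start_urls[0])     # http://www.billboard.com/archive/charts/1958/hot-100
print(start_urls[-1])    # http://www.billboard.com/archive/charts/2016/hot-100
```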

You're also not populating your `BillboardItem` correctly: if it is a plain object, it won't (by default) support item assignment, so:

item = BillboardItem() 
item['IssueDate'] = IssueDate 
item['Song'] = Song 
item['Artist'] = Artist 

should be:

item = BillboardItem() 
item.IssueDate = IssueDate 
item.Song = Song 
item.Artist = Artist 

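A minimal sketch of why the subscript assignment fails when `BillboardItem` is an ordinary class rather than a `scrapy.Item` (the class body here is a stand-in, since the question doesn't show the real one):

```python
class BillboardItem(object):
    """Plain-object stand-in for the question's item class."""
    pass

item = BillboardItem()
item.Song = "Nel Blu Dipinto Di Blu"   # attribute assignment works

try:
    item["Song"] = "Nel Blu Dipinto Di Blu"  # subscript assignment does not
except TypeError as exc:
    print(exc)  # 'BillboardItem' object does not support item assignment
```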
Although it's usually better to do this in the item class's initializer:

class BillboardItem(object):
    def __init__(self, issue_date, song, artist):
        self.issue_date = issue_date
        self.song = song
        self.artist = artist

and then create the item with `item = BillboardItem(IssueDate, Song, Artist)`.

更新

Anyway, for this project I cleaned up your code (and created the `BillboardItem` myself, since I don't know exactly what yours looks like):

from scrapy import Spider, Item, Field
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request


class BillboardItem(Item):
    issue_date = Field()
    song = Field()
    artist = Field()


class BillboardSpider(Spider):
    name = 'billboard'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
                  for year in range(1958, 2017)]

    def parse(self, response):
        print(response.url)
        print("----------")

        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()

        for row in rows:
            issue_date = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            song = Selector(text=row).xpath('//td[2]/text()').extract()
            artist = Selector(text=row).xpath('//td[3]/a/text()').extract()

            item = BillboardItem(issue_date=issue_date, song=song, artist=artist)

            yield item

Hope this helps. :)


Thanks! Will try this ASAP. – DataScienceAmateur


I'm getting this error: `ERROR: Spider must return Request, BaseItem, dict or None, got 'BillboardItem'` – DataScienceAmateur


Yes, I fixed the code so that `BillboardItem` is an actual `scrapy.Item` rather than just an object. It should work now. – sxn