2016-01-13

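I'm new to Scrapy and Python. I've spent several hours debugging and searching for useful answers, but I'm still stuck. I'm trying to extract data from www.pro-football-reference.com. Here is what I have so far; Scrapy never calls my callback function.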

import scrapy

from nfl_predictor.items import NflPredictorItem


class NflSpider(scrapy.Spider):
    name = "nfl2"
    allowed_domains = ["http://www.pro-football-reference.com/"]
    start_url = [
        "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
    ]

    def parse(self, response):
        print("parse")
        # Follow every box-score link found in the page's game table.
        for href in response.xpath('//*[@id="page_content"]/div[1]/table/tr/td/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_game_content)

    def parse_game_content(self, response):
        print("parse_game_content")
        items = []
        # One item per row of the team_stats table.
        for sel in response.xpath('//table[@id = "team_stats"]/tr'):
            item = NflPredictorItem()
            item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
            item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
            items.append(item)
        return items

I'm debugging with the parse command, invoked like this:

scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" 

I get the following output:

>>> STATUS DEPTH LEVEL 1 <<< 
# Scraped Items ------------------------------------------------------------ 
[] 

# Requests ----------------------------------------------------------------- 
[<GET http://www.pro-football-reference.com/years/2015/games.htm>, 
<GET http://www.nfl.com/scores/2015/REG1>, 
<GET http://www.pro-football-reference.com/boxscores/201509130buf.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130chi.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130crd.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130dal.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130den.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130htx.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130jax.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130nyj.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130rai.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130ram.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130sdg.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130tam.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509130was.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509140atl.htm>, 
<GET http://www.pro-football-reference.com/boxscores/201509140sfo.htm>] 

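Why does my code log requests for the links I want, yet never enter the parse_game_content function to actually scrape the data? I also tested parse_game_content as the parse function to make sure it scrapes the right data, and in that case it works fine.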

Thanks for your help!

Are you sure you have imported all the libraries?

Answer

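By default, the parse command fetches the given URL and parses it with the method passed via the --callback option, or with parse if none is given. In your case it only runs the parse function: the requests that parse yields are listed but not followed, so parse_game_content never runs. Pass --callback to the command, like this: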

scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" --callback=parse_game_content 
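
To see why the callback appears never to run, it may help to model what the parse command does at its default depth of 1: it runs one callback on the fetched page, reports the items and requests that callback produced, and stops without following those requests. The sketch below is a toy model in plain Python, not Scrapy's implementation; `FakeRequest`, `run_parse_command`, and the stand-in callbacks are hypothetical names introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for scrapy.Request: a URL plus the callback to run on it.
FakeRequest = namedtuple("FakeRequest", ["url", "callback"])


def parse_game_content(url):
    # Stand-in for the spider's second callback: yields one item per page.
    yield {"page": url}


def parse(url):
    # Stand-in for the spider's first callback: yields follow-up requests only.
    for game in ("201509130buf", "201509130chi"):
        yield FakeRequest("/boxscores/%s.htm" % game, parse_game_content)


def run_parse_command(url, callback, depth=1):
    """Toy model: run callbacks level by level, following requests up to `depth`."""
    items, pending = [], [FakeRequest(url, callback)]
    for _ in range(depth):
        next_level = []
        for request in pending:
            for result in request.callback(request.url):
                if isinstance(result, FakeRequest):
                    next_level.append(result)  # reported, followed only at the next level
                else:
                    items.append(result)
        pending = next_level
    return items, pending


start = "/boxscores/201509100nwe.htm"
items_d1, requests_d1 = run_parse_command(start, parse, depth=1)
items_d2, requests_d2 = run_parse_command(start, parse, depth=2)
direct_items, _ = run_parse_command(start, parse_game_content, depth=1)
```

With `depth=1` the model reports no items and two pending requests, which mirrors the output in the question. With `depth=2`, or with `parse_game_content` passed as the callback, items appear. Scrapy's parse command has a corresponding `--depth` (`-d`) option, so raising it to 2 should be another way to exercise both callbacks in one run.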

Also, it is better to change your parse_game_content function as follows:

def parse_game_content(self, response):
    print("parse_game_content")
    for sel in response.xpath('//table[@id="team_stats"]/tr'):
        item = NflPredictorItem()
        item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
        item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
        # Yield each item as soon as it is built instead of collecting a list.
        yield item
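
The difference between returning a list and yielding can be tried without Scrapy. This is a minimal sketch, assuming the table rows are already available as plain (away, home) pairs; the `rows` values are made-up placeholders and both function names are hypothetical. Both forms produce the same items, but the generator hands each one over as soon as it is built rather than after the whole loop finishes.

```python
# Made-up placeholder rows standing in for the team_stats table: (away, home) pairs.
rows = [("away-1", "home-1"), ("away-2", "home-2")]


def parse_returning_list(rows):
    # Original style: collect every item, then return the whole list.
    items = []
    for away, home in rows:
        items.append({"away_stats": away, "home_stats": home})
    return items


def parse_yielding(rows):
    # Suggested style: yield each item as soon as it is built.
    for away, home in rows:
        yield {"away_stats": away, "home_stats": home}


as_list = parse_returning_list(rows)
as_generator = list(parse_yielding(rows))
```

Scrapy callbacks may either return an iterable or be written as generators, so the choice here is mostly about readability and not holding every item in memory at once.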