
This is my first attempt at Scrapy. After doing a bit of research I've picked up the basics, and now I'm trying to scrape the data from a table with Scrapy, but it isn't working. See the source code below.

items.py

from scrapy.item import Item, Field 

class Digi(Item): 

    sl = Field() 
    player_name = Field() 
    dismissal_info = Field() 
    bowler_name = Field() 
    runs_scored = Field() 
    balls_faced = Field() 
    minutes_played = Field() 
    fours = Field() 
    sixes = Field() 
    strike_rate = Field() 

digicric.py

from scrapy.spider import Spider 
from scrapy.selector import Selector 
from crawler01.items import Digi 

class DmozSpider(Spider): 
    name = "digicric" 
    allowed_domains = ["digicricket.marssil.com"] 
    start_urls = ["http://digicricket.marssil.com/match/MatchData.aspx?op=2&match=1250"] 

    def parse(self, response):

        sel = Selector(response)
        sites = sel.xpath('//*[@id="ctl00_ContentPlaceHolder1_divData"]/table[3]/tr')
        items = []

        for site in sites:
            item = Digi()
            item['sl'] = sel.xpath('td/text()').extract()
            item['player_name'] = sel.xpath('td/a/text()').extract()
            item['dismissal_info'] = sel.xpath('td/text()').extract()
            item['bowler_name'] = sel.xpath('td/text()').extract()
            item['runs_scored'] = sel.xpath('td/text()').extract()
            item['balls_faced'] = sel.xpath('td/text()').extract()
            item['minutes_played'] = sel.xpath('td/text()').extract()
            item['fours'] = sel.xpath('td/text()').extract()
            item['sixes'] = sel.xpath('td/text()').extract()
            item['strike_rate'] = sel.xpath('td/text()').extract()
            items.append(item)
        return items

Answers

The key problem is that you are using `sel` inside the loop instead of the row selector `site`. The other key problem is that your XPath expressions point at `td` elements generically; you need to address each `td` by index and map it to the corresponding item field.

Working solution:

def parse(self, response):
    sites = response.xpath('//*[@id="ctl00_ContentPlaceHolder1_divData"]/table[3]/tr')[1:-2]

    for site in sites:
        item = Digi()
        item['sl'] = site.xpath('td[1]/text()').extract()
        item['player_name'] = site.xpath('td[2]/a/text()').extract()
        item['dismissal_info'] = site.xpath('td[3]/text()').extract()
        item['bowler_name'] = site.xpath('td[4]/text()').extract()
        item['runs_scored'] = site.xpath('td[5]/b/text()').extract()
        item['balls_faced'] = site.xpath('td[6]/text()').extract()
        item['minutes_played'] = site.xpath('td[7]/text()').extract()
        item['fours'] = site.xpath('td[8]/text()').extract()
        item['sixes'] = site.xpath('td[9]/text()').extract()
        item['strike_rate'] = site.xpath('td[10]/text()').extract()
        yield item

It correctly outputs 11 items.
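
A side note: extract() returns a list of strings, so each field above ends up as a one-element list, often with surrounding whitespace from the table markup. If you prefer plain, stripped strings, a small helper along these lines can be used inside the loop (first_text is only an illustrative name, not part of the original code):

def first_text(cell):
    # Return the first extracted string, stripped of whitespace,
    # or None if the XPath matched nothing.
    values = cell.extract()
    return values[0].strip() if values else None

# e.g. inside the for loop:
item['player_name'] = first_text(site.xpath('td[2]/a/text()'))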

It shows an error. Here is the error screenshot: [error screenshot](http://i.imgur.com/HPh5lia.png), and here is the code: [link](http://i.imgur.com/InxV60O.png) [link](http://i.imgur.com/XtKyOkr.png) – 2015-04-06 06:17:34

@TanzibHossainNirjhor Strange, it works for me. Which Scrapy version are you using? – alecxe 2015-04-06 09:26:51

[Scrapy 0.24.5] [Python 2.7.9] [PIP 6.0.8] [Windows 8.1] – 2015-04-06 16:41:14

I just ran your code with Scrapy and it worked perfectly. What exactly isn't working for you?

P.S. This should really be a comment, but I don't have enough reputation yet... I'll edit or close this answer as needed.

Edit:

I think you should `yield item` inside the loop rather than building a list and returning it at the end. The rest of the code should be fine.

Here is an example from the Scrapy documentation:

import scrapy 
from myproject.items import MyItem 

class MySpider(scrapy.Spider): 
    name = 'example.com' 
    allowed_domains = ['example.com'] 
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
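
Applied to the spider from the question, the change looks roughly like this (just a sketch: it only swaps the build-a-list-and-return pattern for yield, and the per-column td indexing shown in the other answer is still needed to pull meaningful values):

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//*[@id="ctl00_ContentPlaceHolder1_divData"]/table[3]/tr')

    for site in sites:
        item = Digi()
        # ... populate the item fields from `site` (see the other answer) ...
        yield item  # yield each item as it is built instead of appending to a list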

The problem is that the loop runs, but no data gets scraped into the items. – 2015-04-06 06:01:44