2015-08-17 77 views
2

This is my first attempt at creating a spider, so please bear with me if I have not done it correctly. Here is the link to the site I am trying to extract data from: http://www.4icu.org/in/. I want the entire list of universities displayed on the page, but when I run the spider below I get back an empty JSON file (the Scrapy spider does not return any results). Here is my items.py:

    import scrapy

    class CollegesItem(scrapy.Item):
        # define the fields for your item here like:
        link = scrapy.Field()

And here is the spider, colleges.py:

    import scrapy
    from scrapy.spider import Spider
    from scrapy.http import Request

    class CollegesItem(scrapy.Item):
        # define the fields for your item here like:
        link = scrapy.Field()

    class CollegesSpider(Spider):
        name = 'colleges'
        allowed_domains = ["4icu.org"]
        start_urls = ('http://www.4icu.org/in/',)

        def parse(self, response):
            return Request(
                url="http://www.4icu.org/in/",
                callback=self.parse_fixtures
            )

        def parse_fixtures(self, response):
            sel = response.selector
            for div in sel.css("col span_2_of_2>div>tbody>tr"):
                item = Fixture()
                item['university.name'] = tr.xpath('td[@class="i"]/span/a/text()').extract()
                yield item
+0

Wow, first you have to take a look at your code: there are several problems in it. And because you do not get any exception when running the spider, you can be sure that you never reach the `parse_fixtures` method, or at least the `for` loop. – GHajba

Answers

1

As the comments on the question point out, there are some problems with your code.

First of all, you do not need two methods: in your parse method you request the same URL that is already in start_urls.

To get some information from the site, try the following code:

    def parse(self, response): 
        for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'): 
            if tr.xpath(".//td[@class='i']"): 
                name = tr.xpath('./td[1]/a/text()').extract()[0] 
                location = tr.xpath('./td[2]//text()').extract()[0] 
                print name, location 

Then adjust it as needed to fill your item (or items).

As you can see, the extra tbody that your browser displays inside the table does not exist in the HTML that Scrapy scrapes. This means you should always question what you see in your browser.
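The tbody difference can be verified without hitting the site at all, using lxml (the HTML parser underneath Scrapy's selectors). This is a minimal sketch with a made-up table fragment, not the real 4icu.org markup:

```python
from lxml import html

# a raw table fragment, the way the server sends it (no tbody)
doc = html.fromstring("<table><tr><td class='i'>IIT Bombay</td></tr></table>")

# the browser-inspired path through tbody matches nothing...
print(doc.xpath('//table/tbody/tr/td/text()'))   # []

# ...while a tbody-free path (or a tolerant // step) works
print(doc.xpath('//table//tr/td/text()'))        # ['IIT Bombay']
```

Browsers insert the tbody element when building the DOM, so an XPath copied out of the developer tools often contains a step that is simply absent from the downloaded source.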

+0

Thank you for the guidance, it fetches the data. Below are the modified code and the results. –

0

Here is the command to run the spider:

>>scrapy crawl colleges -o mait.json 
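The -o flag hands every yielded item to Scrapy's feed exporter, which writes them to mait.json as a JSON list of objects (note that -o appends to an existing file, so remove old output between runs). Reading the feed back is plain json; this sketch uses sample data in the same shape rather than the real output file:

```python
import json

# sample in the shape of the exported feed (made-up data, not the real file)
feed = '[{"name": "Anna University", "location": "Chennai"}]'
items = json.loads(feed)
print(items[0]["name"], items[0]["location"])  # Anna University Chennai
```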

Following is the working code:

    import scrapy 
    from scrapy.spider import Spider 
    from scrapy.http import Request 

    class CollegesItem(scrapy.Item): 
        # define the fields for your item here like: 
        name = scrapy.Field() 
        location = scrapy.Field() 

    class CollegesSpider(Spider): 
        name = 'colleges' 
        allowed_domains = ["4icu.org"] 
        start_urls = ('http://www.4icu.org/in/',) 

        def parse(self, response): 
            for tr in response.xpath('//div[@class="section group"][5]/div[@class="col span_2_of_2"][1]/table//tr'): 
                if tr.xpath(".//td[@class='i']"): 
                    item = CollegesItem() 
                    item['name'] = tr.xpath('./td[1]/a/text()').extract()[0] 
                    item['location'] = tr.xpath('./td[2]//text()').extract()[0] 
                    yield item 

Here is a snippet of the results:

    [{"name": "Indian Institute of Technology Bombay", "location": "Mumbai"}, 
    {"name": "Indian Institute of Technology Madras", "location": "Chennai"}, 
    {"name": "University of Delhi", "location": "Delhi"}, 
    {"name": "Indian Institute of Technology Kanpur", "location": "Kanpur"}, 
    {"name": "Anna University", "location": "Chennai"}, 
    {"name": "Indian Institute of Technology Delhi", "location": "New Delhi"}, 
    {"name": "Manipal University", "location": "Manipal ..."}, 
    {"name": "Indian Institute of Technology Kharagpur", "location": "Kharagpur"}, 
    {"name": "Indian Institute of Science", "location": "Bangalore"}, 
    {"name": "Panjab University", "location": "Chandigarh"}, 
    {"name": "National Institute of Technology, Tiruchirappalli", "location": "Tiruchirappalli"}, .........