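Using Scrapy's FormRequest.from_response method to automate scraping of drop-down menu-wise data

I have been struggling with this for two days. I need to scrape the data for all "cadres" (categories) from this website. Unfortunately, the site only exposes the data through a "Select Cadre" drop-down menu that has no "all categories" option. To work around this, I used Scrapy's FormRequest.from_response method, but the spider returns a blank file with no data. Any help is appreciated. The code is below: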

import scrapy


class IASWinnerSpider(scrapy.Spider):

    name = 'iaswinner_list'
    allowed_domains = ['http://civillist.ias.nic.in']

    def start_requests(self):
        urls = ['http://civillist.ias.nic.in/UpdateCL/DraftCL.asp']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response, method='POST', formdata={'cboCadre': 'UT'},
            dont_click=True, callback=self.after_post)

    def after_post(self, response):
        table = response.xpath('/html/body/div/table//tr')
        for t in table:
            yield {
                'serial': t.xpath('td[1]/text()').extract(),
                'name': t.xpath('td[2]/text()').extract(),
                'qual': t.xpath('td[3]/text()').extract(),
                'dob': t.xpath('td[4]/text()').extract(),
                'post': t.xpath('td[5]/text()').extract(),
                'rem': t.xpath('td[6]/text()').extract(),
            }

Source

2017-08-19 Ias Chacha

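The code given is incomplete (see [mcve]). Consider adding a "__main__" section that demonstrates the problem. – ederag

If Linhart's answer meets your needs, please don't forget to mark it as 'accepted'. –

Yes, done. Thank you. –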

When I run your code, I see this in the log:

2017-08-19 15:52:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'civillist.ias.nic.in': <POST http://civillist.ias.nic.in/UpdateCL/DraftCL.asp>
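
The DEBUG line above comes from Scrapy's offsite filter, which compares each request's hostname against the entries in allowed_domains. The following is a simplified, standard-library-only sketch of that kind of comparison. The function name is_onsite is hypothetical and this is not Scrapy's actual code, but it illustrates why an entry that includes the http:// scheme can never match a bare hostname:

```python
import re
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's hostname matches an allowed domain or a subdomain of one.

    Simplified illustration only; an assumption about how a domain filter
    of this kind behaves, not Scrapy's real implementation.
    """
    if not allowed_domains:
        return True  # no restriction configured
    host = urlparse(url).hostname or ""
    # Each entry is treated as a literal domain name, optionally preceded by subdomains.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    pattern = r"^(.*\.)?(%s)$" % domains
    return re.match(pattern, host) is not None


url = "http://civillist.ias.nic.in/UpdateCL/DraftCL.asp"

# Entry written as a URL: the hostname never equals 'http://civillist.ias.nic.in'.
print(is_onsite(url, ["http://civillist.ias.nic.in"]))

# Entry written as a bare domain: the hostname matches.
print(is_onsite(url, ["civillist.ias.nic.in"]))
```

Under this reading, the POST request produced in parse is dropped before it is ever sent, so after_post never runs and the output file stays empty.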

Just change allowed_domains like this:

allowed_domains = ['civillist.ias.nic.in']

and it works.

Source

2017-08-19 13:55:01
