
I am iteratively scraping two pages for each ID. The first request works for all IDs, but the second one works only for the first ID. In other words, Scrapy scrapes one page N times in the loop, but the other page only once.

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/viewData']

    def parse(self, response):
        ids = ['1', '2', '3']

        for id in ids:
            # The following request scrapes for all ids
            yield scrapy.FormRequest.from_response(response,
                                                   ...
                                                   callback=self.parse1)

            # The following request scrapes only for the 1st id
            yield Request(url="http://example.com/viewSomeOtherData",
                          callback=self.intermediateMethod)

    def parse1(self, response):
        # Data scraped here using selectors
        pass

    def intermediateMethod(self, response):
        yield scrapy.FormRequest.from_response(response,
                                               ...
                                               callback=self.parse2)

    def parse2(self, response):
        # Some other data scraped here
        pass

I want to scrape two different pages for each ID.


Scrapy has a duplicate URL filter, and that is probably what is filtering out your requests. Try adding 'dont_filter=True' after 'callback='. – Steve


Thank you very much. Adding dont_filter solved my problem. –

Answer


Change the following lines:

yield Request(url="http://example.com/viewSomeOtherData", 
       callback=self.intermediateMethod) 

to:

yield Request(url="http://example.com/viewSomeOtherData", 
       callback=self.intermediateMethod, 
       dont_filter=True) 

This worked for me.

Scrapy has a duplicate URL filter, which was probably filtering out your requests. Adding dont_filter=True, as Steve suggested, disables that filter for the repeated request.
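
For completeness, here is a minimal sketch of how the corrected loop might look. The formdata field name ('id') is an assumption for illustration; everything else follows the question's structure, with dont_filter=True on the repeated URL so the duplicate filter does not drop the second and third requests.

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/viewData']

    def parse(self, response):
        for id in ['1', '2', '3']:
            # Each FormRequest carries different form data, so its request
            # fingerprint is unique and the duplicate filter lets all of them through.
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'id': id},  # hypothetical form field name
                callback=self.parse1)

            # This URL is identical on every iteration; dont_filter=True stops
            # the duplicate filter from dropping the repeated requests.
            yield Request(url="http://example.com/viewSomeOtherData",
                          callback=self.intermediateMethod,
                          dont_filter=True)

    def parse1(self, response):
        pass  # data scraped here using selectors

    def intermediateMethod(self, response):
        yield scrapy.FormRequest.from_response(
            response,
            callback=self.parse2)

    def parse2(self, response):
        pass  # some other data scraped here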