2012-08-14

Setting a sticky cookie in Scrapy

The site I am scraping has a piece of JavaScript that sets a cookie, and the backend checks for it to make sure JS is enabled. Extracting the cookie from the HTML code is easy enough, but setting it in Scrapy seems to be the problem. So my code is:

import re

from scrapy.http import Request
from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(InitSpider):
    ...
    rules = (Rule(SgmlLinkExtractor(allow=(r'products/./index\.html',)),
                  callback='parse_page'),)

    def init_request(self):
        return Request(url=self.init_url, callback=self.parse_js)

    def parse_js(self, response):
        # The page contains something like setCookie('name', 'value', ...)
        match = re.search(r"setCookie\('(.+?)',\s*?'(.+?)',", response.body, re.M)
        if match:
            cookie = match.group(1)
            value = match.group(2)
        else:
            raise Exception("Did not find the cookie", response.body)
        return Request(url=self.test_page, callback=self.check_test_page,
                       cookies={cookie: value})

    def check_test_page(self, response):
        if 'Welcome' in response.body:
            self.initialized()

    def parse_page(self, response):
        # scraping ...
        ...

I can see that the content of check_test_page is available and the cookie works perfectly. But it never gets to parse_page, because without the right cookie the CrawlSpider does not see any links. Is there a way to set a cookie for the duration of a scraping session? Or do I have to use BaseSpider and add the cookie to every request manually?

A less desirable alternative would be to set the cookie (the value never seems to change) somehow through the Scrapy configuration files. Is that possible?
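The cookie-extraction step from the question can be exercised outside Scrapy. A minimal sketch, assuming the page embeds a call like setCookie('name', 'value', ...) — the snippet, cookie name, and value below are made up for illustration:

```python
import re

# Hypothetical fragment of the page's JavaScript.
body = "<script>setCookie('js_check', 'abc123', 30);</script>"

# Same pattern as in parse_js: capture the cookie name and value.
match = re.search(r"setCookie\('(.+?)',\s*?'(.+?)',", body)
if match:
    cookie, value = match.group(1), match.group(2)
else:
    raise Exception("Did not find the cookie")

print(cookie, value)  # js_check abc123
```

Once the pair is extracted, it can be passed to a Request via its cookies argument, as the question's parse_js does.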


Scrapy passes along all cookies by default: http://doc.scrapy.org/en/latest/faq.html#do-scrapy-manage-cookies-automatically – 2012-08-15 17:51:25


Those are the cookies the server sets. As far as I can see, there is no way to add a permanent cookie from the client side (Scrapy). It has to be done separately for every request – Leo 2012-08-16 09:55:35

Answers

It turns out that InitSpider is a BaseSpider. So it looks like 1) there is no way to use CrawlSpider in this scenario, and 2) there is no way to set a sticky cookie
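One workaround not mentioned in the thread is to attach the cookie to every outgoing request in a single place, downloader-middleware style, instead of per request in the spider. A framework-free sketch of that pattern — DummyRequest stands in for scrapy's Request, and the class and cookie names are made up:

```python
class DummyRequest:
    """Stand-in for scrapy.http.Request, just enough for this demo."""
    def __init__(self, url, cookies=None):
        self.url = url
        self.cookies = cookies or {}

class StickyCookieMiddleware:
    """Adds a fixed cookie to every request unless it is already set."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def process_request(self, request, spider=None):
        request.cookies.setdefault(self.name, self.value)
        return None  # returning None lets the request continue on

mw = StickyCookieMiddleware('js_check', 'abc123')
req = DummyRequest('http://example.com/products/a/index.html')
mw.process_request(req)
print(req.cookies)  # {'js_check': 'abc123'}
```

In real Scrapy the same idea would be a downloader middleware registered in the settings, so the cookie rides along even on requests generated by CrawlSpider rules.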

I haven't used InitSpider before.

Looking at the code of scrapy.contrib.spiders.init.InitSpider, I see:

def initialized(self, response=None): 
    """This method must be set as the callback of your last initialization 
    request. See self.init_request() docstring for more info. 
    """ 
    self._init_complete = True 
    reqs = self._postinit_reqs[:] 
    del self._postinit_reqs 
    return reqs 

def init_request(self): 
    """This function should return one initialization request, with the 
    self.initialized method as callback. When the self.initialized method 
    is called this spider is considered initialized. If you need to perform 
    several requests for initializing your spider, you can do so by using 
    different callbacks. The only requirement is that the final callback 
    (of the last initialization request) must be self.initialized. 

    The default implementation calls self.initialized immediately, and 
    means that no initialization is needed. This method should be 
    overridden only when you need to perform requests to initialize your 
    spider 
    """ 
    return self.initialized() 
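The key detail in this source is that initialized() returns the requests that were queued while the spider was still initializing, so the final init callback must return its result. A toy, framework-free model of that buffering pattern (not real Scrapy code; the class name and requests are made up):

```python
class MiniInitSpider:
    """Toy model of InitSpider's deferred-request queue."""
    def __init__(self):
        self._init_complete = False
        self._postinit_reqs = []

    def queue(self, req):
        # Requests arriving before initialization are buffered.
        self._postinit_reqs.append(req)

    def initialized(self, response=None):
        # Mirrors InitSpider.initialized: return the buffered requests.
        # The callback that calls this must pass them on, or they are lost.
        self._init_complete = True
        reqs = self._postinit_reqs[:]
        del self._postinit_reqs
        return reqs

spider = MiniInitSpider()
spider.queue('GET /products/a/index.html')
spider.queue('GET /products/b/index.html')
pending = spider.initialized()
print(pending)  # both buffered requests come back
```

This is why dropping the return value of self.initialized(), as the question's check_test_page does, silently discards the crawl.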

You wrote:

I can see that the content of check_test_page is available and the cookie works perfectly. But it never gets to parse_page, because without the right cookie the CrawlSpider does not see any links.

I think parse_page is not called because you never made a request with self.initialized as the callback.

I think this should work:

def check_test_page(self, response):
    if 'Welcome' in response.body:
        return self.initialized()

You are right, I had already looked at the source code myself. But it turns out that InitSpider is a BaseSpider anyway. So it looks like 1) there is no way to use CrawlSpider in this scenario, and 2) there is no way to set a sticky cookie – Leo 2012-08-15 15:50:07