
I want to scrape http://www.3andena.com/. The site starts in Arabic by default and stores the language setting in a cookie. If you try to reach the language version directly through the URL, it causes a problem and returns a server error. How can I override/set cookies in Scrapy?

So I want to set the cookie "store_language" to "en" and only then start scraping the site with that cookie in place.

I am using a CrawlSpider with some rules.

Here is the code:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from bkam.items import Product

class AndenaSpider(CrawlSpider):
    name = "andena"
    domain_name = "3andena.com"
    start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]

    product_urls = []

    rules = (
        # The following rule is for pagination
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$',)), follow=True),
        # The following rule is for product details
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
    )

    def start_requests(self):
        yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language': 'en'})

        for url in self.start_urls:
            yield Request(url, callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)

        self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

        for product in self.product_urls:
            yield Request(product, callback=self.parse_product)

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Product()

        '''
        some parsing
        '''

        items.append(item)
        return items

SPIDER = AndenaSpider()

Here is the log:

2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en> 
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> 
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None) 
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10) 

Answers
See the Scrapy documentation for Requests and Responses.

You just need something like this:

request_with_cookies = Request(url="http://www.3andena.com", cookies={'store_language':'en'}) 

I already tried that before posting my question, but it does not work –


Could you post your source code? – VenkatH


I just added it –


Modify your code as follows:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, cookies={'store_language': 'en'}, callback=self.parse_category)

The scrapy.Request object accepts an optional cookies keyword argument; see the Scrapy documentation.
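For intuition, the cookies dict you pass to Request ends up serialized into a Cookie request header. This is not Scrapy's actual implementation, just a standard-library sketch (the cookie_header helper is hypothetical) of what that dict amounts to on the wire:

```python
from http.cookies import SimpleCookie

def cookie_header(cookies):
    # Serialize a dict of cookie name/value pairs into the
    # "name=value; name2=value2" form sent in a Cookie header.
    jar = SimpleCookie()
    for name, value in cookies.items():
        jar[name] = value
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

print(cookie_header({'store_language': 'en'}))  # store_language=en
```

So passing cookies={'store_language': 'en'} makes the very first request arrive with the language cookie already set, instead of relying on the Arabic default.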


Here is what I do with Scrapy 0.24.6:

from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):

    ...

    def make_requests_from_url(self, url):
        request = super(MySpider, self).make_requests_from_url(url)
        request.cookies['foo'] = 'bar'
        return request

Scrapy calls make_requests_from_url with each URL in the spider's start_urls attribute. What the code above does is let the default implementation create the request and then add a foo cookie with the value bar (or change the foo cookie's value to bar if, against all odds, one already exists on the request produced by the default implementation).

In case you are wondering what happens with requests that are not created from start_urls, let me add that Scrapy's cookie middleware will remember the cookies set with the code above and apply them to all future requests sharing the same domain as the request you explicitly added the cookies to.
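That per-domain persistence can be modeled with a few lines. This is a toy illustration of the behaviour described, not Scrapy's real CookiesMiddleware (the TinyCookieJar class is invented for this sketch, and it matches exact hostnames only, whereas the real middleware is domain-aware):

```python
from urllib.parse import urlparse

class TinyCookieJar:
    """Toy model of per-host cookie persistence: cookies set once
    are re-sent on every later request to the same host."""
    def __init__(self):
        self.by_host = {}

    def set(self, url, cookies):
        # Remember cookies under the hostname of the request URL.
        host = urlparse(url).hostname
        self.by_host.setdefault(host, {}).update(cookies)

    def for_request(self, url):
        # Cookies that would accompany a new request to this URL.
        return dict(self.by_host.get(urlparse(url).hostname, {}))

jar = TinyCookieJar()
jar.set('http://www.3andena.com/home.php?sl=en', {'store_language': 'en'})
# later requests to the same host automatically carry the cookie
print(jar.for_request('http://www.3andena.com/Kettles/'))  # {'store_language': 'en'}
```

This is why setting the cookie on the requests generated from start_urls is enough: every link the CrawlSpider rules follow on the same domain inherits it.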