我想要報廢http://www.3andena.com/,本網站首先以阿拉伯語開頭,並將語言設置存儲在cookie中。如果您嘗試直接通過URL()訪問語言版本,則會造成問題並返回服務器錯誤。如何在scrapy中覆蓋/使用cookies
因此,我想將cookie值「store_language」設置爲「en」,然後使用此cookie值開始廢棄網站。
我使用CrawlSpider與一些規則。
這裏的代碼
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re
class AndenaSpider(CrawlSpider):
name = "andena"
domain_name = "3andena.com"
start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]
product_urls = []
rules = (
# The following rule is for pagination
Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
# The following rule is for produt details
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
)
def start_requests(self):
yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})
for url in self.start_urls:
yield Request(url, callback=self.parse_category)
def parse_category(self, response):
hxs = HtmlXPathSelector(response)
self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())
for product in self.product_urls:
yield Request(product, callback=self.parse_product)
def parse_product(self, response):
hxs = HtmlXPathSelector(response)
items = []
item = Product()
'''
some parsing
'''
items.append(item)
return items
SPIDER = AndenaSpider()
這裏的日誌:
2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)
我在發佈我的問題之前已經嘗試過了,但它不起作用 –
您能否提交您的源代碼? – VenkatH
我剛添加它 –