By default, you cannot access the original start URL. However, you can override the make_requests_from_url method and put the start URL into the request's meta. Then, in parse, you can extract it from there (and if you yield follow-up requests in your parse method, don't forget to forward that start URL in their meta as well).
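The forwarding pattern described above can be sketched without running a crawl. The Request and Response classes below are plain-Python stand-ins for Scrapy's objects (only the url and meta attributes are mimicked), and the follow-up URLs are made up for illustration:

```python
# Stand-ins for scrapy.http.Request / Response: just enough to show how a
# value stashed in meta survives a chain of requests.
class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class Response:
    def __init__(self, url, meta):
        self.url = url
        self.meta = meta

def make_requests_from_url(url):
    # Stash the start URL in meta at the very beginning of the chain.
    return Request(url, meta={'start_url': url})

def parse(response):
    # Forward the stashed start_url into every follow-up request.
    for next_url in ['http://www.example.com/a', 'http://www.example.com/b']:
        yield Request(next_url, meta={'start_url': response.meta['start_url']})

start = make_requests_from_url('http://www.example.com')
response = Response(start.url, start.meta)  # pretend the page was downloaded
followups = list(parse(response))
print([r.meta['start_url'] for r in followups])
# ['http://www.example.com', 'http://www.example.com']
```

However deep the crawl goes, each level only has to copy start_url from the incoming response's meta into the outgoing requests.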
I haven't worked with CrawlSpider yet, so maybe what Maxim suggests will work for you, but keep in mind that response.url holds the URL after any redirects.
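If the redirect issue matters to you, note that Scrapy's redirect middleware records every URL it redirected away from in the request's meta under the key 'redirect_urls', so the first entry (when present) is the URL originally requested. Here is a small helper sketching that; the _Fake* classes are stand-ins for real Scrapy objects, used only to illustrate the behaviour without running a crawl:

```python
def original_url(response):
    """Return the URL before any redirects, falling back to response.url."""
    redirect_urls = response.request.meta.get('redirect_urls')
    return redirect_urls[0] if redirect_urls else response.url

# Minimal stand-ins mimicking the attributes the helper touches.
class _FakeRequest:
    def __init__(self, meta):
        self.meta = meta

class _FakeResponse:
    def __init__(self, url, meta):
        self.url = url
        self.request = _FakeRequest(meta)

redirected = _FakeResponse('http://www.example.com/final',
                           {'redirect_urls': ['http://www.example.com/start']})
print(original_url(redirected))  # http://www.example.com/start

direct = _FakeResponse('http://www.example.com/page', {})
print(original_url(direct))  # http://www.example.com/page
```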
Here is an example of how I would do it, but it's only an example (adapted from the scrapy tutorial) and has not been tested:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.item import Item
from scrapy.selector import HtmlXPathSelector


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),
        # Extract links matching 'item.php' and parse them with the spider's
        # method parse_item.
        Rule(SgmlLinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    # Note: when writing crawl spider rules, avoid using parse as a callback,
    # since CrawlSpider uses the parse method itself to implement its logic.
    # If you override the parse method, the crawl spider will no longer work.
    # Here we override it deliberately, wrapping CrawlSpider.parse to forward
    # start_url into every request it produces.
    def parse(self, response):
        for request_or_item in CrawlSpider.parse(self, response):
            if isinstance(request_or_item, Request):
                request_or_item = request_or_item.replace(
                    meta={'start_url': response.meta['start_url']})
            yield request_or_item

    def make_requests_from_url(self, url):
        """Receive a URL and return a Request object (or a list of Request
        objects) to scrape.

        This method is used to construct the initial requests in the
        start_requests() method, and is typically used to convert urls to
        requests.
        """
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = Item()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        item['start_url'] = response.meta['start_url']
        return item
Ask if you have any questions. By the way, with PyDev's "Go to definition" feature you can browse scrapy's source and learn what arguments Request, make_requests_from_url, and other classes and methods expect. Digging into the code helps save time, even though it may seem difficult at first.
Could you use a start_url field in the same table as the weblinks table (like the DjangoItem you are using)? Sure, it creates redundant denormalization, but it might help if you want to avoid the explicit call. – zubinmehta