Retrieving crawled URLs in Scrapy

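I have built a crawler with Scrapy to crawl a specific website. The crawler follows a URL if it matches a given regular expression, and calls a callback function if the URL matches another defined regular expression. The main purpose of the crawler is to extract all the required links within the website, not the content behind those links. Can anyone tell me how to print the list of all crawled links? The code is: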

name = "xyz"
allowed_domains = ["xyz.com"]
start_urls = ["http://www.xyz.com/Vacanciess"]
rules = (
    Rule(SgmlLinkExtractor(allow=[regex2]), callback='parse_item'),
    Rule(SgmlLinkExtractor(allow=[regex1]), follow=True),
)

def parse_item(self, response):
    #sel = Selector(response)
    #title = sel.xpath("//h1[@class='no-bd']/text()").extract()
    #print title
    print response

If I use print title instead of print response, the code works fine. But with the code above, when I try to print the actual response, it returns:

[xyz] DEBUG: Crawled (200) <GET http://www.xyz.com/urlmatchingregex2> (referer: http://www.xyz.com/urlmatchingregex1)
<200 http://www.xyz.com/urlmatchingregex2>

Could someone please help me retrieve the actual URL?

2014-04-04 sulav_lfc

You can print response.url in the parse_item method to print the crawled URL. It is documented in the Scrapy documentation for Response objects.

2014-04-04 23:41:58 shaktimaan

Thank you, that is exactly what I was looking for :)
