Yes, every time I grab a link I run it through the urlparse.urljoin method.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    ## only grab urls with "content" in the url name
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for i in urls:
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)
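As a quick sanity check on what urljoin actually does with relative versus root-relative hrefs (the example.com URLs below are made up for illustration):

import urlparse

base = 'http://example.com/listing/page1.html'  ## hypothetical page url

## relative href: resolved against the page's directory
print urlparse.urljoin(base, 'content/123')   # http://example.com/listing/content/123

## root-relative href: resolved against the site root
print urlparse.urljoin(base, '/content/123')  # http://example.com/content/123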
I take it you are trying to grab the whole URL so you can parse it? If so, a simple two-method setup on a BaseSpider will do the job: the parse method finds the links, and parse_url extracts what you want and sends the item on to your pipeline.
import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from zipgrabber.items import ZipgrabberItem  ## assumed import path; adjust to your project

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    ## only grab urls with "content" in the url name
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for i in urls:
        ## i[1:] drops the href's leading character before joining
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()  ## this grabs it
    return item
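For completeness, here is roughly how those two methods sit inside a spider class. This is just a sketch: the spider name, allowed_domains, start_urls, and the ZipgrabberItem import path are placeholders you would replace with your own values.

import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from zipgrabber.items import ZipgrabberItem  ## assumed import path

class ZipgrabberSpider(BaseSpider):
    name = 'zipgrabber'                   ## placeholder
    allowed_domains = ['example.com']     ## placeholder
    start_urls = ['http://example.com/']  ## placeholder

    ## first pass: collect the "content" links from the listing page
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
        for i in urls:
            yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

    ## second pass: pull the zip text out of each followed page
    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        item = ZipgrabberItem()
        item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()
        return item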