I assume the URLs you want to follow lead to pages with the same or a similar structure. If that's the case, you should do something like this:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = 'yourCrawler'
    allowed_domains = ['domain.com']  # must be a list, not a plain string
    start_urls = ["http://www.domain.com/example/url"]

    def parse(self, response):
        # Parse any elements you need from the start_urls and, optionally,
        # store them as Items.
        # See http://doc.scrapy.org/en/latest/topics/items.html
        s = Selector(response)
        urls = s.xpath('//div[@id="example"]//a/@href').extract()
        for url in urls:
            yield Request(url, callback=self.parse_following_urls, dont_filter=True)

    def parse_following_urls(self, response):
        # Parsing rules for the followed pages go here
        pass
Otherwise, if you want to follow URLs that lead to pages with different structures, you can define a specific callback method for each kind of page (something like parse1, parse2, parse3, ...).
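For instance, here is a minimal sketch of that approach; the XPath class selectors ("article", "profile") and the callback names are hypothetical placeholders for whatever distinguishes your link types:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class MultiPageCrawler(CrawlSpider):
    name = 'multiPageCrawler'
    allowed_domains = ['domain.com']
    start_urls = ["http://www.domain.com/example/url"]

    def parse(self, response):
        s = Selector(response)
        # Hypothetical selectors: send each kind of link to its own
        # callback because the target pages are structured differently.
        for url in s.xpath('//a[@class="article"]/@href').extract():
            yield Request(url, callback=self.parse_article, dont_filter=True)
        for url in s.xpath('//a[@class="profile"]/@href').extract():
            yield Request(url, callback=self.parse_profile, dont_filter=True)

    def parse_article(self, response):
        # Parsing rules for article-style pages go here
        pass

    def parse_profile(self, response):
        # Parsing rules for profile-style pages go here
        pass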
I think you should re-read the answer to your [earlier question](http://stackoverflow.com/questions/27779889/scraping-many-pages-using-scrapy). You don't generate a list of URLs; you return a list of new Request objects for those URLs from start_requests. – fnl
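A minimal sketch of what fnl is pointing at, assuming the pages follow a numbered-URL pattern (the page range and query parameter here are hypothetical):

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request

class PagedCrawler(CrawlSpider):
    name = 'pagedCrawler'
    allowed_domains = ['domain.com']

    def start_requests(self):
        # Instead of building a flat list of URL strings, yield a Request
        # object for each page; Scrapy schedules them for crawling.
        for page in range(1, 11):  # hypothetical page range
            yield Request('http://www.domain.com/example/url?page=%d' % page,
                          callback=self.parse)

    def parse(self, response):
        # Parsing rules go here
        pass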