I know there are a dozen or so questions related to this, but none of the ones I've seen really deal with a spider that uses more than one parse method... so: a Scrapy CrawlSpider, with the problematic code below.
I'm scraping a website, starting from a category page. I grab the links to the product categories, then try to use the CrawlSpider's rules to automatically step through each category's "next page" links, scraping certain information from the page at each step.
The problem is that the spider only visits the first page of each category, and seems to ignore the follow=True aspect of the Rule I set. So here's the code; maybe someone can help:
start_urls = ["http://home.mercadolivre.com.br/mais-categorias/"]

rules = (
    # I would like this to force the spider to crawl through the pages,
    # calling the product parser each time
    Rule(LxmlLinkExtractor(allow=(),
                           restrict_xpaths='//*[@id="results-section"]/div[2]/ul/li[@class="pagination__next"]'),
         follow=True, callback='parse_product_links'),
)
def parse(self, response):
    categories = CategoriesItem()
    #categories['categoryLinks'] = []
    for link in LxmlLinkExtractor(allow=('(?<=http://lista.mercadolivre.com.br/delicatessen/)(?:whisky|licor|tequila|vodka|champagnes)'),
                                  restrict_xpaths=("//body")).extract_links(response):
        categories['categoryURL'] = link.url
        yield Request(link.url, meta={'categoryURL': categories['categoryURL']}, callback=self.parse_product_links)

# ideally this function would grab the product links from each page
def parse_product_links(self, response):
    # I have this built out in my code, but it isn't necessary,
    # so I wanted to keep this as de-cluttered as possible
    pass
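As a side note, the lookbehind pattern passed to `allow=` above can be sanity-checked with the plain `re` module, outside Scrapy. This is just an illustration; the sample URLs are made up, and the pattern is copied as-is (its dots are unescaped, so they match any character; escaping them would be stricter):

```python
import re

# the same pattern used in the spider's allow= argument
pattern = re.compile(
    r'(?<=http://lista.mercadolivre.com.br/delicatessen/)'
    r'(?:whisky|licor|tequila|vodka|champagnes)'
)

# hypothetical URLs for illustration only
urls = [
    "http://lista.mercadolivre.com.br/delicatessen/whisky",
    "http://lista.mercadolivre.com.br/delicatessen/vodka",
    "http://lista.mercadolivre.com.br/delicatessen/queijos",
]

# only URLs whose path segment is in the alternation survive the filter
matched = [u for u in urls if pattern.search(u)]
print(matched)  # the "queijos" URL is filtered out
```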
Would appreciate any help you can give, because it looks like I don't fully understand how the Rules interact with the methods I'd like them to use (which is why I have 'parse_product_links' appearing in two places).
Hey bud, you shouldn't be using functions inside the rules... they go at the top, in rules... think of the rules like step-by-step instructions that take you from your initial page to your items... each rule is there to get the links that get you there. https://stackoverflow.com/questions/15192362/how-to-properly-use-rules-restrict-xpaths-to-crawl-and-parse-urls-with-scrapy – scriptso
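For what it's worth, the usual cause of "my rules are ignored" in a CrawlSpider is exactly what this code does: overriding `parse()`. CrawlSpider implements `parse()` itself to apply the rules, and exposes `parse_start_url()` as the hook for your own logic, so defining your own `parse()` silently replaces the rule machinery. A stdlib-only sketch of that mechanism (the class and attribute names here are made up; only `parse`/`parse_start_url` mirror Scrapy's actual API):

```python
class CrawlSpiderLike:
    """Toy stand-in for CrawlSpider's dispatch logic."""

    def parse(self, response):
        # CrawlSpider's own parse(): calls the user hook, THEN follows rule links
        yield from self.parse_start_url(response)
        for link in response.get("links", []):
            yield ("follow", link)  # what Rule(..., follow=True) would schedule

    def parse_start_url(self, response):
        # hook intended for user override; default does nothing
        return iter(())


class GoodSpider(CrawlSpiderLike):
    def parse_start_url(self, response):  # correct: override the hook
        yield ("item", response["url"])


class BadSpider(CrawlSpiderLike):
    def parse(self, response):  # wrong: shadows the rule machinery entirely
        yield ("item", response["url"])


response = {"url": "category-page", "links": ["page2", "page3"]}
good = list(GoodSpider().parse(response))  # yields the item AND follow requests
bad = list(BadSpider().parse(response))    # yields only the item; pagination is lost
```

In the original spider, renaming `parse` to something the engine doesn't use internally (or moving its logic into `parse_start_url`) lets the pagination Rule fire again.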