This is my first experience with web scraping, and I don't know whether I'm doing it well. The point is that I want to crawl and scrape at the same time. How can I crawl links and scrape their data in the same run?
- Get all the links I will scrape later
- Store them in MongoDB
- Visit them one by one and scrape their content
```python
# Crawling: get all links to be scraped later on
from scrapy import Spider, Request
from scrapy.selector import Selector

class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start
                  for start in xrange(0, 10000, 20)]

    def parse(self, response):
        # follow the pagination link, if there is one
        next_page = Selector(response).xpath(
            '//li[@class="active"]/following-sibling::li[1]/a/@href').extract()
        if next_page:
            yield Request("https://" + next_page[0], callback=self.parse)

        # collect every offer link on the current page
        links = Selector(response).xpath(
            '//div[@class="row-fluid job-details pointer"]'
            '/div[@class="bloc-right"]/div[@class="row-fluid"]')
        for link in links:
            item = Link()
            item['url'] = response.urljoin(link.xpath('a/@href')[0].extract())
            yield item

# Scraping: get all the stored links on MongoDB and scrape them????
```
Hey, thanks a lot. The site I'm scraping is an e-commerce site where people sell items, and once an item is sold they remove it. So, to know which products sell quickly, I think I have to save the links so I can check later whether they were removed or not. Also, if it is possible to scrape the content of each link before storing that link on MongoDB, please tell me how to do it. –
If the links to individual products follow some common pattern, you are better off using a [`CrawlSpider`](https://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider) with appropriate rules. –
Yes, individual products, but is there a tutorial for that somewhere? I want to visit every link and extract the data exposed there... –