Conditional URL scraping with Scrapy

I want to use Scrapy on a website whose URL structure I don't know.

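I would like to:

Extract data only from pages that contain the XPath //div[@class="product-view"].
Extract and print (as CSV) the URL, plus the name and price found via their XPaths.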

When I run the script below with

scrapy crawl dmoz>test.txt

all I get is a random list of URLs.

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'site.com'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = "dmoz"
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        for url in response.xpath('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            if response.xpath('//div[@class="product-view"]'):
                url = response.extract()
                name = response.xpath('//div[@class="product-name"]/h1/text()').extract()
                price = response.xpath('//span[@class="product_price_details"]/text()').extract()
            yield Request(url, callback=self.parse)
            print url


2016-07-27 Ycon

What you are looking for here is scrapy.spiders.CrawlSpider.

That said, you almost got there with your own approach. Here is a fixed version.

from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor

# replaces the parse method of your spider
def parse(self, response):
    # parse this page: only product pages yield an item
    if response.xpath('//div[@class="product-view"]'):
        item = dict()
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="product-name"]/h1/text()').extract_first()
        item['price'] = response.xpath('//span[@class="product_price_details"]/text()').extract_first()
        yield item  # return an item with your data
    # other pages
    le = LinkExtractor()  # LinkExtractor is smarter than the XPath '//a/@href'
    for link in le.extract_links(response):
        yield Request(link.url)  # default callback is already self.parse

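Now you can simply run scrapy crawl myspider -o results.csv (substitute your spider's name; in the question it is dmoz) and Scrapy will write your items to a CSV file. Keep an eye on the log, though, and especially on the stats at the end: that is how you know whether something went wrong.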


2016-07-27 07:24:37 Granitosaurus
