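Using Scrapy to find and download PDF files from a website

I have been tasked with pulling PDF files from websites using Scrapy. I am not new to Python, but Scrapy is very new to me. I have been experimenting with the console and a few rudimentary spiders. I found and modified this code: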

from urllib.parse import urljoin  # Python 3; the Python 2 module was `urlparse`

import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        # Request every link on the start page whose href ends in .pdf
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        # Name the local file after the last segment of the URL
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)

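I run this code from the command line with

scrapy crawl mySpider

and I get nothing back. I didn't create a Scrapy item because I only want to crawl and download the files, with no metadata. I would appreciate any help.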
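The link-filtering step in `parse` can be tried on its own, without running a crawl. The following is a minimal sketch, assuming Python 3 and its `urllib.parse.urljoin`; the helper name `pdf_links` and the sample hrefs are hypothetical and not taken from the real page. Unlike the spider, it compares the extension case-insensitively.

```python
from urllib.parse import urljoin


def pdf_links(base_url, hrefs):
    """Return absolute URLs for every href that points at a .pdf file."""
    return [urljoin(base_url, h) for h in hrefs if h.lower().endswith(".pdf")]


# Hypothetical hrefs, for illustration only.
base = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
sample = ["assets/example.pdf", "/us/en/other/report.PDF", "insights/article.html"]
links = pdf_links(base, sample)
for link in links:
    print(link)
```

If the list comes back empty for the hrefs actually present on the start page, the page simply has no direct PDF links, and the spider will yield no requests.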

Source

2016-03-21 Murface

Can you share the logs? – eLRuLL

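The spider logic looks incorrect.

I had a quick look at your website, and there seem to be several types of pages:

1. http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html, the initial page
2. Pages for specific articles, e.g. http://www.pwc.com/us/en/tax-services/publications/insights/australia-introduces-new-foreign-resident-cgt-withholding-regime.html, which can be reached from page #1
3. The actual PDF locations, e.g. http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/pwc-wotc-precertification-period-extended-to-june-29.pdf, which can be reached from page #2

So the correct logic is: fetch page #1 first, then fetch the #2 pages it links to, and only then download the #3 files.
However, your spider tries to extract links to the #3 files directly from page #1.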
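The three-level structure can be illustrated without any network access. This is a sketch only, using the standard library instead of Scrapy: the `SITE` dictionary, the `example.com` URLs, and the helper names `LinkCollector`, `links`, and `find_pdfs` are all hypothetical stand-ins for the real pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links(page_url, html):
    """Return every link on a page as an absolute URL."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(page_url, h) for h in parser.hrefs]


# Hypothetical miniature site: URL -> HTML body.
SITE = {
    "http://example.com/index.html":
        '<a href="articles/a1.html">A1</a><a href="articles/a2.html">A2</a>',
    "http://example.com/articles/a1.html":
        '<a href="../assets/a1.pdf">Download</a>',
    "http://example.com/articles/a2.html":
        '<a href="../assets/a2.pdf">Download</a><a href="/index.html">Home</a>',
}


def find_pdfs(start_url):
    """Walk page #1 -> pages #2 -> PDF links #3."""
    pdfs = []
    for article_url in links(start_url, SITE[start_url]):      # 1 -> 2
        for url in links(article_url, SITE[article_url]):      # 2 -> 3
            if url.endswith(".pdf"):
                pdfs.append(url)
    return pdfs


start = "http://example.com/index.html"
direct = [u for u in links(start, SITE[start]) if u.endswith(".pdf")]
print(direct)            # PDF links found directly on page #1
print(find_pdfs(start))  # PDF links found via the article pages
```

In this toy site, looking for `.pdf` links on the start page alone finds nothing, while going through the article pages finds both files.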

EDIT:

I have updated your code; here is a version that actually works:

import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Page #1: follow each article link in the results list
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # Page #2: follow each download link that ends in .pdf
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Page #3: write the PDF body to a file named after the last URL segment
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
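`save_pdf` names the file after the last `/`-separated piece of the URL, which would also include any query string or fragment. Should the PDF URLs ever carry those, the name could instead be derived from the URL path. This is a sketch under that assumption: the helper `pdf_filename` and its `default` parameter are hypothetical and not part of Scrapy.

```python
import posixpath
from urllib.parse import unquote, urlparse


def pdf_filename(url, default="download.pdf"):
    """Derive a local file name from the path component of a URL."""
    path = urlparse(url).path                 # drops ?query and #fragment
    name = posixpath.basename(unquote(path))  # last path segment, percent-decoded
    return name or default                    # fall back when the path ends in "/"


print(pdf_filename("http://example.com/assets/report.pdf?download=1"))
print(pdf_filename("http://example.com/files/my%20report.pdf#page=2"))
print(pdf_filename("http://example.com/"))
```

Inside the spider this would replace the `response.url.split('/')[-1]` expression with `pdf_filename(response.url)`.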

Source

2016-03-21 16:04:02 starrify

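Thanks, Starrify. – Murface

Just to better understand what is happening here: this follows the logic described above, and there is no recursion involved? – Murface

Yes, there is no "recursion" in the edited code (it may not be the exact word here, since Scrapy is an event-driven framework: there are only callbacks), and there was none in your original code either. :) Also, please accept this answer if you think it solves your problem. – starrify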
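The "only callbacks" point can be pictured with a toy model. The sketch below is not how Scrapy is implemented (its real engine is asynchronous and also handles deduplication, throttling, and more); it is a simplified, hypothetical queue loop in which each callback yields either further requests or finished items, and no callback ever calls another one directly.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for a request: a URL plus the callback that handles it.
Request = namedtuple("Request", ["url", "callback"])


def parse(url):
    # Page #1: yield one request per article
    for n in (1, 2):
        yield Request(f"{url}/article-{n}", parse_article)


def parse_article(url):
    # Page #2: yield a request for the article's PDF
    yield Request(f"{url}.pdf", save_pdf)


def save_pdf(url):
    # Page #3: yield an item instead of a further request
    yield {"saved": url}


def crawl(start_url):
    """Pull requests off a queue and hand each one to its callback."""
    queue = deque([Request(start_url, parse)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, Request):
                queue.append(result)   # schedule for later, do not call now
            else:
                items.append(result)
    return items


items = crawl("site")
print(items)
```

Each callback returns to the loop after yielding, so the call stack never grows with the depth of the crawl.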
