
I want to follow all the links on a web page that point to PDF files and store those PDF files on my system: in effect, one spider calling another spider in a web crawler built with Scrapy.

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import Request 
from bs4 import BeautifulSoup 


class spider_a(BaseSpider):
    name = "Colleges"
    allowed_domains = ["www.abc.org"]
    start_urls = [
        "http://www.abc.org/appwebsite.html",
        "http://www.abc.org/misappengineering.htm",
    ]

    def parse(self, response):
        # parse the page with BeautifulSoup and print every link that points to a PDF
        soup = BeautifulSoup(response.body)
        for link in soup.find_all('a'):
            download_link = link.get('href')
            if download_link and '.pdf' in download_link:
                pdf_url = "http://www.abc.org/" + download_link
                print pdf_url

With the above code I am able to find the links to the pages where the PDF files are located.

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 

class FileSpider(BaseSpider):
    name = "fspider"
    allowed_domains = ["www.aicte-india.org"]
    start_urls = [
        "http://www.abc.org/downloads/approved_institut_websites/an.pdf#toolbar=0&zoom=85"
    ]

    def parse(self, response):
        # the response body is the PDF itself; write it to disk under its file name
        filename = response.url.split("/")[-1]
        open(filename, 'wb').write(response.body)

With this code I can save the body of the pages listed in start_urls.

Is there a way to combine these two spiders so that the PDF files get saved just by running my crawler?

Answer


Why do you need two spiders for this?

from urlparse import urljoin
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class spider_a(BaseSpider):
    ...
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # extract every href containing ".pdf" and request it,
        # resolving relative links against the current page URL
        for href in hxs.select('//a/@href[contains(., ".pdf")]').extract():
            yield Request(urljoin(response.url, href),
                          callback=self.save_file)

    def save_file(self, response):
        # the downloaded body is the PDF itself; save it under its file name
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
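
Assuming the combined spider keeps the name "Colleges" from the question (the name attribute is whatever your spider declares), you would run it from the Scrapy project directory in the usual way, and the PDFs would land in the current working directory:

scrapy crawl Colleges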

Hi @steven, thanks for the help, but I am getting the following error: exceptions.AttributeError: 'HtmlXPathSelector' object has no attribute 'find' – user2253803 2013-04-23 04:58:23


That's because you need to use 'select', not 'find'... and if you are using Scrapy you don't need Beautiful Soup. – 2013-04-23 13:41:49
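
For reference, a minimal sketch of the two link-extraction styles side by side, assuming the same response object; extract_pdf_links is a hypothetical helper name, not part of either snippet above. The XPath version is why BeautifulSoup is unnecessary here:

from bs4 import BeautifulSoup
from scrapy.selector import HtmlXPathSelector

def extract_pdf_links(response):
    # BeautifulSoup style, as in the question: find_all() over all <a> tags, then filter hrefs
    soup = BeautifulSoup(response.body)
    bs_links = [a.get('href') for a in soup.find_all('a')
                if a.get('href') and '.pdf' in a.get('href')]

    # Scrapy selector style, as in the answer: select() (not find()) with an XPath expression
    hxs = HtmlXPathSelector(response)
    xpath_links = hxs.select('//a/@href[contains(., ".pdf")]').extract()

    return bs_links, xpath_links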