
scrapy recursive link crawler with login - help me improve

To the best of my current knowledge, I have written a small web spider/crawler that can recursively crawl to a variable nesting depth and can optionally do a POST/GET login before crawling (if required).

Since I am a complete beginner, I would like to get some feedback, suggestions for improvement, or anything else you want to throw at it.

I am only including the parse function here. The full source code can be viewed on GitHub: https://github.com/cytopia/crawlpy

What I mainly want to make sure of is that the recursion combined with yield is as efficient as possible, and that I am doing it the right way.

Any comments on the code and on coding style are very welcome.

def parse(self, response):
    """
    Scrapy parse callback
    """

    # Get current nesting level
    if response.meta.has_key('depth'):
        curr_depth = response.meta['depth']
    else:
        curr_depth = 1

    # Only crawl the current page if we hit a HTTP-200
    if response.status == 200:
        hxs = Selector(response)
        links = hxs.xpath("//a/@href").extract()

        # We store already crawled links in this list
        crawled_links = []

        # Pattern to check proper link
        linkPattern = re.compile("^(?:http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:

            # Link could be a relative url from response.url
            # such as link: '../test', response.url: http://dom.tld/foo/bar
            if link.find('../') == 0:
                link = response.url + '/' + link
            # Prepend BASE URL if it does not have it
            elif 'http://' not in link and 'https://' not in link:
                link = self.base_url + link

            # If it is a proper link and is not checked yet, yield it to the Spider
            if (link
                    and linkPattern.match(link)
                    and link.find(self.base_url) == 0):
                    #and link not in crawled_links
                    #and link not in uniques):

                # Check if this url already exists
                re_exists = re.compile('^' + link + '$')
                exists = False
                for i in self.uniques:
                    if re_exists.match(i):
                        exists = True
                        break

                if not exists:
                    # Store the link
                    crawled_links.append(link)
                    self.uniques.append(link)

                    # Do we recurse?
                    if curr_depth < self.depth:
                        request = Request(link, self.parse)
                        # Add meta-data about the current recursion depth
                        request.meta['depth'] = curr_depth + 1
                        yield request
                    else:
                        # Nesting level too deep
                        pass
            else:
                # Link not in condition
                pass

        #
        # Final return (yield) to user
        #
        for url in crawled_links:
            #print "FINAL FINAL FINAL URL: " + response.url
            item = CrawlpyItem()
            item['url'] = url
            item['depth'] = curr_depth

            yield item
        #print "FINAL FINAL FINAL URL: " + response.url
        #item = CrawlpyItem()
        #item['url'] = response.url
        #yield item
    else:
        # NOT HTTP 200
        pass

Answer


Your whole code could be shortened to something like this:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor


def parse(self, response):
    # Get current nesting level
    curr_depth = response.meta.get('depth', 1)

    item = CrawlpyItem()  # could also just be `item = dict()`
    item['url'] = response.url
    item['depth'] = curr_depth
    yield item

    links = LinkExtractor().extract_links(response)
    for link in links:
        yield Request(link.url, meta={'depth': curr_depth + 1})

If I understand correctly, what you want to do here is broadly crawl all URLs and yield the depth and URL as items, right?

Scrapy already has duplicate-request filtering enabled by default, so you don't need to implement that logic yourself. Also, your parse() method will never receive anything other than 200 responses, so that check is useless.
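If you ever do need to opt out of those defaults, Scrapy exposes both knobs explicitly: dont_filter=True on a single Request bypasses the duplicate filter, and a handle_httpstatus_list attribute on the spider lets selected non-200 responses reach parse(). Below is a minimal sketch of both; the spider name and URL are placeholders, not part of the original project.

import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate the two opt-outs mentioned above.
    name = 'example'
    start_urls = ['http://example.com/']

    # Let 404 responses reach parse() instead of being dropped
    # by the HttpError spider middleware.
    handle_httpstatus_list = [404]

    def parse(self, response):
        # Duplicate URLs are normally dropped by the dupefilter;
        # dont_filter=True forces this request through anyway.
        yield scrapy.Request(response.url, callback=self.parse_again,
                             dont_filter=True)

    def parse_again(self, response):
        yield {'url': response.url, 'status': response.status}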

Edit: reworked to avoid stupidity.
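A side note on the depth bookkeeping: Scrapy's built-in DepthMiddleware already records the crawl depth in request/response meta and can cap it via the DEPTH_LIMIT setting, so the manual counter could be dropped altogether. A minimal sketch of that variant, with an illustrative spider name, URL and limit:

import scrapy
from scrapy.linkextractors import LinkExtractor


class DepthLimitedSpider(scrapy.Spider):
    # Illustrative spider relying on the built-in DepthMiddleware.
    name = 'depth_limited'
    start_urls = ['http://example.com/']

    # DepthMiddleware stops scheduling requests beyond this depth.
    custom_settings = {'DEPTH_LIMIT': 3}

    def parse(self, response):
        # DepthMiddleware fills in meta['depth']; the start response defaults to 0.
        yield {'url': response.url, 'depth': response.meta.get('depth', 0)}

        for link in LinkExtractor().extract_links(response):
            # No manual depth counting; the middleware increments it per hop.
            yield scrapy.Request(link.url)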


Thanks for the simplification, that looks much cleaner. However, there are two problems with it: I get a lot of duplicate links (e.g. when saving to JSON), and links outside my allowed domain are also being stored. What am I overlooking? – cytopia


@cytopia Good catch! There was a huge flaw in the spider: it returned the URLs before actually downloading them, so scrapy's duplicate filter and allowed_domains filter were never applied. I've fixed that now! – Granitosaurus
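For context on the allowed-domains part: the off-site filtering only applies to requests that are actually scheduled. The OffsiteMiddleware checks each outgoing request against the spider's allowed_domains attribute and silently drops everything else, which is why yielding URLs as items skipped it entirely. A minimal sketch, with a placeholder domain and spider name:

import scrapy
from scrapy.linkextractors import LinkExtractor


class OnSiteSpider(scrapy.Spider):
    # Illustrative spider; OffsiteMiddleware drops scheduled requests
    # whose host is not covered by allowed_domains.
    name = 'onsite'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {'url': response.url}
        for link in LinkExtractor().extract_links(response):
            # Off-site links get yielded here but are filtered before download.
            yield scrapy.Request(link.url)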


@Granitosaurus thanks for the edit. A few remarks: with 'item['url'] = link.url', 'link' is now used before it is defined in the for loop. Or did you mean 'response.url'? Also, could you please explain why you default 'depth' to zero instead of 1? – cytopia