
scrapy recursive link crawler with login - help me improve

To the best of my current knowledge, I have written a small web spider/crawler that can recursively crawl to a variable nesting depth and can optionally do a POST/GET login before crawling (if required).

Since I am a complete beginner, I would like to get some feedback, suggestions for improvement, or anything else you want to throw at it.

I am only including the parse function here. The full source code can be viewed on GitHub: https://github.com/cytopia/crawlpy

What I mainly want to make sure of is that the recursion combined with yield is as efficient as possible, and that I am doing it the right way.

Any comments on the code and on coding style are very welcome.

def parse(self, response):
    """
    Scrapy parse callback
    """

    # Get current nesting level
    if response.meta.has_key('depth'):
        curr_depth = response.meta['depth']
    else:
        curr_depth = 1

    # Only crawl the current page if we hit a HTTP-200
    if response.status == 200:
        hxs = Selector(response)
        links = hxs.xpath("//a/@href").extract()

        # We store already crawled links in this list
        crawled_links = []

        # Pattern to check proper link
        linkPattern = re.compile("^(?:http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:

            # Link could be a relative url from response.url
            # such as link: '../test', response.url: http://dom.tld/foo/bar
            if link.find('../') == 0:
                link = response.url + '/' + link
            # Prepend BASE URL if it does not have it
            elif 'http://' not in link and 'https://' not in link:
                link = self.base_url + link

            # If it is a proper link and is not checked yet, yield it to the Spider
            if (link
                    and linkPattern.match(link)
                    and link.find(self.base_url) == 0):
                    #and link not in crawled_links
                    #and link not in uniques):

                # Check if this url already exists
                re_exists = re.compile('^' + link + '$')
                exists = False
                for i in self.uniques:
                    if re_exists.match(i):
                        exists = True
                        break

                if not exists:
                    # Store the link
                    crawled_links.append(link)
                    self.uniques.append(link)

                    # Do we recurse?
                    if curr_depth < self.depth:
                        request = Request(link, self.parse)
                        # Add meta-data about the current recursion depth
                        request.meta['depth'] = curr_depth + 1
                        yield request
                    else:
                        # Nesting level too deep
                        pass
            else:
                # Link not in condition
                pass

        #
        # Final return (yield) to user
        #
        for url in crawled_links:
            #print "FINAL FINAL FINAL URL: " + response.url
            item = CrawlpyItem()
            item['url'] = url
            item['depth'] = curr_depth

            yield item
        #print "FINAL FINAL FINAL URL: " + response.url
        #item = CrawlpyItem()
        #item['url'] = response.url
        #yield item
    else:
        # NOT HTTP 200
        pass

Answer


Your whole code could be shortened to something like this:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor


def parse(self, response):
    # Get current nesting level
    curr_depth = response.meta.get('depth', 1)

    item = CrawlpyItem()  # could also just be `item = dict()`
    item['url'] = response.url
    item['depth'] = curr_depth
    yield item

    links = LinkExtractor().extract_links(response)
    for link in links:
        yield Request(link.url, meta={'depth': curr_depth + 1})

If I understand correctly, what you want to do here is broadly crawl all URLs and yield the depth and URL as items, right?

Scrapy already has duplicate-request filtering enabled by default, so you don't need to implement that logic yourself. Also, your parse() method will never receive anything other than 200 responses, so that check is useless.
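If you ever do need to opt out of those defaults, Scrapy exposes both knobs explicitly: dont_filter=True on a single Request bypasses the duplicate filter, and a handle_httpstatus_list attribute on the spider lets selected non-200 responses reach parse(). Below is a minimal sketch of both; the spider name and URL are placeholders, not part of the original project.

import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate the two opt-outs mentioned above.
    name = 'example'
    start_urls = ['http://example.com/']

    # Let 404 responses reach parse() instead of being dropped
    # by the HttpError spider middleware.
    handle_httpstatus_list = [404]

    def parse(self, response):
        # Duplicate URLs are normally dropped by the dupefilter;
        # dont_filter=True forces this request through anyway.
        yield scrapy.Request(response.url, callback=self.parse_again,
                             dont_filter=True)

    def parse_again(self, response):
        yield {'url': response.url, 'status': response.status}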

Edit: reworked to avoid stupidity.
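A side note on the depth bookkeeping: Scrapy's built-in DepthMiddleware already records the crawl depth in request/response meta and can cap it via the DEPTH_LIMIT setting, so the manual counter could be dropped altogether. A minimal sketch of that variant, with an illustrative spider name, URL and limit:

import scrapy
from scrapy.linkextractors import LinkExtractor


class DepthLimitedSpider(scrapy.Spider):
    # Illustrative spider relying on the built-in DepthMiddleware.
    name = 'depth_limited'
    start_urls = ['http://example.com/']

    # DepthMiddleware stops scheduling requests beyond this depth.
    custom_settings = {'DEPTH_LIMIT': 3}

    def parse(self, response):
        # DepthMiddleware fills in meta['depth']; the start response defaults to 0.
        yield {'url': response.url, 'depth': response.meta.get('depth', 0)}

        for link in LinkExtractor().extract_links(response):
            # No manual depth counting; the middleware increments it per hop.
            yield scrapy.Request(link.url)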


Thanks for the simplification, that looks much cleaner. However, there are two problems with it: I get a lot of duplicate links (e.g. when saving to JSON), and links outside my allowed domain are also being stored. What am I overlooking? – cytopia


@cytopia Good catch! There was a huge flaw in the spider: it returned the URLs before actually downloading them, so scrapy's duplicate filter and allowed_domains filter were never applied. I've fixed that now! – Granitosaurus
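For context on the allowed-domains part: the off-site filtering only applies to requests that are actually scheduled. The OffsiteMiddleware checks each outgoing request against the spider's allowed_domains attribute and silently drops everything else, which is why yielding URLs as items skipped it entirely. A minimal sketch, with a placeholder domain and spider name:

import scrapy
from scrapy.linkextractors import LinkExtractor


class OnSiteSpider(scrapy.Spider):
    # Illustrative spider; OffsiteMiddleware drops scheduled requests
    # whose host is not covered by allowed_domains.
    name = 'onsite'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {'url': response.url}
        for link in LinkExtractor().extract_links(response):
            # Off-site links get yielded here but are filtered before download.
            yield scrapy.Request(link.url)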


@Granitosaurus thanks for the edit. A few remarks: with 'item['url'] = link.url', 'link' is now used before it is defined in the for loop. Or did you mean 'response.url'? Also, could you please explain why you default 'depth' to zero instead of 1? – cytopia