2013-09-25

Scrapy recursive link crawler

It starts from a URL on the web (for example http://python.org), fetches the page at that URL, and parses every link on that page into a link repository. Next it fetches the content of any URL from the repository it has just built, parses the links found in this new content into the repository, and keeps repeating this process for every link in the repository until it stops or a given number of links has been fetched.

How can I do this with Python and Scrapy? I can scrape all the links on a single page, but how do I perform this recursively, in depth?
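For the Scrapy side of the question, the recursive follow-the-links behaviour described above is roughly what Scrapy's CrawlSpider with a link-extraction Rule provides. A minimal sketch (the spider name, page limit, and output file are illustrative, not from the original post):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkSpider(CrawlSpider):
    name = "links"                      # illustrative spider name
    start_urls = ["http://python.org"]
    custom_settings = {
        "CLOSESPIDER_PAGECOUNT": 100,   # stop after a given number of pages
        # "DEPTH_LIMIT": 2,             # or cap the recursion depth instead
    }

    # Follow every link found on each fetched page and call parse_item on it.
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url}

Run it with something like scrapy runspider linkspider.py -o links.json; Scrapy keeps track of visited URLs and scheduling itself.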

Answers


A few remarks:

  • You do not need Scrapy for such a simple task. Urllib (or Requests) plus an HTML parser (Beautiful Soup, etc.) can do the job.
  • I do not remember where I heard it, but I think it is best to crawl with a BFS algorithm; that way you can easily avoid circular references (a rough sketch follows this list).
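As a rough illustration of that BFS idea, a minimal sketch might look like this (fetch_links is a hypothetical helper that returns the links found on a page, e.g. built with urllib or Requests plus Beautiful Soup):

from collections import deque

def bfs_crawl(start_url, fetch_links, max_pages=10):
    """Breadth-first crawl: fetch_links(url) returns the links found on that page."""
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return visited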

Below is a simple implementation: it does not fetch internal links (only absolute, fully formed hyperlinks), it has no error handling whatsoever (403, 404, no links, ...), and it is painfully slow (the multiprocessing module can help a lot in this case).

import BeautifulSoup
import urllib2
import itertools
import random


class Crawler(object):
    """Random-walk crawler over absolute links."""

    def __init__(self):
        self.soup = None                                # Beautiful Soup object
        self.current_page = "http://www.python.org/"    # Current page's address
        self.links = set()                              # Set with every link fetched
        self.visited_links = set()

        self.counter = 0                                # Simple counter for debug purposes

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link
        self.soup = BeautifulSoup.BeautifulSoup(html_code)

        page_links = []
        try:
            page_links = itertools.ifilter(             # Only deal with absolute links
                lambda href: href and 'http://' in href,
                (a.get('href') for a in self.soup.findAll('a')))
        except Exception:                               # Magnificent exception handling
            pass

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from the non-visited set (if any remain)
        remaining = self.links.difference(self.visited_links)
        if remaining:
            self.current_page = random.sample(remaining, 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 webpages (or stop once every fetched url has been visited)
        while len(self.visited_links) < 3 and self.current_page not in self.visited_links:
            self.open()

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()

Output:

In [48]: run BFScrawler.py 
0 : http://www.python.org/ 
1 : http://twistedmatrix.com/trac/ 
2 : http://www.flowroute.com/ 
http://www.egenix.com/files/python/mxODBC.html 
http://wiki.python.org/moin/PyQt 
http://wiki.python.org/moin/DatabaseProgramming/ 
http://wiki.python.org/moin/CgiScripts 
http://wiki.python.org/moin/WebProgramming 
http://trac.edgewall.org/ 
http://www.facebook.com/flowroute 
http://www.flowroute.com/ 
http://www.opensource.org/licenses/mit-license.php 
http://roundup.sourceforge.net/ 
http://www.zope.org/ 
http://www.linkedin.com/company/flowroute 
http://wiki.python.org/moin/TkInter 
http://pypi.python.org/pypi 
http://pycon.org/#calendar 
http://dyn.com/ 
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics 
http://www.pygame.org/news.html 
http://www.turbogears.org/ 
http://www.openbookproject.net/pybiblio/ 
http://wiki.python.org/moin/IntegratedDevelopmentEnvironments 
http://support.flowroute.com/forums 
http://www.pentangle.net/python/handbook/ 
http://dreamhost.com/?q=twisted 
http://www.vrplumber.com/py3d.py 
http://sourceforge.net/projects/mysql-python 
http://wiki.python.org/moin/GuiProgramming 
http://software-carpentry.org/ 
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics 
http://wiki.python.org/moin/WxPython 
http://wiki.python.org/moin/PythonXml 
http://www.pytennessee.org/ 
http://labs.twistedmatrix.com/ 
http://www.found.no/ 
http://www.prnewswire.com/news-releases/voip-innovator-flowroute-relocates-to-seattle-190011751.html 
http://www.timparkin.co.uk/ 
http://docs.python.org/howto/sockets.html 
http://blog.python.org/ 
http://docs.python.org/devguide/ 
http://www.djangoproject.com/ 
http://buildbot.net/trac 
http://docs.python.org/3/ 
http://www.prnewswire.com/news-releases/flowroute-joins-voxbones-inum-network-for-global-voip-calling-197319371.html 
http://www.psfmember.org 
http://docs.python.org/2/ 
http://wiki.python.org/moin/Languages 
http://sip-trunking.tmcnet.com/topics/enterprise-voip/articles/341902-grandstream-ip-voice-solutions-receive-flowroute-certification.htm 
http://www.twitter.com/flowroute 
http://wiki.python.org/moin/NumericAndScientific 
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics 
http://freecode.com/projects/pykyra 
http://www.xs4all.com/ 
http://blog.flowroute.com 
http://wiki.python.org/moin/PyGtk 
http://twistedmatrix.com/trac/ 
http://wiki.python.org/moin/ 
http://wiki.python.org/moin/Python2orPython3 
http://stackoverflow.com/questions/tagged/twisted 
http://www.pycon.org/ 
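As noted above, the multiprocessing module can speed up the slow fetching step. One hedged way to do that, reusing the same urllib2/BeautifulSoup stack (the function names, batch size, and page limit here are illustrative):

import multiprocessing
import urllib2
import BeautifulSoup


def fetch_links(url):
    """Download one page and return the absolute links it contains."""
    try:
        html = urllib2.urlopen(url, timeout=5).read()
        soup = BeautifulSoup.BeautifulSoup(html)
        return [a.get('href') for a in soup.findAll('a')
                if a.get('href') and a.get('href').startswith('http://')]
    except Exception:
        return []


def parallel_crawl(start_url, max_pages=30, workers=8):
    visited, frontier = set(), set([start_url])
    pool = multiprocessing.Pool(workers)
    while frontier and len(visited) < max_pages:
        batch = list(frontier)[:workers]            # one batch of unvisited urls
        visited.update(batch)
        frontier.difference_update(batch)
        for links in pool.map(fetch_links, batch):  # fetched in parallel
            frontier.update(set(links) - visited)
    pool.close()
    return visited


if __name__ == '__main__':
    for url in parallel_crawl("http://www.python.org/"):
        print url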

Thanks a ton @georgesl for saving me days of work. I was actually very confused about this recursive crawling. I am using this code in production; it would be helpful to show how it could be sped up by implementing the multiprocessing module. –


Here is the main crawl method, written to recursively scrape links from web pages. The method crawls a URL and puts all the scraped URLs into a buffer; multiple threads then wait to pop URLs from this global buffer and call the crawl method again.

def crawl(self, urlObj):
    '''Main function to crawl URLs'''
    try:
        if urlObj.valid and urlObj.url not in CRAWLED_URLS.keys():
            rsp = urlcon.urlopen(urlObj.url, timeout=2)
            hCode = rsp.read()
            soup = BeautifulSoup(hCode)
            links = self.scrap(soup)
            boolStatus = self.checkmax()
            if boolStatus:
                CRAWLED_URLS.setdefault(urlObj.url, "True")
            else:
                return
            for eachLink in links:
                if eachLink not in VISITED_URLS:
                    parsedURL = urlparse(eachLink)
                    if parsedURL.scheme and "javascript" in parsedURL.scheme:
                        #print("***************Javascript found in scheme " + str(eachLink) + "**************")
                        continue
                    # Handle internal URLs
                    try:
                        if not parsedURL.scheme and not parsedURL.netloc:
                            #print("No scheme and host found for " + str(eachLink))
                            newURL = urlunparse(parsedURL._replace(**{"scheme": urlObj.scheme, "netloc": urlObj.netloc}))
                            eachLink = newURL
                        elif not parsedURL.scheme:
                            #print("Scheme not found for " + str(eachLink))
                            newURL = urlunparse(parsedURL._replace(**{"scheme": urlObj.scheme}))
                            eachLink = newURL
                        if eachLink not in VISITED_URLS:  # Check again for internal URLs
                            #print(" Found child link " + eachLink)
                            CRAWL_BUFFER.append(eachLink)
                            with self._lock:
                                self.count += 1
                                #print(" Count is =================> " + str(self.count))
                            boolStatus = self.checkmax()
                            if boolStatus:
                                VISITED_URLS.setdefault(eachLink, "True")
                            else:
                                return
                    except TypeError:
                        print("Type error occurred")
        else:
            print("URL already present in visited " + str(urlObj.url))
    except socket.timeout as e:
        print("**************** Socket timeout occurred *******************")
    except URLError as e:
        if isinstance(e.reason, ConnectionRefusedError):
            print("**************** Connection refused error occurred *******************")
        elif isinstance(e.reason, socket.timeout):
            print("**************** Socket timed out error occurred ***************")
        elif isinstance(e.reason, OSError):
            print("**************** OS error occurred *************")
        elif isinstance(e, HTTPError):
            print("**************** HTTP error occurred *************")
        else:
            print("**************** URL error occurred ***************")
    except Exception as e:
        print("Unknown exception occurred while fetching HTML code " + str(e))
        traceback.print_exc()
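As a side note, the internal-URL handling above rebuilds absolute links by hand with urlparse._replace and urlunparse; the standard library's urljoin resolves a possibly-relative link against the page it came from and covers the scheme-less and host-less cases in one call (the example values below are illustrative):

try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

base = "http://www.python.org/about/"
print(urljoin(base, "/downloads/"))         # http://www.python.org/downloads/
print(urljoin(base, "apps.html"))           # http://www.python.org/about/apps.html
print(urljoin(base, "http://example.com"))  # absolute links pass through unchanged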

The complete source code and instructions are available at https://github.com/tarunbansal/crawler
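A minimal sketch of the worker pattern this answer describes, with several threads popping URLs from a shared buffer and pushing newly found links back in; the queue-based buffer and the names used here are illustrative, not taken from the linked repository:

import threading
try:
    import queue             # Python 3
except ImportError:
    import Queue as queue    # Python 2

url_buffer = queue.Queue()   # shared global buffer of URLs to crawl
visited = set()
visited_lock = threading.Lock()


def worker(crawl_one):
    """crawl_one(url) is assumed to return the links found on that page."""
    while True:
        url = url_buffer.get()
        with visited_lock:
            already_seen = url in visited
            visited.add(url)
        if not already_seen:
            for link in crawl_one(url):   # push newly found links back
                url_buffer.put(link)
        url_buffer.task_done()


def run(start_url, crawl_one, num_threads=4):
    url_buffer.put(start_url)
    for _ in range(num_threads):
        t = threading.Thread(target=worker, args=(crawl_one,))
        t.daemon = True
        t.start()
    url_buffer.join()   # a real crawler also needs a page/depth limit to terminate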


You should post the relevant code in your answer. – Bram


Only works with python3! – erdomester