
I am trying to run a Python web crawler that I found online at http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/ (NameError: name 'spider' is not defined)

However, I run into problems when running the example in the Python 3.5.2 Shell.

spider("http://www.dreamhost.com", "secure", 200) 給我的留言:
Traceback (most recent call last):
  File "", line 1, in 
    spider("http://www.dreamhost.com", "secure", 200)
NameError: name 'spider' is not defined

from html.parser import HTMLParser 
from urllib.request import urlopen 
from urllib import parse 

class LinkParser(HTMLParser): 

def handle_starttag(self, tag, attrs): 
    if tag == 'a': 
     for (key, value) in attrs: 
      if key == 'href': 
       newUrl = parse.urljoin(self.baseUrl, value) 
       self.links = self.links + [newUrl] 

def getLinks(self, url): 
    self.links = [] 
    self.baseUrl = url 
    response = urlopen(url) 
    if response.getheader('Content-Type')=='text/html': 
     htmlBytes = response.read() 
     htmlString = htmlBytes.decode("utf-8") 
     self.feed(htmlString) 
     return htmlString, self.links 
    else: 
     return "",[] 

def spider(url, word, maxPages): 
    pagesToVisit = [url] 
    numberVisited = 0 
    foundWord = False 
    while numberVisited < maxPages and pagesToVisit != [] and not  foundWord: 
    numberVisited = numberVisited +1 
    url = pagesToVisit[0] 
    pagesToVisit = pagesToVisit[1:] 
    try: 
     print(numberVisited, "Visiting:", url) 
     parser = LinkParser() 
     data, links = parser.getLinks(url) 
     if data.find(word)>-1: 
      foundWord = True 
     pagesToVisit = pagesToVisit + links 
     print(" **Success!**") 
    except: 
     print(" **Failed!**") 
if foundWord: 
    print("The word", word, "was found at", url) 
else: 
    print("Word never found") 

How do you run it? What are all the statements you typed into the REPL? –

Answer


Yourno,

Buddy, you have indentation problems in your code. After the class definition, the methods handle_starttag and getLinks are not indented under the class, and in the function spider the if-else part is missing indentation as well. Please check your code against the code posted at the link you provided.
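As a minimal illustration of the rule (a generic sketch, not part of the crawler), a method must be indented one level under its class statement, and the statements inside a def must be indented under that def:

class Greeter: 
    # Indented one level: greet() is a method of Greeter 
    def greet(self, name): 
        # Indented again: this line belongs to greet() 
        return "Hello, " + name 

# Back at column 0: main() is a module-level function, not a method 
def main(): 
    print(Greeter().greet("world")) 

main() 

With that in mind, please find the updated working code below: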

from html.parser import HTMLParser 
from urllib.request import urlopen 
from urllib import parse 

# We are going to create a class called LinkParser that inherits some 
# methods from HTMLParser which is why it is passed into the definition 
class LinkParser(HTMLParser): 

    # This is a function that HTMLParser normally has 
    # but we are adding some functionality to it 
    def handle_starttag(self, tag, attrs): 
     # We are looking for the beginning of a link. Links normally look 
     # like <a href="www.someurl.com"></a> 
     if tag == 'a': 
      for (key, value) in attrs: 
       if key == 'href': 
        # We are grabbing the new URL. We are also adding the 
        # base URL to it. For example: 
        # www.netinstructions.com is the base and 
        # somepage.html is the new URL (a relative URL) 
        # 
        # We combine a relative URL with the base URL to create 
        # an absolute URL like: 
        # www.netinstructions.com/somepage.html 
        newUrl = parse.urljoin(self.baseUrl, value) 
        # And add it to our collection of links: 
        self.links = self.links + [newUrl] 

    # This is a new function that we are creating to get links 
    # that our spider() function will call 
    def getLinks(self, url): 
     self.links = [] 
     # Remember the base URL which will be important when creating 
     # absolute URLs 
     self.baseUrl = url 
     # Use the urlopen function from the standard Python 3 library 
     response = urlopen(url) 
     # Make sure that we are looking at HTML and not other things that 
     # are floating around on the internet (such as 
     # JavaScript files, CSS, or .PDFs for example) 
     if response.getheader('Content-Type')=='text/html': 
      htmlBytes = response.read() 
      # Note that feed() handles Strings well, but not bytes 
      # (A change from Python 2.x to Python 3.x) 
      htmlString = htmlBytes.decode("utf-8") 
      self.feed(htmlString) 
      return htmlString, self.links 
     else: 
      return "",[] 

# And finally here is our spider. It takes in a URL, a word to find, 
# and the number of pages to search through before giving up 
def spider(url, word, maxPages): 
    pagesToVisit = [url] 
    numberVisited = 0 
    foundWord = False 
    # The main loop. Create a LinkParser and get all the links on the page. 
    # Also search the page for the word or string 
    # In our getLinks function we return the web page 
    # (this is useful for searching for the word) 
    # and we return a set of links from that web page 
    # (this is useful for where to go next) 
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord: 
     numberVisited = numberVisited + 1 
     # Start from the beginning of our collection of pages to visit: 
     url = pagesToVisit[0] 
     pagesToVisit = pagesToVisit[1:] 
     try: 
      print(numberVisited, "Visiting:", url) 
      parser = LinkParser() 
      data, links = parser.getLinks(url) 
      if data.find(word) > -1: 
       foundWord = True 
       foundAtUrl = url 
      # Added an else, so if the desired word is not found on this page, 
      # foundWord is set back to False 
      else: 
       foundWord = False 
      # Add the pages that we visited to the end of our collection 
      # of pages to visit (for every page, not just the matching ones): 
      pagesToVisit = pagesToVisit + links 
      print(" **Success!**") 
     except: 
      print(" **Failed!**") 
     # Moved this if-else block inside the while loop, so for every URL 
     # it reports whether the desired word was found or not 
     if foundWord: 
      print("The word", word, "was found at", foundAtUrl) 
     else: 
      print("Word never found") 

spider("http://www.dreamhost.com", "secure", 200) 

Please let me know if you still have any questions or queries.


Thank you for your response, Khan. Unfortunately, even with the "working code" you provided, I am still running into the original error. –


Hello @EmilioPagan-Yourno, yes, I understand that you are using the Python shell to run the code. I have updated my code above. It will definitely work. Let me know. –


Thank you so much for your help. I figured it out! –