
I am trying to run a Python web crawler that I found online at http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/ (NameError: name 'spider' is not defined)

However, I run into problems when running the example in the Python 3.5.2 Shell.

spider("http://www.dreamhost.com", "secure", 200) 給我的留言:
Traceback (most recent call last):
  File "", line 1, in 
    spider("http://www.dreamhost.com", "secure", 200)
NameError: name 'spider' is not defined

from html.parser import HTMLParser 
from urllib.request import urlopen 
from urllib import parse 

class LinkParser(HTMLParser): 

def handle_starttag(self, tag, attrs): 
    if tag == 'a': 
     for (key, value) in attrs: 
      if key == 'href': 
       newUrl = parse.urljoin(self.baseUrl, value) 
       self.links = self.links + [newUrl] 

def getLinks(self, url): 
    self.links = [] 
    self.baseUrl = url 
    response = urlopen(url) 
    if response.getheader('Content-Type')=='text/html': 
     htmlBytes = response.read() 
     htmlString = htmlBytes.decode("utf-8") 
     self.feed(htmlString) 
     return htmlString, self.links 
    else: 
     return "",[] 

def spider(url, word, maxPages): 
    pagesToVisit = [url] 
    numberVisited = 0 
    foundWord = False 
    while numberVisited < maxPages and pagesToVisit != [] and not  foundWord: 
    numberVisited = numberVisited +1 
    url = pagesToVisit[0] 
    pagesToVisit = pagesToVisit[1:] 
    try: 
     print(numberVisited, "Visiting:", url) 
     parser = LinkParser() 
     data, links = parser.getLinks(url) 
     if data.find(word)>-1: 
      foundWord = True 
     pagesToVisit = pagesToVisit + links 
     print(" **Success!**") 
    except: 
     print(" **Failed!**") 
if foundWord: 
    print("The word", word, "was found at", url) 
else: 
    print("Word never found") 

How do you run it? What are all the statements you typed into the REPL? –

Answer


Yourno,

Buddy, you have indentation problems in your code. After the class definition, the methods handle_starttag and getLinks are not indented under the class, and in the function spider the if-else part is missing indentation as well. Please check your code against the code posted at the link you provided.
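As a minimal illustration of the rule (a generic sketch, not part of the crawler), a method must be indented one level under its class statement, and the statements inside a def must be indented under that def:

class Greeter: 
    # Indented one level: greet() is a method of Greeter 
    def greet(self, name): 
        # Indented again: this line belongs to greet() 
        return "Hello, " + name 

# Back at column 0: main() is a module-level function, not a method 
def main(): 
    print(Greeter().greet("world")) 

main() 

With that in mind, please find the updated working code below: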

from html.parser import HTMLParser 
from urllib.request import urlopen 
from urllib import parse 

# We are going to create a class called LinkParser that inherits some 
# methods from HTMLParser which is why it is passed into the definition 
class LinkParser(HTMLParser): 

    # This is a function that HTMLParser normally has 
    # but we are adding some functionality to it 
    def handle_starttag(self, tag, attrs): 
     # We are looking for the beginning of a link. Links normally look 
     # like <a href="www.someurl.com"></a> 
     if tag == 'a': 
      for (key, value) in attrs: 
       if key == 'href': 
        # We are grabbing the new URL. We are also adding the 
        # base URL to it. For example: 
        # www.netinstructions.com is the base and 
        # somepage.html is the new URL (a relative URL) 
        # 
        # We combine a relative URL with the base URL to create 
        # an absolute URL like: 
        # www.netinstructions.com/somepage.html 
        newUrl = parse.urljoin(self.baseUrl, value) 
        # And add it to our collection of links: 
        self.links = self.links + [newUrl] 

    # This is a new function that we are creating to get links 
    # that our spider() function will call 
    def getLinks(self, url): 
     self.links = [] 
     # Remember the base URL which will be important when creating 
     # absolute URLs 
     self.baseUrl = url 
     # Use the urlopen function from the standard Python 3 library 
     response = urlopen(url) 
     # Make sure that we are looking at HTML and not other things that 
     # are floating around on the internet (such as 
     # JavaScript files, CSS, or .PDFs for example) 
     if response.getheader('Content-Type')=='text/html': 
      htmlBytes = response.read() 
      # Note that feed() handles Strings well, but not bytes 
      # (A change from Python 2.x to Python 3.x) 
      htmlString = htmlBytes.decode("utf-8") 
      self.feed(htmlString) 
      return htmlString, self.links 
     else: 
      return "",[] 

# And finally here is our spider. It takes in a URL, a word to find, 
# and the number of pages to search through before giving up 
def spider(url, word, maxPages): 
    pagesToVisit = [url] 
    numberVisited = 0 
    foundWord = False 
    # The main loop. Create a LinkParser and get all the links on the page. 
    # Also search the page for the word or string 
    # In our getLinks function we return the web page 
    # (this is useful for searching for the word) 
    # and we return a set of links from that web page 
    # (this is useful for where to go next) 
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord: 
     numberVisited = numberVisited + 1 
     # Start from the beginning of our collection of pages to visit: 
     url = pagesToVisit[0] 
     pagesToVisit = pagesToVisit[1:] 
     try: 
      print(numberVisited, "Visiting:", url) 
      parser = LinkParser() 
      data, links = parser.getLinks(url) 
      if data.find(word) > -1: 
       foundWord = True 
       foundAtUrl = url 
      # Added an else, so if the desired word is not found on this page, 
      # foundWord is set back to False 
      else: 
       foundWord = False 
      # Add the pages that we visited to the end of our collection 
      # of pages to visit (for every page, not just the matching ones): 
      pagesToVisit = pagesToVisit + links 
      print(" **Success!**") 
     except: 
      print(" **Failed!**") 
     # Moved this if-else block inside the while loop, so for every URL 
     # it reports whether the desired word was found or not 
     if foundWord: 
      print("The word", word, "was found at", foundAtUrl) 
     else: 
      print("Word never found") 

spider("http://www.dreamhost.com", "secure", 200) 

Please let me know if you still have any questions or queries.


Thank you for your response, Khan. Unfortunately, even with the "working code" you provided, I am still running into the original error. –


Hello @EmilioPagan-Yourno, yes, I understand that you are using the Python shell to run the code. I have updated my code above. It will definitely work. Let me know. –


Thank you so much for your help. I figured it out! –