2012-08-16

I'm having trouble building a basic spider program in Python. Whenever I try to run it, I get an error. The error happens somewhere in the last seven lines of code, and the spider won't run.

#These modules do most of the work.
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO


def log_stdout(msg):
    """Print msg to the screen."""
    print msg

def get_page(url, log):
    """Retrieve URL and return its contents, logging errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body

def find_links(html):
    """Return a list of the links found in html."""
    # We're using the parser just to get the HREFs
    writer = formatter.DumbWriter(StringIO())
    f = formatter.AbstractFormatter(writer)
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist
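As a side note, the `htmllib` and `formatter` modules were removed in Python 3; the same anchor-collecting behaviour can be sketched there with `html.parser` (the class name below is my own, chosen to mirror `anchorlist`):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, like htmllib's anchorlist."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.anchorlist.append(value)

def find_links(html):
    """Return a list of the links found in html."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.anchorlist
```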

class Spider:

    """
    The heart of this program; finds all links within a web site.

    run() contains the main loop.
    process_page() retrieves each page and finds the links.
    """

    def __init__(self, startURL, log=None):
        # This method sets the initial values
        self.URLs = set()
        self.URLs.add(startURL)
        self.include = startURL
        self._links_to_process = [startURL]
        if log is None:
            # Use the log_stdout function if no log is provided
            self.log = log_stdout
        else:
            self.log = log

    def run(self):
        # Processes the list of URLs one at a time
        while self._links_to_process:
            url = self._links_to_process.pop()
            self.log("Retrieving: " + url)
            self.process_page(url)

    def url_in_site(self, link):
        # Checks whether the link starts with the base URL
        return link.startswith(self.include)

    def process_page(self, url):
        # Retrieves the page and finds the links in it
        html = get_page(url, self.log)
        for link in find_links(html):
            # Handle relative links
            link = urlparse.urljoin(url, link)
            self.log("Checking: " + link)
            # Make sure this is a new URL within the current site
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)
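The `urljoin` call above is what resolves relative hrefs against the page they were found on; a quick illustration with a made-up base URL (in Python 3 the function moved to `urllib.parse`, the behaviour is the same):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = 'http://example.com/docs/index.html'

# A relative path is resolved against the directory of the base URL.
print(urljoin(base, 'intro.html'))        # http://example.com/docs/intro.html
# A leading slash resets to the site root.
print(urljoin(base, '/about'))            # http://example.com/about
# An absolute URL is returned unchanged.
print(urljoin(base, 'http://other.example/page'))  # http://other.example/page
```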

The error message points at this block of code:

if __name__ == '__main__':
    # This code runs when the script is started from the command line
    startURL = sys.argv[1]
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL


The error message: 
     startURL = sys.argv[1] 
    IndexError: list index out of range 

Answers

3

You're not calling your spider program with an argument. sys.argv[0] is your script file, and sys.argv[1] would be the first argument you pass to it. "list index out of range" means you didn't give it any arguments.

Try calling it as python spider.py http://www.example.com (with your actual URL).
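If you'd rather fail with a friendly message than a traceback, a minimal guard could look like this (the helper name is my own, and it's written in version-neutral syntax, while the question's code is Python 2):

```python
import sys

def get_start_url(argv):
    """Return the first real command-line argument, or None if missing."""
    # argv[0] is always the script name itself, so the URL is argv[1];
    # indexing argv[1] directly raises IndexError when no URL was passed.
    if len(argv) < 2:
        return None
    return argv[1]

# With no argument, the caller can print a usage line instead of crashing:
if get_start_url(['spider.py']) is None:
    sys.stderr.write("Usage: python spider.py <start-url>\n")
```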

0

This doesn't directly answer your question, but:

I would go with something like:

import lxml.html

START_PAGE = 'http://some.url.tld'
ahrefs = lxml.html.parse(START_PAGE).xpath('//a/@href')

Then use the methods available on the lxml.html objects and multiprocess the links.

This handles "semi-formed" HTML, and you can plug in the BeautifulSoup parser as well.

If you want to try to follow JavaScript-generated links, it takes a bit more work, but that's life!