Python web crawler, depth problem

2013-08-02

I am following this tutorial on building a web crawler in Python. I have managed to get my code up and running, but the problem I am running into, which does not happen in the video, is that when I increase the number in the print scraper(url,7) call at the end of the code to 8 or more, I get the following error in my shell:

Traceback (most recent call last): 
File "<pyshell#30>", line 1, in <module> 
    execfile("threads/mechanizex.py") 
File "threads/mechanizex.py", line 85, in <module> 
    print scraper(url,7) 
File "threads/mechanizex.py", line 21, in scraper 
    for u in step_url: 
TypeError: 'NoneType' object is not iterable 
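
The error itself is easy to reproduce in isolation: iterating over None raises exactly this TypeError. A minimal sketch, using a hypothetical value rather than the real crawler:

    step_url = None      # what a function returns when it never reaches a return statement
    for u in step_url:   # iterating over None raises the TypeError above
        print u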

And I have no idea what my problem is, because I have exactly the same code as the author of the video; he increases his number up to 13 and gets the resulting links, while I cannot go above 7. This is my entire code:

import urllib 
import re 
import time 
from threading import Thread 
import MySQLdb 
import mechanize 
import readability 
from bs4 import BeautifulSoup 
from readability.readability import Document 
import urlparse 

url = "http://adbnews.com/area51" 

def scraper(root,steps): 
    urls = [root] 
    visited = [root] 
    counter = 0 
    while counter < steps: 
        step_url = scrapeStep(urls) 
        urls = [] 
        for u in step_url: 
            if u not in visited: 
                urls.append(u) 
                visited.append(u) 
        counter += 1 

    return visited 

def scrapeStep(root): 
    result_urls = [] 
    br = mechanize.Browser() 
    br.set_handle_robots(False) 
    br.addheaders = [('User-agent', 'Firefox')] 

    for url in root: 
        try: 
            br.open(url) 
            for link in br.links(): 
                newurl = urlparse.urljoin(link.base_url, link.url) 
                result_urls.append(newurl) 
        except: 
            print "error" 
        return result_urls 

d = {} 
threadlist = [] 

def getReadableArticle(url): 
    br = mechanize.Browser() 
    br.set_handle_robots(False) 
    br.addheaders = [('User-agent', 'Firefox')] 

    html = br.open(url).read() 

    readable_article = Document(html).summary() 
    readable_title = Document(html).short_title() 

    soup = BeautifulSoup(readable_article) 

    final_article = soup.text 

    links = soup.findAll('img', src=True) 

    return readable_title 
    return final_article 

def dungalo(urls): 
    article_text = getReadableArticle(urls)[0] 
    d[urls] = article_text 

def getMultiHtml(urlsList): 
    for urlsl in urlsList: 
        try: 
            t = Thread(target=dungalo, args=(urlsl,)) 
            threadlist.append(t) 
            t.start() 
        except: 
            nnn = True 

    for g in threadlist: 
        g.join() 

    return d 

print scraper(url,7) 

Can anyone help?

Answer


Your indentation is wrong. It should be:

def scrapeStep(root): 
    result_urls = [] 
    br = mechanize.Browser() 
    br.set_handle_robots(False) 
    br.addheaders = [('User-agent', 'Firefox')] 

    for url in root: 
        try: 
            br.open(url) 
            for link in br.links(): 
                newurl = urlparse.urljoin(link.base_url, link.url) 
                result_urls.append(newurl) 
        except: 
            print "error" 

    return result_urls 

Otherwise it only processes the first URL it is given, and when it is handed an empty list it returns None, because the for loop body never runs and the function falls through without reaching a return statement. That implicit None is what the for u in step_url: loop in scraper then tries to iterate, which raises the TypeError above.
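
The fall-through behaviour is easy to demonstrate on its own. A minimal sketch with a hypothetical function, not taken from the code above:

    def first_item(items): 
        for x in items: 
            return x    # returns inside the loop, on the very first item 
        # an empty iterable skips the loop body entirely, so execution 
        # falls off the end of the function and Python returns None 

    print first_item([1, 2, 3])   # prints 1 
    print first_item([])          # prints None 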


Yes, that was it, I had not noticed the indentation error.... Thanks! :) – dzordz
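
As an aside, scraper itself can be made more defensive, so that a step which finds nothing cannot crash the crawl. This is a sketch of one possible hardening, not part of the accepted fix; the or [] guard and the seen set are additions of mine:

    def scraper(root, steps): 
        visited = [root]        # keeps discovery order, like the original 
        seen = set(visited)     # set membership tests are O(1), a list's are O(n) 
        urls = [root] 
        for _ in range(steps): 
            step_url = scrapeStep(urls) or []   # treat None/empty as "no new links" 
            urls = [] 
            for u in step_url: 
                if u not in seen: 
                    urls.append(u) 
                    seen.add(u) 
                    visited.append(u) 
        return visited 

With that guard in place the depth argument can be raised freely; a site that runs out of pages simply stops producing new URLs instead of returning None.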