2013-07-20 98 views
1

I'm using a recursive function to scrape all the URLs of my domain. But it gives no output, and no errors either. The recursive function produces no output:

#!/usr/bin/python 

from bs4 import BeautifulSoup 
import requests 
import tldextract 


def scrap(url): 

    for links in url: 
        main_domain = tldextract.extract(links) 
        r = requests.get(links) 
        data = r.text 
        soup = BeautifulSoup(data) 

        for href in soup.find_all('a'): 
            href = href.get('href') 
            if not href: 
                continue 
            link_domain = tldextract.extract(href) 

            if link_domain.domain == main_domain.domain: 
                problem.append(href) 

            elif not href == '#' and link_domain.tld == '': 
                new = 'http://www.' + main_domain.domain + '.' + main_domain.tld + '/' + href 
                problem.append(new) 

        return len(problem) 
        return scrap(problem) 


problem = ["http://xyzdomain.com"] 
print(scrap(problem)) 

It works when I create a new list each time, but I don't want to create a list for every loop.

+2

Please don't call your list `list`. That's the name of a built-in: http://docs.python.org/2/library/functions.html – NPE

+0

Yes, I changed it, but there is still no output. – Alisha

+0

What happens when you put some debugging output into your `scrap()` [*sic.*] function? – Johnsyweb

Answers

0

You need to structure your code so that it fits the pattern for recursion, which your current code doesn't. You also shouldn't reuse a name for two different things, e.g. `href = href.get('href')`, because once the tag is rebound to the string you can no longer tell which is which. As it stands, your code will only ever return `len(problem)`, because that `return` is reached unconditionally before `return scrap(problem)`. The recursive pattern looks like:

def Recursive(Factorable_problem): 
    if Factorable_problem is Simplest_Case: 
        return AnswerToSimplestCase 
    else: 
        return Rule_For_Generating_From_Simpler_Case(Recursive(Simpler_Case)) 

例如:

def Factorial(n): 
    """ Recursively Generate Factorials """ 
    if n < 2: 
        return 1 
    else: 
        return n * Factorial(n-1) 
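To make the asker's `scrap()` fit this pattern, the two consecutive `return` statements have to collapse into one base case and one recursive case. Below is a minimal sketch of that shape using a hypothetical in-memory link graph (`LINK_GRAPH`) in place of real `requests`/`BeautifulSoup` calls, so only the recursion is shown:

```python
# Hypothetical stand-in for fetching a page and reading its <a href> links.
LINK_GRAPH = {
    "http://xyzdomain.com": ["http://xyzdomain.com/a", "http://other.com"],
    "http://xyzdomain.com/a": ["http://xyzdomain.com/b"],
    "http://xyzdomain.com/b": [],
}

def scrap(to_visit, seen=None):
    seen = seen or set()
    # Simplest case: nothing left to visit -> return everything found.
    if not to_visit:
        return sorted(seen)
    # Otherwise: visit one page, queue its unseen same-domain links,
    # and recurse on the smaller remaining problem.
    url = to_visit[0]
    seen.add(url)
    new_links = [l for l in LINK_GRAPH.get(url, [])
                 if "xyzdomain.com" in l and l not in seen]
    return scrap(to_visit[1:] + new_links, seen)

print(scrap(["http://xyzdomain.com"]))
```

There is exactly one base case (`to_visit` is empty) and one recursive case, so the function always terminates with a value instead of returning early.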
0

Hello, I made a non-recursive version of this which seems to get all the links on the same domain.

I tested the code below using the `problem` list included in the code. Once I'd solved the original problem, the next issue with the recursive version was hitting the recursion depth limit, so I rewrote it to run iteratively. The code and results are as follows:

from bs4 import BeautifulSoup 
import requests 
import tldextract 


def print_domain_info(d): 
    print "Main Domain:{0} \nSub Domain:{1} \nSuffix:{2}".format(d.domain, d.subdomain, d.suffix) 

SEARCHED_URLS = [] 
problem = ["http://Noelkd.neocities.org/", "http://youpi.neocities.org/"] 
while problem: 
    # Get a link from the stack of links 
    link = problem.pop() 
    # Check we haven't been to this address before 
    if link in SEARCHED_URLS: 
        continue 
    # We don't want to come back here again after this point 
    SEARCHED_URLS.append(link) 
    # Try and get the website 
    try: 
        req = requests.get(link) 
    except: 
        # If it's not working I don't care for it 
        print "Borked website found: {0}".format(link) 
        continue 
    # Now we've got to this point, it's worth printing something 
    print "Trying to parse: {0}".format(link) 
    print "Status Code: {0} That's: {1}".format(req.status_code, "A-OK" if req.status_code == 200 else "SOMETHING'S UP") 
    # Get the domain info 
    dInfo = tldextract.extract(link) 
    print_domain_info(dInfo) 
    # I like utf-8 
    data = req.text.encode("utf-8") 
    print "Length Of Data Retrieved: {0}".format(len(data))  # More info 
    soup = BeautifulSoup(data)  # This was here before so I left it. 
    print "Found {0} link{1}".format(len(soup.find_all('a')), "s" if len(soup.find_all('a')) > 1 else "") 
    FOUND_THIS_ITERATION = []  # Getting the same links over and over was boring 
    found_links = [x for x in soup.find_all('a') if x.get('href') not in SEARCHED_URLS]  # Find me all the links I haven't got 
    for href in found_links: 
        href = href.get('href')  # You wrote this, seems to work well 
        if not href: 
            continue 
        link_domain = tldextract.extract(href) 
        if link_domain.domain == dInfo.domain:  # JUST FINDING STUFF ON SAME DOMAIN RIGHT?! 
            if href not in FOUND_THIS_ITERATION:  # I'ma check you out next time 
                print "Check out this link: {0}".format(href) 
                print_domain_info(link_domain) 
                FOUND_THIS_ITERATION.append(href) 
                problem.append(href) 
            else:  # I got you already 
                print "DUPE LINK!" 
        else: 
            print "Not on same domain, moving on" 

    # Count down 
    print "We have {0} more sites to search".format(len(problem)) 
    if problem: 
        continue 
    else: 
        print "It's been fun" 
        print "Let's see the URLs we've visited:" 
        for url in SEARCHED_URLS: 
            print url 

Which, after lots of other logging, prints a long list of the sites it visited!

What's happening is that the script pops a value off the list of unvisited sites, then fetches every link on that page that is on the same domain. If a link points to a page we haven't visited, it is added to the list of links to visit. After that, we pop the next page and do the same again, until there are no pages left to visit.

I think this is what you're looking for. If it doesn't do what you want, or anyone can improve it, please leave a comment and get back to us.

+0

Thanks for your help. (y) Unfortunately, when I run your code it gives me some errors - [link](http://prntscr.com/1gpvbx) (I've linked an image). I'm new to Python, and after seeing your code I can see many of my mistakes. – Alisha

+0

Are you still having this problem? Not sure why you would be - if you update the question with the code you're using, I'll take a look at it. – Noelkd