How to recursively find all links from a webpage with BeautifulSoup?

I have been trying to use some code I found in this answer to recursively find all the links from a given URL:

import urllib2 
from bs4 import BeautifulSoup 

url = "http://francaisauthentique.libsyn.com/" 

def recursiveUrl(url, depth):

    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a')  # find just the first one
        if len(newlink) == 0:
            return url
        else:
            return url, recursiveUrl(newlink, depth + 1)


def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(link, 0))
    return links

links = getLinks(url) 
print(links)

Apart from this warning:

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. 

The code that caused this warning is on line 28 of the file downloader.py. To get rid of this warning, change code that looks like this: 

BeautifulSoup(YOUR_MARKUP}) 

to this: 

BeautifulSoup(YOUR_MARKUP, "lxml")

I get the following error:

Traceback (most recent call last): 
    File "downloader.py", line 28, in <module> 
    links = getLinks(url) 
    File "downloader.py", line 25, in getLinks 
    links.append(recursiveUrl(link,0)) 
    File "downloader.py", line 11, in recursiveUrl 
    page=urllib2.urlopen(url) 
    File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen 
    return _opener.open(url, data, timeout) 
    File "/usr/lib/python2.7/urllib2.py", line 396, in open 
    protocol = req.get_type() 
TypeError: 'NoneType' object is not callable

What is the problem?

Source

2017-10-08 Alex

I think you are passing a BeautifulSoup object to `urlopen` rather than a URL. Try something like `link['href']`, but make sure to check that the attribute exists first. – Thomas
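To make the suggestion above concrete, here is a minimal sketch of reading `href` safely. The markup is a made-up string used only for illustration, not taken from the site in the question. As far as I can tell, the original `TypeError` arises because `urlopen` treats a non-string argument as a request object and calls `req.get_type()` on it, and a BeautifulSoup tag answers unknown attribute lookups by searching for a child tag of that name, which yields `None` here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<a href="/webpage/2017">Archive</a><a name="top">No target</a>'
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for link in soup.find_all("a"):
    href = link.get("href")  # returns None when the attribute is absent
    if href is not None:
        hrefs.append(href)   # a plain string, suitable for building a URL

print(hrefs)
```

Passing `href=True` to `find_all` is another way to skip anchors that have no `href` at all.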

Thanks Thomas, but now I get the error "ValueError: unknown url type: /webpage/category/general". Maybe because it is a relative link rather than an absolute one? – Alex

@Alex Correct :) –
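To illustrate the relative-versus-absolute point, here is a small sketch using `urljoin` from the standard library, which resolves an `href` against the URL of the page it was found on. This uses the Python 3 import path; in Python 2, which the question uses, the same function lives in the `urlparse` module. The second and third `href` values are invented for the example.

```python
from urllib.parse import urljoin

base = "http://francaisauthentique.libsyn.com/"

# A root-relative href, as in the error message from the thread.
print(urljoin(base, "/webpage/category/general"))

# A path-relative href (hypothetical value).
print(urljoin(base, "webpage/2017"))

# An href that is already absolute is returned unchanged (hypothetical value).
print(urljoin(base, "http://example.com/other"))
```

Unlike plain string concatenation, this does not produce a doubled slash when the base ends in `/` and the `href` starts with one.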

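Your recursiveUrl tries to open an invalid URL such as /webpage/category/general, which is the value you extracted from one of the href attributes.

You should append the extracted href value to the website's URL and then try to open the page. You will need to work on the recursion itself, because I don't know what you want to achieve.

Code: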

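import requests
from bs4 import BeautifulSoup

def recursiveUrl(url, link, depth):
    # Stop after following five links in a row.
    if depth == 5:
        return url
    else:
        print(link['href'])
        # Prepend the site URL, since the href values on this site are relative.
        page = requests.get(url + link['href'])
        soup = BeautifulSoup(page.text, 'html.parser')
        # find() returns None when nothing matches, so test for None rather than len().
        newlink = soup.find('a', href=True)
        if newlink is None:
            return link
        else:
            return link, recursiveUrl(url, newlink, depth + 1)

def getLinks(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Only keep anchors that actually carry an href attribute.
    links = soup.find_all('a', href=True)
    # Collect into a separate list: appending to links while looping
    # over it would make the loop pick up the appended results too.
    results = []
    for link in links:
        results.append(recursiveUrl(url, link, 0))
    return results

links = getLinks("http://francaisauthentique.libsyn.com/")
print(links)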

Output:

http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/10 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/09 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/08 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/07 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general
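The doubled slash in the output above comes from joining a base URL that ends in `/` with an `href` that starts with `/`, and the same category page shows up repeatedly because each recursion step follows only the first link on a page. If the goal is to collect every link reachable within a few hops, one possible approach (a sketch, not necessarily what the asker wants) is a breadth-first traversal with a set of already-seen URLs. The names `crawl`, `fetch_hrefs`, `get_hrefs` and `fake_site` are my own, and limiting the crawl to URLs that start with the start URL is an assumption. The fetching step is passed in as a function so the traversal can be tried without network access.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


def fetch_hrefs(url):
    """Return the raw href values found on a page (needs requests and bs4)."""
    import requests
    from bs4 import BeautifulSoup

    page = requests.get(url, timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def crawl(start_url, max_depth=2, get_hrefs=fetch_hrefs):
    """Collect the URLs reachable from start_url within max_depth link hops."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are recorded but not fetched
        for href in get_hrefs(url):
            # Resolve relative links and drop any "#fragment" part.
            absolute = urldefrag(urljoin(url, href)).url
            # Assumption: stay on the starting site and skip known URLs.
            if absolute.startswith(start_url) and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return sorted(seen)


# Hypothetical in-memory site, so the traversal can be tried offline.
fake_site = {
    "http://example.com/": ["/a", "/b", "http://other.example/x"],
    "http://example.com/a": ["/b", "/", "#top"],
    "http://example.com/b": ["/c"],
}
found = crawl("http://example.com/", max_depth=2,
              get_hrefs=lambda u: fake_site.get(u, []))
print(found)
```

Calling `crawl("http://francaisauthentique.libsyn.com/")` with the default `get_hrefs` would fetch real pages, so it is worth keeping `max_depth` small.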

Source

2017-10-08 17:53:40 Ali
