I wrote a Python script that fetches the web page at a given URL and parses every link on that page into a link repository. It then fetches the content of any URL from the repository it just built, parses the links out of that new content into the repository, and keeps repeating this process for the links in the repository until it is stopped or a given number of links has been fetched. I want to scrape the internal links with Beautiful Soup.
Here is the code:
import BeautifulSoup
import urllib2
import itertools
import random


class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):
        self.soup = None                              # Beautiful Soup object
        self.current_page = "http://www.python.org/"  # Current page's address
        self.links = set()                            # Set of every link fetched so far
        self.visited_links = set()
        self.counter = 0                              # Simple counter for debug purposes

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link
        self.soup = BeautifulSoup.BeautifulSoup(html_code)
        page_links = []
        try:
            page_links = itertools.ifilter(  # Only deal with absolute links
                lambda href: 'http://' in href,
                (a.get('href') for a in self.soup.findAll('a')))
        except Exception:  # Magnificent exception handling
            pass

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from the non-visited set
        self.current_page = random.sample(self.links.difference(self.visited_links), 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 webpages (or stop once every URL has been fetched)
        while len(self.visited_links) < 3 or (self.visited_links == self.links):
            self.open()

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()
This code does not fetch internal links (only fully-qualified absolute hyperlinks).
How can I also fetch internal links that begin with "/" or "#" (i.e., by changing the lambda)?
I already know about the suggestion you gave me. I tried it, but it could not handle all the internal links, so could you make the change in the code and tell me what I have to do –
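One way to also capture relative links such as "/about/" or "#content" (a minimal sketch, not a tested drop-in fix) is to resolve every href against the current page's URL with urlparse.urljoin, instead of filtering on whether the raw href contains 'http://'. The helper name collect_links below is my own; it only relies on the BeautifulSoup 3 API already used in the question plus the standard urlparse module:

import urlparse

def collect_links(soup, base_url):
    """Return a set of absolute URLs for every <a href> on the page.

    Relative hrefs such as '/download' or '#content' are resolved
    against base_url, so internal links end up in the same set as
    the absolute ones.
    """
    links = set()
    for a in soup.findAll('a', href=True):
        href = a['href'].strip()
        # Skip empty and non-navigational hrefs
        if not href or href.startswith('mailto:') or href.startswith('javascript:'):
            continue
        absolute = urlparse.urljoin(base_url, href)
        # Drop the fragment so 'page#top' and 'page' count as the same URL
        absolute, _fragment = urlparse.urldefrag(absolute)
        if absolute.startswith('http://') or absolute.startswith('https://'):
            links.add(absolute)
    return links

Inside open(), the itertools.ifilter/lambda block would then become something like page_links = collect_links(self.soup, self.current_page), with the rest of the class left unchanged.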