
I wrote some Python code that fetches the web page corresponding to a given URL and parses all the links on that page into a repository of links. Next, it fetches the content of any URL from the repository it has just built, parses the links out of that new content into the repository, and keeps repeating this process for every link in the repository until it is stopped or a given number of links has been fetched. In other words, I am trying to scrape internal links with Beautiful Soup.

Here is the code:

import BeautifulSoup 
import urllib2 
import itertools 
import random 


class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):
        self.soup = None                                # Beautiful Soup object
        self.current_page = "http://www.python.org/"   # Current page's address
        self.links = set()                              # Queue with every link fetched
        self.visited_links = set()

        self.counter = 0  # Simple counter for debug purposes

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link
        self.soup = BeautifulSoup.BeautifulSoup(html_code)

        page_links = []
        try:
            page_links = itertools.ifilter(  # Only deal with absolute links
                lambda href: 'http://' in href,
                (a.get('href') for a in self.soup.findAll('a')))
        except Exception:  # Magnificent exception handling
            pass

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from the non-visited set
        self.current_page = random.sample(self.links.difference(self.visited_links), 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 webpages (or stop if every url has been fetched)
        while len(self.visited_links) < 3 or (self.visited_links == self.links):
            self.open()

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()

This code does not fetch internal links (only fully formed absolute hyperlinks).

How can I fetch internal links that start with "/", "#", or "." in the lambda?

Answers


Well, your code sort of already tells you what's going on. In your lambda you only grab absolute links that start with http:// (and you are not grabbing https, FWIW). You should grab all of the links and check whether or not they start with http. If they don't, they are relative links, and since you know what current_page is, you can use it to build an absolute link.
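For reference, the standard library can do that resolution in one call: urlparse.urljoin joins an href onto the page it was found on and already handles root-relative ("/..."), fragment ("#..."), and path-relative ("./...") cases, while leaving absolute URLs untouched. A minimal sketch (this assumes the same Python 2 environment as the question; the example URLs are made up):

import urlparse

current_page = "http://www.python.org/about/"   # page the hrefs were scraped from

for href in ["http://docs.python.org/", "/downloads/", "#content", "./apps/"]:
    # urljoin resolves relative hrefs against current_page
    # and returns absolute hrefs unchanged
    print urlparse.urljoin(current_page, href)

# Prints:
# http://docs.python.org/
# http://www.python.org/downloads/
# http://www.python.org/about/#content
# http://www.python.org/about/apps/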

Here is a modification of your code. Forgive my Python, it's a little rusty, but I ran it and it works on Python 2.7. You will want to clean it up and add some edge-case/error checking, but you get the gist:

#!/usr/bin/python 

from bs4 import BeautifulSoup 
import urllib2 
import itertools 
import random 
import urlparse 


class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):
        self.soup = None                                # Beautiful Soup object
        self.current_page = "http://www.python.org/"   # Current page's address
        self.links = set()                              # Queue with every link fetched
        self.visited_links = set()

        self.counter = 0  # Simple counter for debug purposes

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link
        self.soup = BeautifulSoup(html_code)

        page_links = []
        try:
            for link in [h.get('href') for h in self.soup.find_all('a')]:
                print "Found link: '" + link + "'"
                if link.startswith('http'):
                    page_links.append(link)
                    print "Adding link " + link + "\n"
                elif link.startswith('/'):
                    parts = urlparse.urlparse(self.current_page)
                    page_links.append(parts.scheme + '://' + parts.netloc + link)
                    print "Adding link " + parts.scheme + '://' + parts.netloc + link + "\n"
                else:
                    page_links.append(self.current_page + link)
                    print "Adding link " + self.current_page + link + "\n"

        except Exception, ex:  # Magnificent exception handling
            print ex

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from the non-visited set
        self.current_page = random.sample(self.links.difference(self.visited_links), 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 webpages (or stop if every url has been fetched)
        while len(self.visited_links) < 3 or (self.visited_links == self.links):
            self.open()

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()

I already know what you are suggesting. I tried it, but I could not handle all the internal links, so could you make the change in the code and tell me what I have to do –


Change the condition:

page_links = itertools.ifilter(  # Keep absolute links and links starting with '/', '#' or '.'
    lambda href: 'http://' in href or href.startswith('/') or href.startswith('#') or href.startswith('.'),
    (a.get('href') for a in self.soup.findAll('a')))
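Note that widening the filter only keeps the relative hrefs in page_links; they still have to be turned into absolute URLs before urllib2.urlopen can fetch them. A hedged sketch of how the link-gathering step inside open() could do both at once with urlparse.urljoin, assuming the same self.current_page and self.soup attributes as in the question's code (urljoin returns absolute hrefs unchanged, so the explicit 'http://' test is no longer needed; mailto: or javascript: links would still have to be filtered out separately if a page contains them):

import urlparse

# Resolve every href against the current page, so relative and absolute
# links all end up in self.links as absolute URLs. <a> tags without an
# href attribute (href is None) are skipped.
page_links = [urlparse.urljoin(self.current_page, href)
              for href in (a.get('href') for a in self.soup.findAll('a'))
              if href]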

I know I have to change the condition, but what should it be so that it can handle all types of internal links? –