2010-11-10 36 views
2

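I want to create a simple web crawler for fun. I need the crawler to get a list of all the links on a page. Does the Python library have any built-in functions that would make this easier? Thanks, any knowledge is appreciated. What is a simple way to extract the list of URLs on a webpage using Python?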

Answers

7

It's actually very simple with BeautifulSoup:

from BeautifulSoup import BeautifulSoup 

[element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)] 

# [u'http://example.com/', u'/example', ...] 

One or two last things: you can use urlparse.urljoin to make all of the URLs absolute. If you need the link text, you can use something like element.contents[0].
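To illustrate how urljoin resolves different kinds of href values, here is a minimal sketch. It assumes Python 3, where urlparse.urljoin lives at urllib.parse.urljoin; the helper name make_absolute and the example URLs are hypothetical and only serve the illustration.

```python
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin


def make_absolute(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper)."""
    return [urljoin(base_url, href) for href in hrefs]


# Assumed example values, not taken from a real page.
base = "http://example.com/docs/index.html"
links = make_absolute(base, ["/example", "page2.html", "http://other.example/"])
print(links)
```

A root-relative href replaces the whole path, a bare file name is resolved against the directory of the base URL, and an href that is already absolute is returned unchanged.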

And here is how you would tie it all together:

import urllib2 
import urlparse 
from BeautifulSoup import BeautifulSoup 

def get_all_link_targets(url): 
    return [urlparse.urljoin(url, tag['href']) for tag in 
      BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)] 
+1

Excellent choice. – 2010-11-10 00:25:22

+0

Wow, this is cool – Tom 2010-11-10 03:12:17

0

There is an article on using HTMLParser to get the URLs from the <a> tags on a webpage.

The code looks like this:

from HTMLParser import HTMLParser 
from urllib2 import urlopen 

class Spider(HTMLParser): 

    def __init__(self, url): 
     HTMLParser.__init__(self) 
     req = urlopen(url) 
     self.feed(req.read()) 

    def handle_starttag(self, tag, attrs): 
     if tag == 'a' and attrs: 
      print "Found link => %s" % attrs[0][1] 

Spider('http://www.python.org') 

If you run the script, you will get output like this:

 
$ python crawler.py 
Found link => / 
Found link => #left-hand-navigation 
Found link => #content-body 
Found link => /search 
Found link => /about/ 
Found link => /news/ 
Found link => /doc/ 
Found link => /download/ 
Found link => /community/ 
Found link => /psf/ 
Found link => /dev/ 
Found link => /about/help/ 
Found link => http://pypi.python.org/pypi 
Found link => /download/releases/2.7/ 
Found link => http://docs.python.org/ 
Found link => /ftp/python/2.7/python-2.7.msi 
Found link => /ftp/python/2.7/Python-2.7.tar.bz2 
Found link => /download/releases/3.1.2/ 
Found link => http://docs.python.org/3.1/ 
Found link => /ftp/python/3.1.2/python-3.1.2.msi 
Found link => /ftp/python/3.1.2/Python-3.1.2.tar.bz2 
Found link => /community/jobs/ 
Found link => /community/merchandise/ 
Found link => margin-top:1.5em 
Found link => margin-top:1.5em 
Found link => margin-top:1.5em 
Found link => color:#D58228; margin-top:1.5em 
Found link => /psf/donations/ 
Found link => http://wiki.python.org/moin/Languages 
Found link => http://wiki.python.org/moin/Languages 
Found link => http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics 
Found link => http://wiki.python.org/moin/Python2orPython3 
Found link => http://pypi.python.org/pypi 
Found link => /3kpoll 
Found link => /about/success/usa/ 
Found link => reference 
Found link => reference 
Found link => reference 
Found link => reference 
Found link => reference 
Found link => reference 
Found link => /about/quotes 
Found link => http://wiki.python.org/moin/WebProgramming 
Found link => http://wiki.python.org/moin/CgiScripts 
Found link => http://www.zope.org/ 
Found link => http://www.djangoproject.com/ 
Found link => http://www.turbogears.org/ 
Found link => http://wiki.python.org/moin/PythonXml 
Found link => http://wiki.python.org/moin/DatabaseProgramming/ 
Found link => http://www.egenix.com/files/python/mxODBC.html 
Found link => http://sourceforge.net/projects/mysql-python 
Found link => http://wiki.python.org/moin/GuiProgramming 
Found link => http://wiki.python.org/moin/WxPython 
Found link => http://wiki.python.org/moin/TkInter 
Found link => http://wiki.python.org/moin/PyGtk 
Found link => http://wiki.python.org/moin/PyQt 
Found link => http://wiki.python.org/moin/NumericAndScientific 
Found link => http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html 
Found link => http://www.pentangle.net/python/handbook/ 
Found link => /community/sigs/current/edu-sig 
Found link => http://www.openbookproject.net/pybiblio/ 
Found link => http://osl.iu.edu/~lums/swc/ 
Found link => /about/apps 
Found link => http://docs.python.org/howto/sockets.html 
Found link => http://twistedmatrix.com/trac/ 
Found link => /about/apps 
Found link => http://buildbot.net/trac 
Found link => http://www.edgewall.com/trac/ 
Found link => http://roundup.sourceforge.net/ 
Found link => http://wiki.python.org/moin/IntegratedDevelopmentEnvironments 
Found link => /about/apps 
Found link => http://www.pygame.org/news.html 
Found link => http://www.alobbs.com/pykyra 
Found link => http://www.vrplumber.com/py3d.py 
Found link => /about/apps 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => reference external 
Found link => /channews.rdf 
Found link => /about/website 
Found link => http://www.xs4all.com/ 
Found link => http://www.timparkin.co.uk/ 
Found link => /psf/ 
Found link => /about/legal 
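
Several lines in the output above, such as margin-top:1.5em and reference external, are not URLs. That happens because attrs[0][1] is the value of whichever attribute appears first in the tag, which may be style or class rather than href. A minimal sketch of looking up href by name instead, assuming Python 3 (where the module is html.parser) and using the hypothetical names LinkCollector and extract_links:

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical class name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; look href up by name
        # instead of assuming it is the first attribute.
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


def extract_links(html):
    """Return the href of every <a> tag in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Assumed sample markup, chosen so that href is never the first attribute.
sample = (
    '<a class="reference" href="/about/">About</a>'
    '<a style="margin-top:1.5em" href="/psf/">PSF</a>'
    '<a name="top">no href</a>'
)
print(extract_links(sample))
```

Collecting the links in a list rather than printing them also makes the result easier to post-process, for example with urljoin.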

You can use a regular expression to distinguish absolute URLs from relative URLs.
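
One possible way to do that split is sketched below. The pattern and the helper name split_links are assumptions made for illustration: an "absolute" URL is taken to be one that starts with a scheme followed by ://, so mailto: links and protocol-relative //host links end up in the relative list.

```python
import re

# A scheme is a letter followed by letters, digits, "+", "-" or ".".
# Requiring "://" keeps values such as "margin-top:1.5em" from matching.
ABSOLUTE_URL = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def split_links(links):
    """Return (absolute, relative) lists, preserving the input order."""
    absolute = [link for link in links if ABSOLUTE_URL.match(link)]
    relative = [link for link in links if not ABSOLUTE_URL.match(link)]
    return absolute, relative


# Sample values taken from the output listed above.
found = ["http://docs.python.org/", "/about/", "#content-body", "margin-top:1.5em"]
absolute, relative = split_links(found)
print(absolute)
print(relative)
```

Relative entries can then be resolved against the page URL with urljoin, while stray attribute values such as margin-top:1.5em still need to be filtered out separately.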

0

A solution using libxml.

import urllib 
import libxml2 
parse_opts = libxml2.HTML_PARSE_RECOVER + \ 
      libxml2.HTML_PARSE_NOERROR + \ 
      libxml2.HTML_PARSE_NOWARNING 

doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts) 
print [ i.getContent() for i in doc.xpathNewContext().xpathEval("//a/@href") ] 