I want to create a simple web crawler for fun. I need the crawler to get a list of all the links on a page. Does any Python library have built-in functionality to make this easier? Any pointers are appreciated. What is a simple way to extract the list of URLs on a web page using Python?
2

Answers

7
This is actually quite simple with BeautifulSoup.
from BeautifulSoup import BeautifulSoup
[element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)]
# [u'http://example.com/', u'/example', ...]
One last thing: you can use urlparse.urljoin to make all the URLs absolute. If you need the link text, you can use something like element.contents[0].
And here is how you would put it all together:
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
def get_all_link_targets(url):
    return [urlparse.urljoin(url, tag['href']) for tag in
            BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)]
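On Python 3, where urllib2 and the old BeautifulSoup import no longer exist, the same idea can be sketched with just the standard library. The class and function names below are invented for illustration, and the parser takes the HTML as a string rather than fetching it itself:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags while parsing."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')  # safer than assuming href comes first
            if href:
                self.links.append(href)

def all_link_targets(html, base_url):
    """Return every <a href> in html, resolved against base_url."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

page = '<a href="/example">x</a> <a href="http://example.com/">y</a>'
print(all_link_targets(page, 'http://example.com/page.html'))
# -> ['http://example.com/example', 'http://example.com/']
```

To crawl a live page, you would feed it urllib.request.urlopen(url).read().decode() instead of a literal string.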
+1
Excellent choice. – 2010-11-10 00:25:22
+0
Wow, this is cool – Tom 2010-11-10 03:12:17
0
There is an article on using HTMLParser to get the URLs from the <a> tags on a web page.
The code goes like this:
from HTMLParser import HTMLParser
from urllib2 import urlopen
class Spider(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        # NB: attrs[0][1] assumes href is the tag's first attribute
        if tag == 'a' and attrs:
            print "Found link => %s" % attrs[0][1]

Spider('http://www.python.org')
If you run the script, you will get output like this:

[email protected]:~> python crawler.py
Found link => /
Found link => #left-hand-navigation
Found link => #content-body
Found link => /search
Found link => /about/
Found link => /news/
Found link => /doc/
Found link => /download/
Found link => /community/
Found link => /psf/
Found link => /dev/
...
Found link => http://www.timparkin.co.uk/
Found link => /psf/
Found link => /about/legal
You can use a regular expression to distinguish absolute URLs from relative ones.
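Instead of a regular expression, urlparse (urllib.parse in Python 3) can make the absolute/relative distinction reliably; a minimal sketch, with the helper name invented here:

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def is_absolute(href):
    """An href is absolute if it carries both a scheme and a network location."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)

print(is_absolute('http://www.python.org/'))  # -> True
print(is_absolute('/about/'))                 # -> False
print(is_absolute('#content-body'))           # -> False
```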
0
A solution done with libxml:
import urllib
import libxml2
# recover from broken HTML, silencing errors and warnings
parse_opts = libxml2.HTML_PARSE_RECOVER + \
             libxml2.HTML_PARSE_NOERROR + \
             libxml2.HTML_PARSE_NOWARNING

doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts)
print [i.getContent() for i in doc.xpathNewContext().xpathEval("//a/@href")]
There is no simple way to do this. HTML parsing is not easy. – Falmarri 2010-11-10 00:11:11
You could try handling '<a>' tags with an 'HTMLParser' handler, but you probably won't catch every URL that way, and actually getting the URL out of the '<a>' tag may take some magic. – 2010-11-10 00:15:19
There are very simple ways to do very complicated things; that's what libraries are for. – 2010-11-10 00:38:47