Extract all links from a web page

-3

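I want to write a function that takes the URL of a web page, downloads the page, and returns a list of the URLs on that page (using the urllib module). Any help would be greatly appreciated.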

2011-05-01 matt

What do you have so far? What is your specific problem? – Mat 2011-05-01 11:15:29

How bad is this question? – 2011-05-01 11:19:08

We won't do your homework for you. – 2011-05-01 11:29:17

Here you go:

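# Python 2 script: prints the href of every <a> element on a page.
import sys
import urllib2
import lxml.html

# The URL to scrape is taken from the first command-line argument.
try:
    url = sys.argv[1]
except IndexError:
    print "Specify a url to scrape"
    sys.exit(1)

if not url.startswith("http://"):
    print "Please include the http:// at the beginning of the url"
    sys.exit(1)

# Download the page with urllib2, then parse it with lxml.
html = urllib2.urlopen(url).read()
etree = lxml.html.fromstring(html)

# The XPath expression selects the href attribute of every anchor tag.
for href in etree.xpath("//a/@href"):
    print href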

 
C:\Programming>getlinks.py http://example.com 
/
/domains/ 
/numbers/ 
/protocols/ 
/about/ 
/go/rfc2606 
/about/ 
/about/presentations/ 
/about/performance/ 
/reports/ 
/domains/ 
/domains/root/ 
/domains/int/ 
/domains/arpa/ 
/domains/idn-tables/ 
/protocols/ 
/numbers/ 
/abuse/ 
http://www.icann.org/ 
mailto:[email protected]?subject=General%20website%20feedback
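
Most of the hrefs printed above are relative to the site root. If absolute URLs are needed, one option is to resolve each href against the page URL with urljoin from the standard library. This is only a minimal sketch in Python 3 (where urljoin lives in urllib.parse), and the helper name absolutize is hypothetical, not part of the script above:

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper).

    Relative paths are joined onto base_url; hrefs that are already
    absolute are returned unchanged by urljoin.
    """
    return [urljoin(base_url, href) for href in hrefs]


# A few of the hrefs from the output above, resolved against the page URL.
links = absolutize(
    "http://example.com",
    ["/", "/domains/", "http://www.icann.org/"],
)
for link in links:
    print(link)
```

In Python 2 the same function is available as urlparse.urljoin.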

2011-05-01 11:34:50 Acorn

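+1 for lxml; this is exactly what it is meant for. – 2011-05-01 11:48:47

I have to use the urllib module. – matt 2011-05-01 12:47:24

I edited the script so that it uses urllib2 alone to download the page. – Acorn 2011-05-01 12:57:53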
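
Since the question asks for a function that returns a list and is limited to urllib, here is one possible sketch that avoids lxml by parsing with the standard library's html.parser instead. It assumes Python 3, where urlopen is available as urllib.request.urlopen, and it falls back to UTF-8 when the server does not declare a charset. The names LinkCollector, extract_links and get_links are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the hrefs found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def get_links(url):
    """Download the page at url and return its hrefs as a list."""
    with urlopen(url) as response:
        # Assumption: use the declared charset, or UTF-8 if none is given.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html)
```

Calling get_links("http://example.com") would then return the hrefs as a list instead of printing them. As in the script above, relative links are returned exactly as they appear in the page.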
