2012-05-07
0

Hello! I have this script to get the links from a web page in Python:

import urllib2 
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; for bs4 it is "from bs4 import BeautifulSoup"

URL = "http://www.hitmeister.de/" 

page = urllib2.urlopen(URL).read() 
soup = BeautifulSoup(page) 

links = soup.findAll('a') 

for link in links: 
    print link['href'] 

This should get the links from the page, but it doesn't. What could be the problem? I also tried using a User-Agent header, with no result, yet the same script works fine on other pages.

+0

You may want to take a look at the script on this page: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – 2012-05-07 11:57:52

+0

Tried your script, and it works for me after adding the relevant imports ('from bs4 import BeautifulSoup' and 'import urllib2'). Which version of BS are you using? –

+0

I am using BeautifulSoup 3.2.0-2build1; I tried installing bs4 and it did not work – user873286

Answers

3

BeautifulSoup gives a very good error message here. Have you read it and followed its advice?

/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py:149: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py", line 150, in feed
    raise e
HTMLParser.HTMLParseError: malformed start tag, at line 57, column 872
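The fix the message suggests is to install one of the external parsers and name it explicitly when constructing the soup. A minimal sketch, assuming bs4 and lxml are installed (the HTML fragment is made up for illustration, standing in for the fetched page):

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4 lxml

# A small fragment standing in for the fetched page (made up for illustration).
html = '<p><a href="/one">one</a> <a href="/two">two</a>'

# Naming "lxml" makes Beautiful Soup use the external parser instead of the
# built-in HTMLParser, and lxml is far more tolerant of malformed markup.
soup = BeautifulSoup(html, 'lxml')
print([a['href'] for a in soup.find_all('a')])  # ['/one', '/two']
```

With the old BeautifulSoup 3 the questioner is using there is no parser argument at all, which is another reason to upgrade to bs4.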

0
import urllib 
import lxml.html 
import urlparse 

def get_dom(url): 
    connection = urllib.urlopen(url) 
    return lxml.html.fromstring(connection.read()) 

def get_links(url): 
    return resolve_links(get_dom(url).xpath('//a/@href')) 

def guess_root(links): 
    # Take the scheme and host of the first absolute link found.
    for link in links: 
        if link.startswith('http'): 
            parsed_link = urlparse.urlparse(link) 
            return parsed_link.scheme + '://' + parsed_link.netloc 

def resolve_links(links): 
    # Materialize the links first: guess_root() iterates over them, and a
    # generator would be partially consumed by that pass, dropping links.
    links = list(links) 
    root = guess_root(links) 
    for link in links: 
        if not link.startswith('http'): 
            link = urlparse.urljoin(root, link) 
        yield link 


for link in get_links('http://www.google.com'): 
    print link 
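The key step in resolve_links() above is urlparse.urljoin, which resolves a relative href against the site root while leaving absolute URLs untouched. A minimal sketch of that behavior (shown with Python 3's urllib.parse, where the same function lives; the URLs are illustrative):

```python
# On Python 3, urljoin lives in urllib.parse (urlparse on Python 2).
from urllib.parse import urljoin

root = "http://www.hitmeister.de"  # root as in the question's URL

# A relative path is resolved against the root.
print(urljoin(root, "/hilfe"))                 # http://www.hitmeister.de/hilfe

# An already-absolute link passes through unchanged.
print(urljoin(root, "http://example.com/x"))   # http://example.com/x
```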