2012-05-07
0

Hello! I have this script to get the links from a web page in Python:

import urllib2 
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; for bs4 it is "from bs4 import BeautifulSoup"

URL = "http://www.hitmeister.de/" 

page = urllib2.urlopen(URL).read() 
soup = BeautifulSoup(page) 

links = soup.findAll('a') 

for link in links: 
    print link['href'] 

This should get the links from the page, but it doesn't. What could be the problem? I also tried using a User-Agent header, with no result, yet the same script works fine on other pages.

+0

You may want to take a look at the script on this page: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – 2012-05-07 11:57:52

+0

Tried your script, and it works for me after adding the relevant imports ('from bs4 import BeautifulSoup' and 'import urllib2'). Which version of BS are you using? –

+0

I am using BeautifulSoup 3.2.0-2build1; I tried installing bs4 and it did not work – user873286

Answers

3

BeautifulSoup gives a very good error message here. Have you read it and followed its advice?

/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py:149: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py", line 150, in feed
    raise e
HTMLParser.HTMLParseError: malformed start tag, at line 57, column 872
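The fix the message suggests is to install one of the external parsers and name it explicitly when constructing the soup. A minimal sketch, assuming bs4 and lxml are installed (the HTML fragment is made up for illustration, standing in for the fetched page):

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4 lxml

# A small fragment standing in for the fetched page (made up for illustration).
html = '<p><a href="/one">one</a> <a href="/two">two</a>'

# Naming "lxml" makes Beautiful Soup use the external parser instead of the
# built-in HTMLParser, and lxml is far more tolerant of malformed markup.
soup = BeautifulSoup(html, 'lxml')
print([a['href'] for a in soup.find_all('a')])  # ['/one', '/two']
```

With the old BeautifulSoup 3 the questioner is using there is no parser argument at all, which is another reason to upgrade to bs4.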

0
import urllib 
import lxml.html 
import urlparse 

def get_dom(url): 
    connection = urllib.urlopen(url) 
    return lxml.html.fromstring(connection.read()) 

def get_links(url): 
    return resolve_links(get_dom(url).xpath('//a/@href')) 

def guess_root(links): 
    # Take the scheme and host of the first absolute link found.
    for link in links: 
        if link.startswith('http'): 
            parsed_link = urlparse.urlparse(link) 
            return parsed_link.scheme + '://' + parsed_link.netloc 

def resolve_links(links): 
    # Materialize the links first: guess_root() iterates over them, and a
    # generator would be partially consumed by that pass, dropping links.
    links = list(links) 
    root = guess_root(links) 
    for link in links: 
        if not link.startswith('http'): 
            link = urlparse.urljoin(root, link) 
        yield link 


for link in get_links('http://www.google.com'): 
    print link 
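The key step in resolve_links() above is urlparse.urljoin, which resolves a relative href against the site root while leaving absolute URLs untouched. A minimal sketch of that behavior (shown with Python 3's urllib.parse, where the same function lives; the URLs are illustrative):

```python
# On Python 3, urljoin lives in urllib.parse (urlparse on Python 2).
from urllib.parse import urljoin

root = "http://www.hitmeister.de"  # root as in the question's URL

# A relative path is resolved against the root.
print(urljoin(root, "/hilfe"))                 # http://www.hitmeister.de/hilfe

# An already-absolute link passes through unchanged.
print(urljoin(root, "http://example.com/x"))   # http://example.com/x
```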