9

I have to write a web crawler in Python, but I don't know how to parse a page and extract the URLs from its HTML. Where should I go to learn how to write such a program? In short: how do I extract URLs from an HTML page in Python?

In other words, is there a simple Python program that can serve as a template for a generic web crawler? Ideally, it should use relatively simple modules and include plenty of comments describing what each line of code does.

Answers

16

Take a look at the example code below. The script fetches the HTML of a web page (here, the Python home page) and extracts all of the links on that page. Hope this helps.

#!/usr/bin/env python

import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)
# parse the HTML and normalize it back to a string
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    """
    :param page: html of a web page (here: the Python home page)
    :return: the first url in that page, and the index just past it
    """
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    # the url sits between the pair of double quotes that follows 'a href'
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote


# repeatedly take the first link, then drop the part of the page already scanned
while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print(url)
    else:
        break

Output:

/ 
#left-hand-navigation 
#content-body 
/search 
/about/ 
/news/ 
/doc/ 
/download/ 
/getit/ 
/community/ 
/psf/ 
http://docs.python.org/devguide/ 
/about/help/ 
http://pypi.python.org/pypi 
/download/releases/2.7.3/ 
http://docs.python.org/2/ 
/ftp/python/2.7.3/python-2.7.3.msi 
/ftp/python/2.7.3/Python-2.7.3.tar.bz2 
/download/releases/3.3.0/ 
http://docs.python.org/3/ 
/ftp/python/3.3.0/python-3.3.0.msi 
/ftp/python/3.3.0/Python-3.3.0.tar.bz2 
/community/jobs/ 
/community/merchandise/ 
/psf/donations/ 
http://wiki.python.org/moin/Languages 
http://wiki.python.org/moin/Languages 
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics 
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics 
http://pycon.org/#calendar 
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics 
http://pycon.org/#calendar 
http://www.psfmember.org 

...
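A caveat on the approach above: scanning the raw HTML with str.find is fragile; it misses single-quoted attributes and will match text inside comments or scripts. As a minimal sketch of a more robust alternative (same page, same requests/bs4 setup as above), BeautifulSoup can collect the hrefs directly:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.python.org")
soup = BeautifulSoup(response.content, "html.parser")

# href=True skips anchor tags that have no href attribute at all
for a in soup.find_all("a", href=True):
    print(a["href"])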

3

You can use BeautifulSoup to extract URLs from the HTML. Follow the documentation and see what fits your requirements. The documentation also contains a code snippet on how to extract URLs.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

soup.find_all('a')  # finds every <a> tag in the html doc
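Note that find_all('a') returns the tags themselves, not the URLs. A short follow-up step (a sketch, assuming soup was built as above) pulls out the href values:

# keep only <a> tags that actually carry an href attribute
urls = [a['href'] for a in soup.find_all('a', href=True)]
print(urls)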
5
import re
import urllib.parse
import urllib.request

tocrawl = set(["http://www.facebook.com/"])
crawled = set()
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while True:
    try:
        crawling = tocrawl.pop()
        print(crawling)
    except KeyError:
        break  # nothing left to crawl
    url = urllib.parse.urlparse(crawling)
    try:
        response = urllib.request.urlopen(crawling)
    except Exception:
        continue
    msg = response.read().decode("utf-8", errors="ignore")
    # extract the page title, if present
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print(title)
    # extract the meta keywords, if present
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0].split(", ")
        print(keywordlist)
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        # turn relative links into absolute ones
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)

Reference: Python Web Crawler in Less Than 50 Lines (slow, or no longer working; it did not load for me)
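One design note on the crawler above: building absolute URLs by hand with 'http://' + url[1] + ... breaks on https sites and on relative paths such as ../page.html. A minimal sketch of the standard-library alternative, urllib.parse.urljoin, resolving links against a hypothetical page URL:

from urllib.parse import urljoin

base = "https://www.example.com/docs/index.html"  # hypothetical page the links came from

# urljoin resolves each link relative to the page it appeared on
print(urljoin(base, "/search"))      # https://www.example.com/search
print(urljoin(base, "guide.html"))   # https://www.example.com/docs/guide.html
print(urljoin(base, "../about/"))    # https://www.example.com/about/
print(urljoin(base, "#calendar"))    # https://www.example.com/docs/index.html#calendar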

12

You can use BeautifulSoup, as many have also pointed out. It can parse HTML, XML, and more. To see some of its features, see here.

Example:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.google.co.in/'

conn = urllib.request.urlopen(url)
html = conn.read()

soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')

for tag in links:
    link = tag.get('href', None)
    if link is not None:
        print(link)
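If only the links are needed, the parse itself can be narrowed. A small sketch using bs4's SoupStrainer (assuming the same html variable as above), which builds a tree containing only the <a> tags:

from bs4 import BeautifulSoup, SoupStrainer

# parse only the <a> tags instead of the whole document
only_a_tags = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_a_tags)

for tag in soup.find_all('a', href=True):
    print(tag['href'])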