I have to write a web crawler in Python. I don't know how to parse a page and extract the URLs from the HTML. Where should I go to learn how to write such a program? In other words, is there a simple Python program that can serve as a template for a generic web crawler? Ideally it should use relatively simple modules and include plenty of comments describing what each line of code does.
Take a look at the sample code below. The script fetches the HTML of a web page (here, the Python home page) and extracts all the links on that page. Hope this helps.
#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)

# Parse the HTML, then treat it as one long string to scan for links.
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    """
    :param page: html of web page (here: Python home page)
    :return: the first url found in that page, and the position to resume scanning from
    """
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote


while True:
    url, n = getURL(page)
    page = page[n:]  # drop the part of the page already scanned
    if url:
        print(url)
    else:
        break
Output:
/
#left-hand-navigation
#content-body
/search
/about/
/news/
/doc/
/download/
/getit/
/community/
/psf/
http://docs.python.org/devguide/
/about/help/
http://pypi.python.org/pypi
/download/releases/2.7.3/
http://docs.python.org/2/
/ftp/python/2.7.3/python-2.7.3.msi
/ftp/python/2.7.3/Python-2.7.3.tar.bz2
/download/releases/3.3.0/
http://docs.python.org/3/
/ftp/python/3.3.0/python-3.3.0.msi
/ftp/python/3.3.0/Python-3.3.0.tar.bz2
/community/jobs/
/community/merchandise/
/psf/donations/
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/Languages
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.psfmember.org
...
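Note that scanning the raw HTML string for "a href" as above is fragile: it misses single-quoted or unquoted attribute values. As an alternative sketch using only the standard library, the html.parser module can collect every href properly; the tiny HTML snippet here is made up purely for illustration:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# A hand-written fragment standing in for a fetched page; note the two quoting styles.
html_doc = '<a href="/about/">About</a> <a href=\'/news/\'>News</a>'
parser = LinkExtractor()
parser.feed(html_doc)
print(parser.links)  # → ['/about/', '/news/']
```

In a real crawler you would pass the downloaded page text to `feed()` instead of the hand-written fragment.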
For parsing pages, check out the BeautifulSoup module. It is simple to use and lets you parse a page as HTML, so you can extract the URLs from the HTML with very little code. Follow the documentation to see what fits your requirements; it also contains a code snippet showing how to extract URLs.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
soup.find_all('a')  # Finds all <a> tags; read each tag's 'href' attribute to get the URL.
import re
import urllib.request
import urllib.parse

tocrawl = set(["http://www.facebook.com/"])
crawled = set()
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while True:
    try:
        crawling = tocrawl.pop()
        print(crawling)
    except KeyError:
        # Nothing left to crawl.
        break
    url = urllib.parse.urlparse(crawling)
    try:
        response = urllib.request.urlopen(crawling)
    except Exception:
        continue
    msg = response.read().decode('utf-8', errors='replace')
    # Print the page title, if present.
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print(title)
    # Print the meta keywords, if present.
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0].split(", ")
        print(keywordlist)
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        # Turn relative links into absolute ones before queueing them.
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)
Reference: Python Web Crawler in Less Than 50 Lines (slow, or no longer working — it does not load for me)
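The hand-rolled fix-ups above (prefixing "http://" plus the host) can be replaced by the standard library's urljoin, which handles absolute paths, fragments, and relative paths in one call. A minimal sketch; the base URL here is an arbitrary example:

```python
from urllib.parse import urljoin

# An arbitrary page used as the base for resolving partial links.
base = "http://www.facebook.com/some/page"

print(urljoin(base, "/help/"))      # → http://www.facebook.com/help/
print(urljoin(base, "#section"))    # → http://www.facebook.com/some/page#section
print(urljoin(base, "other"))       # → http://www.facebook.com/some/other
print(urljoin(base, "http://example.com/"))  # already absolute → http://example.com/
```

Inside the crawler loop you would write `link = urljoin(crawling, link)` instead of the three startswith branches.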
You can use BeautifulSoup, as many have also pointed out. It can parse HTML, XML, and so on. To see some of its features, see here.
Example:
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.google.co.in/'
conn = urllib.request.urlopen(url)
html = conn.read()

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')

for tag in links:
    link = tag.get('href', None)
    if link is not None:
        print(link)