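Tips for modifying Python web scraping code using functions
I am trying to write a Python script with BeautifulSoup that crawls the page http://tbc-python.fossee.in/completed-books/ and collects the data I need from it. Basically, it has to extract every page loading error, SyntaxError, NameError, AttributeError, etc. found in the chapters of all the books into a text file, errors.txt. There are about 273 books. The script does the job well, and I am on a fast connection, but the code takes a long time to get through all the books. Please help me optimize the script with the necessary changes, perhaps by using functions. Thanks.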
import urllib2
from bs4 import BeautifulSoup

website = "http://tbc-python.fossee.in/completed-books/"
soup = BeautifulSoup(urllib2.urlopen(website))
errors = open('errors.txt', 'w')

# Completed books webpage has data stored in table format
BookTable = soup.find('table', {'class': 'table table-bordered table-hover'})
for BookCount, BookRow in enumerate(BookTable.find_all('tr'), start=1):
    # Grab book names
    BookCol = BookRow.find_all('td')
    BookName = BookCol[1].a.string.strip()
    print "%d: %s" % (BookCount, BookName)
    # Open each book
    BookSrc = BeautifulSoup(urllib2.urlopen('http://tbc-python.fossee.in%s' % BookCol[1].a.get("href")))
    ChapTable = BookSrc.find('table', {'class': 'table table-bordered table-hover'})
    # Check if each chapter page opens; if not, store book & chapter name in errors.txt
    for ChapRow in ChapTable.find_all('tr'):
        ChapCol = ChapRow.find_all('td')
        # Drop non-ASCII characters to avoid: 'ascii' codec can't encode character u'\xef'
        ChapName = ChapCol[0].a.string.strip().encode('ascii', 'ignore')
        ChapLink = 'http://tbc-python.fossee.in%s' % ChapCol[0].a.get("href")
        try:
            ChapSrc = BeautifulSoup(urllib2.urlopen(ChapLink))
        except Exception:
            print '\t%s\n\tPage error' % ChapName
            errors.write("Page;%s;%s;%s;%s\n" % (BookCount, BookName, ChapName, ChapLink))
            continue
        # Check for errors in chapters and store them in errors.txt
        EgError = ChapSrc.find_all('div', {'class': 'output_subarea output_text output_error'})
        if EgError:
            for e, i in enumerate(EgError, start=1):
                # Each marker needs its own "in" test
                text = i.pre.get_text()
                if 'ipython-input' in text or 'Error' in text:
                    errors.write("Example;%s;%s;%s;%s\n" % (BookCount, BookName, ChapName, ChapLink))
            print '\t%s\n\tExample errors: %d' % (ChapName, e)
errors.close()
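Since the question asks about using functions, one possible first step is to pull the network-free pieces of the loop into small helpers. This is only an illustrative sketch: the names `ERROR_MARKERS`, `is_error_output`, and `format_record` are hypothetical and do not appear in the script. It checks each marker with its own `in` test, because an expression of the form `'a' or 'b' in text` is always truthy in Python.

```python
# Hypothetical helpers; the names are illustrative, not from the original script.
ERROR_MARKERS = ("ipython-input", "Error")


def is_error_output(text):
    """Return True if the output text contains any known error marker."""
    return any(marker in text for marker in ERROR_MARKERS)


def format_record(kind, book_count, book_name, chap_name, chap_link):
    """Build one semicolon-separated line for errors.txt."""
    fields = (kind, str(book_count), book_name, chap_name, chap_link)
    return ";".join(fields) + "\n"
```

Because these helpers make no network calls, they can be checked quickly against a few sample strings before running the full crawl.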
@OneOfOne I am using one connection at a time. Any other suggestions? Thanks.
@ThirumaleshHS I don't see an obvious way to make it faster, but maybe someone else will. Good luck. – OneOfOne
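Because the script opens one connection at a time, much of the run is probably spent waiting on the network. One possible direction, offered as an assumption rather than a measured fix, is to download the chapter pages concurrently with a small thread pool. The sketch below assumes Python 3's `concurrent.futures` (on Python 2 it would need the `futures` backport); `fetch`, `fetch_all`, the 30-second timeout, and the `workers` value are hypothetical choices, not part of the original script.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    from urllib2 import urlopen          # Python 2, as in the original script
except ImportError:
    from urllib.request import urlopen   # Python 3


def fetch(url):
    """Download one page; return (url, html, error), with error None on success."""
    try:
        return url, urlopen(url, timeout=30).read(), None
    except Exception as exc:  # report the failure instead of stopping the run
        return url, None, exc


def fetch_all(urls, fetcher=fetch, workers=8):
    """Fetch URLs concurrently; results come back in the same order as urls."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetcher, urls))
```

The returned HTML can then be parsed with BeautifulSoup in the main thread, and entries whose error is not None correspond to the page errors the script records. A modest `workers` value keeps the load on the server reasonable.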