Parsing an HTML table

I have an HTML table that I need to parse into a CSV file.
import urllib2, datetime
olddate = datetime.datetime.strptime('5/01/13', "%m/%d/%y")
from BeautifulSoup import BeautifulSoup
print("dates,location,name,url")
def genqry(arga,argb,argc,argd):
    return arga + "," + argb + "," + argc + "," + argd
part = 1
row = 1
contenturl = "http://www.robotevents.com/robot-competitions/vex-robotics-competition"
soup = BeautifulSoup(urllib2.urlopen(contenturl).read())
table = soup.find('table', attrs={'class': 'catalog-listing'})
rows = table.findAll('tr')
for tr in rows:
    try:
        if row != 1:
            cols = tr.findAll('td')
            for td in cols:
                if part == 1:
                    keep = 0
                    dates = td.find(text=True)
                    part = 2
                if part == 2:
                    location = td.find(text=True)
                    part = 2
                if part == 3:
                    name = td.find(text=True)
            for a in tr.findAll('a', href=True):
                url = a['href']
            # Compare Dates
            if len(dates) < 6:
                newdate = datetime.datetime.strptime(dates, "%m/%d/%y")
                if newdate > olddate:
                    keep = 1
                else:
                    keep = 0
            else:
                newdate = datetime.datetime.strptime(dates[:6], "%m/%d/%y")
                if newdate > olddate:
                    keep = 1
                else:
                    keep = 0
            if keep == 1:
                qry = genqry(dates, location, name, url)
                print(qry)
            row = row + 1
            part = 1
        else:
            row = row + 1
    except (RuntimeError, TypeError, NameError):
        print("Error: " + name)
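I need to be able to get every VEX event after 5/01/13. So far, this code gives me an error about the date that I can't seem to fix. Maybe someone more experienced than I am can fix this code? Thanks in advance, Smith.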
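Edit #1: I think I first need to strip the newline character from the start of the string. This is the error I get:
ValueError: '\n10/5/13' does not match format '%m/%d/%y'
Edit #2: I got it running, but it produces no output. Any help?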
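The ValueError quoted in Edit #1 comes from the leading newline in the cell text, so stripping whitespace before calling `strptime` should be enough to get past it. Below is a minimal Python 3 sketch of that idea. The helper names `parse_event_date` and `is_after_cutoff` are made up for illustration, and the code assumes that each date cell starts with a single `m/d/yy` date, possibly followed by more text after whitespace; the real page may format date ranges differently.

```python
import datetime

# Cutoff taken from the question: keep events after 5/01/13.
CUTOFF = datetime.datetime.strptime("5/01/13", "%m/%d/%y")


def parse_event_date(raw):
    """Parse the first whitespace-separated token of a cell as m/d/yy.

    Assumption: the cell text begins with the date; anything after the
    first run of whitespace is ignored.
    """
    tokens = raw.split()  # split() also discards leading/trailing newlines
    if not tokens:
        raise ValueError("empty date cell")
    return datetime.datetime.strptime(tokens[0], "%m/%d/%y")


def is_after_cutoff(raw, cutoff=CUTOFF):
    """Return True if the date in the cell is strictly after the cutoff."""
    return parse_event_date(raw) > cutoff
```

Splitting on whitespace rather than slicing a fixed number of characters avoids cutting a date such as `10/5/13` in the middle of the year. Separately, in the posted loop the `part == 2` branch sets `part = 2` again, so `part` never reaches 3 and `name` is never assigned; that may be worth checking when looking into the missing output from Edit #2.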
You don't have to use Beautiful Soup for this. You can use the python3 HTMLParser: https://github.com/schmijos/html-table-parser-python3/blob/master/html_table_parser/parser.py – schmijos
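To give a rough idea of what the standard-library route can look like, here is a small Python 3 sketch built on `html.parser.HTMLParser` and the `csv` module. It is not the code from the linked repository; the class `TableRowParser` and the function `table_to_csv` are hypothetical names, and the sketch assumes a single, non-nested table whose cells contain plain text.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse newlines and repeated spaces inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table rows found in the HTML string as CSV text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Using `csv.writer` instead of joining fields with `","` means that a cell containing a comma, such as a location, is quoted rather than silently shifting the columns.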