Parsing an HTML table

I have an HTML table that I need to parse into a CSV file.

import urllib2, datetime 
olddate = datetime.datetime.strptime('5/01/13', "%m/%d/%y") 
from BeautifulSoup import BeautifulSoup 
print("dates,location,name,url") 
def genqry(arga,argb,argc,argd): 
    return arga + "," + argb + "," + argc + "," + argd 
part = 1 
row = 1 
contenturl = "http://www.robotevents.com/robot-competitions/vex-robotics-competition" 
soup = BeautifulSoup(urllib2.urlopen(contenturl).read()) 
table = soup.find('table', attrs={'class': 'catalog-listing'}) 
rows = table.findAll('tr') 
for tr in rows: 
    try: 
     if row != 1: 
      cols = tr.findAll('td') 
      for td in cols: 
       if part == 1: 
        keep = 0 
        dates = td.find(text=True) 
        part = 2 
       if part == 2: 
        location = td.find(text=True) 
        part = 2 
       if part == 3: 
        name = td.find(text=True) 
        for a in tr.findAll('a', href=True): 
         url = a['href'] 
       # Compare Dates 
       if len(dates) < 6: 
        newdate = datetime.datetime.strptime(dates, "%m/%d/%y") 
        if newdate > olddate: 
         keep = 1 
        else: 
         keep = 0 
       else: 
        newdate = datetime.datetime.strptime(dates[:6], "%m/%d/%y") 
        if newdate > olddate: 
         keep = 1 
        else: 
         keep = 0 
       if keep == 1: 
        qry = genqry(dates, location, name, url) 
        print(qry) 
       row = row + 1 
       part = 1 
     else: 
      row = row + 1 
    except (RuntimeError, TypeError, NameError): 
     print("Error: " + name)

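I need to be able to get every VEX event after 5/01/13. So far, this code gives me an error about the date that I can't seem to fix. Maybe someone better at this than I am can fix the code? Thanks in advance, Smith.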

Edit #1: I think I first need to strip the newline at the beginning of the string:

Value Error: '\n10/5/13' does not match format '%m/%d/%y'

That is the error I get.

Edit #2: I got it running, but there is no output. Any help?

Source

2013-12-23 John

You don't have to use Beautiful Soup for this. You can use the Python 3 HTMLParser: https://github.com/schmijos/html-table-parser-python3/blob/master/html_table_parser/parser.py – schmijos
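To illustrate the standard-library route mentioned in this comment, here is a minimal sketch built on html.parser.HTMLParser. It is not the code from the linked repository; the TableParser class and the sample_html string are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse all runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample markup, loosely shaped like the table in the question.
sample_html = """
<table class="catalog-listing">
  <tr><th>Dates</th><th>Location</th><th>Name</th></tr>
  <tr><td>
10/5/13 - 12/14/13 &nbsp;
  </td><td>Springfield</td><td><a href="/event/1">Sample Event</a></td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)
parser.close()
table_rows = parser.rows
```

Each entry of table_rows is a list of cleaned cell strings, which can be passed to csv.writer to produce the CSV file.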

Your question is poorly posed. Without knowing the exact error, my guess is that the problem lies in your if len(dates) < 6: block. Consider the following:

>>> date = '10/5/13 - 12/14/13' 
>>> len(date) 
18 
>>> date = '11/9/13' 
>>> len(date) 
7 
>>> date[:6] 
'11/9/1'
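The session above shows that even a single date is 7 characters long, so the len(dates) < 6 branch is never taken and dates[:6] cuts the year short. One possible way around this, sketched here with a hypothetical helper named parse_start_date, is to take the first whitespace-separated token instead of slicing by length:

```python
from datetime import datetime


def parse_start_date(text):
    """Return the first date in an 'M/D/YY' or 'M/D/YY - M/D/YY' string."""
    # split() with no arguments also discards leading and trailing whitespace.
    first_token = text.split()[0]
    return datetime.strptime(first_token, "%m/%d/%y")


single = parse_start_date("11/9/13")
date_range = parse_start_date("10/5/13 - 12/14/13")
```

This assumes the start date always comes first and contains no spaces, which holds for the two sample strings shown here.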

One suggestion to make your code more Pythonic: instead of doing row = row + 1, use enumerate.
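A minimal sketch of the enumerate suggestion follows. The rows list is a placeholder standing in for the result of table.findAll('tr'), and data_rows is a hypothetical name used only to show the effect:

```python
# Placeholder for table.findAll('tr'); the first entry plays the header row.
rows = ["header", "first event", "second event"]

data_rows = []
for index, tr in enumerate(rows):
    if index == 0:
        continue  # skip the header row without a manual counter
    data_rows.append((index, tr))
```

Because enumerate supplies the index, the counter cannot get out of step with the loop the way a hand-incremented row variable can.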

Update: Tracing your code, I get the following value for dates:

>>> dates 
u'\n10/5/13 - 12/14/13   \xa0\n  '
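The value above carries a leading newline, trailing spaces, and a non-breaking space (\xa0), which is why strptime rejects it. As a sketch, assuming the start date is always the first token, the text can be normalized before it is compared with the cutoff. CUTOFF and start_date are names introduced for this example only:

```python
from datetime import datetime

# Cutoff taken from the question: keep events after 5/01/13.
CUTOFF = datetime.strptime("5/01/13", "%m/%d/%y")


def start_date(raw):
    """Parse the first date found in a raw table-cell string."""
    # split() with no arguments splits on any Unicode whitespace,
    # including '\n' and the non-breaking space '\xa0'.
    tokens = raw.split()
    return datetime.strptime(tokens[0], "%m/%d/%y")


raw_dates = "\n10/5/13 - 12/14/13   \xa0\n  "
cleaned = " ".join(raw_dates.split())    # tidy text for the CSV column
keep = start_date(raw_dates) > CUTOFF    # True when the event starts after the cutoff
```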

Source

2013-12-23 00:13:04
