2013-12-23 144 views

Parsing an HTML table

I have an HTML table that I need to parse into a CSV file.

import urllib2, datetime
from BeautifulSoup import BeautifulSoup
olddate = datetime.datetime.strptime('5/01/13', "%m/%d/%y")
print("dates,location,name,url")
def genqry(arga, argb, argc, argd):
    return arga + "," + argb + "," + argc + "," + argd
part = 1
row = 1
contenturl = "http://www.robotevents.com/robot-competitions/vex-robotics-competition"
soup = BeautifulSoup(urllib2.urlopen(contenturl).read())
table = soup.find('table', attrs={'class': 'catalog-listing'})
rows = table.findAll('tr')
for tr in rows:
    try:
        if row != 1:
            cols = tr.findAll('td')
            for td in cols:
                if part == 1:
                    keep = 0
                    dates = td.find(text=True)
                    part = 2
                if part == 2:
                    location = td.find(text=True)
                    part = 2
                if part == 3:
                    name = td.find(text=True)
                    for a in tr.findAll('a', href=True):
                        url = a['href']
                # Compare Dates
                if len(dates) < 6:
                    newdate = datetime.datetime.strptime(dates, "%m/%d/%y")
                    if newdate > olddate:
                        keep = 1
                    else:
                        keep = 0
                else:
                    newdate = datetime.datetime.strptime(dates[:6], "%m/%d/%y")
                    if newdate > olddate:
                        keep = 1
                    else:
                        keep = 0
                if keep == 1:
                    qry = genqry(dates, location, name, url)
                    print(qry)
                row = row + 1
                part = 1
        else:
            row = row + 1
    except (RuntimeError, TypeError, NameError):
        print("Error: " + name)

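I need to get every VEX event after 5/01/13. So far, this code gives me a date-related error that I can't seem to fix. Maybe someone more experienced than me can fix this code? Thanks in advance, Smith.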

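Edit #1: I think I need to strip the newline from the start of the string first. This is the error I get:

ValueError: '\n10/5/13' does not match format '%m/%d/%y'

Edit #2: I got it running, but there is no output. Any help?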


You don't need Beautiful Soup for this. You can use the Python 3 HTMLParser: https://github.com/schmijos/html-table-parser-python3/blob/master/html_table_parser/parser.py – schmijos
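
To illustrate the idea in the comment above, here is a minimal sketch that collects table cells with only the standard-library html.parser module. It is not the linked parser; the class name TableCellParser and the sample markup are made up for illustration, and real pages with nested tables or unclosed tags would need more care.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; it does not handle nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse newlines, runs of spaces and non-breaking spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up sample markup, not the real robotevents.com page.
sample = (
    "<table>"
    "<tr><th>Date</th><th>Name</th></tr>"
    "<tr><td>\n10/5/13 - 12/14/13 &nbsp;\n</td>"
    "<td><a href='/e/1'>Event</a></td></tr>"
    "</table>"
)

parser = TableCellParser()
parser.feed(sample)
parser.close()
print(parser.rows)
```

Each entry in parser.rows is one table row as a list of cleaned cell strings, which can then be handed to csv.writer instead of joining the fields with commas by hand.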

Answer


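Your question is poorly posed. Without knowing the exact error, my guess is that the problem lies in your if len(dates) < 6: block. Even a single date such as '11/9/13' is 7 characters long, so the first branch never runs, and slicing with [:6] cuts off part of the year. Consider the following: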

>>> date = '10/5/13 - 12/14/13' 
>>> len(date) 
18 
>>> date = '11/9/13' 
>>> len(date) 
7 
>>> date[:6] 
'11/9/1' 
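
One way around the fixed-width slice, sketched here under the assumption that a cell holds either a single date or a range written as 'start - end', is to split on the hyphen and parse each piece. The name parse_date_range is hypothetical and not part of the original code.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%y"


def parse_date_range(text):
    """Return (start, end) datetimes for '11/9/13' or '10/5/13 - 12/14/13'.

    Assumes the dates themselves contain no hyphen. For a single date,
    start and end are the same value.
    """
    parts = [piece.strip() for piece in text.split("-")]
    start = datetime.strptime(parts[0], DATE_FORMAT)
    end = datetime.strptime(parts[-1], DATE_FORMAT)
    return start, end


single = parse_date_range("11/9/13")
ranged = parse_date_range("10/5/13 - 12/14/13")
print(single)
print(ranged)
```

Because the split does not depend on how many digits the month or day has, the same code handles '1/4/14' and '10/25/13' alike.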

A suggestion to make your code more Pythonic: instead of doing row = row + 1, use enumerate.
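
As a rough sketch of that suggestion, the loop below numbers the rows with enumerate and skips the header without a hand-maintained counter. The list of strings stands in for the tr elements returned by findAll, and data_rows is a hypothetical helper name.

```python
# Stand-ins for the <tr> elements; the first one plays the header row.
rows = ["header", "event A", "event B"]


def data_rows(rows):
    """Yield (row_number, row) for every row except the header."""
    for row_number, tr in enumerate(rows, start=1):
        if row_number == 1:
            continue  # skip the header row
        yield row_number, tr


for row_number, tr in data_rows(rows):
    print(row_number, tr)
```

If the row number is not needed at all, iterating over rows[1:] skips the header just as well.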

Update: tracing through your code, I get the following value for dates:

>>> dates 
u'\n10/5/13 - 12/14/13   \xa0\n  '
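
Given a value like that, a possible cleanup (a sketch, assuming Python 3 and that the first whitespace-separated token is always the start date) is to collapse all whitespace before parsing. In Python 3, str.split() with no arguments also treats the non-breaking space \xa0 as whitespace. The names CUTOFF and start_date_from_cell are made up for this example.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%y"
CUTOFF = datetime.strptime("5/01/13", DATE_FORMAT)


def start_date_from_cell(raw):
    """Parse the first date found in a raw table-cell string.

    Raises ValueError if the cell is empty or does not start with a date.
    """
    # split() with no arguments drops newlines, spaces and '\xa0' alike.
    cleaned = " ".join(raw.split())
    first_token = cleaned.split(" ")[0]
    return datetime.strptime(first_token, DATE_FORMAT)


raw = "\n10/5/13 - 12/14/13   \xa0\n  "
start = start_date_from_cell(raw)
keep = start > CUTOFF
print(start, keep)
```

The keep flag here mirrors the comparison against olddate in the question, so an event starting on 10/5/13 is kept.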