2017-01-25

Here is my situation: my code parses data out of HTML tables in emails. The roadblock I've run into is that some of these tables have blank, empty rows right in the middle, as in the picture below. These blank rows cause my code to fail (IndexError: list index out of range) because it tries to extract text from the cells. How do I get around the IndexError?

Is it possible to tell Python: "OK, if you run into the errors caused by these blank rows, just stop there, take the rows you've pulled text from so far, and execute the rest of the code on those"...?

That sounds like a silly solution, but my project only involves grabbing data from the most recent date in the table, which is always in the first few rows, and always before these blank, empty rows.

So if it's possible to say "if you hit this error, just ignore it and continue", then I'd like to learn how to do that. If not, I'll have to figure out another way around this problem. Any and all help is appreciated.
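What happens is that a blank `<tr>` makes `row.find_all('td')` return an empty list, so the first `columns[...]` lookup raises IndexError. A minimal sketch of the error and the ignore-and-continue idea, simulating the parsed rows as plain lists rather than real BeautifulSoup tags:

```python
# Simulated table rows: a blank <tr> gives row.find_all('td') == [],
# i.e. an empty list, which is what triggers the IndexError.
rows = [['ID', 'Quota'], ['1', '100'], [], ['2', '200']]

kept = []
for columns in rows:
    try:
        kept.append(columns[0])   # IndexError on the empty row
    except IndexError:
        continue                  # ignore the blank row and keep going

print(kept)  # ['ID', '1', '2']
```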

The table with the gap: [image omitted]

My code:

from bs4 import BeautifulSoup, NavigableString, Tag
import pandas as pd
import numpy as np
import os
import re
import email
import cx_Oracle

dsnStr = cx_Oracle.makedsn("sole.nefsc.noaa.gov", "1526", "sole")
con = cx_Oracle.connect(user="user", password="password", dsn=dsnStr)

def celltext(cell):
    '''
    textlist = []
    for br in cell.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next, NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2, Tag) and next2.name == 'br':
            text = str(next).strip()
            if text:
                textlist.append(next)
    return textlist
    '''
    textlist = []
    y = cell.find('span')
    for a in y.childGenerator():
        if isinstance(a, NavigableString):
            textlist.append(str(a))
    return textlist

path = 'Z:\\blub_2'

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        html = open(file_path, 'r').read()
        soup = BeautifulSoup(html, 'lxml')  # Parse the HTML as a string
        table = soup.find_all('table')[1]   # Grab the second table

df_Quota = pd.DataFrame() 

for row in table.find_all('tr'):
    columns = row.find_all('td')
    if columns[0].get_text().strip() != 'ID':  # skip header
        Quota = celltext(columns[1])
        Weight = celltext(columns[2])
        price = celltext(columns[3])

        print(Quota)

        Nrows = max([len(Quota), len(Weight), len(price)])  # get the max number of rows

        IDList = [columns[0].get_text()] * Nrows
        DateList = [columns[4].get_text()] * Nrows

        if price[0].strip() == 'Package':
            price = [columns[3].get_text()] * Nrows

        if len(Quota) < len(Weight):  # if Quota has fewer items, extend with NaN
            lstnans = [np.nan] * (len(Weight) - len(Quota))
            Quota.extend(lstnans)

        if len(price) < len(Quota):  # if the price column has fewer items than the
            val = [columns[3].get_text()] * (len(Quota) - len(price))  # quota column,
            price.extend(val)  # extend with whatever is in the price column

        #if len(DateList) > len(Quota):  # if DateList is longer than Quota,
        #    print("it's longer than")
        #    value = [columns[4].get_text()] * (len(DateList) - len(Quota))
        #    DateList = value * Nrows

        if len(Quota) < len(DateList):  # if Quota is shorter than DateList (due to gap),
            stu = [np.nan] * (len(DateList) - len(Quota))  # extend with NaN
            Quota.extend(stu)

        if len(Weight) < len(DateList):
            dru = [np.nan] * (len(DateList) - len(Weight))
            Weight.extend(dru)

        FinalDataframe = pd.DataFrame({
            'ID': IDList,
            'AvailableQuota': Quota,
            'LiveWeightPounds': Weight,
            'price': price,
            'DatePosted': DateList
        })

        df_Quota = df_Quota.append(FinalDataframe, ignore_index=True)
        #df_Quota = df_Quota.loc[df_Quota['DatePosted']=='5/20']
        df_Q = df_Quota['DatePosted'].iloc[0]
        df_Quota = df_Quota[df_Quota['DatePosted'] == df_Q]
print(df_Quota)

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            pattern = re.compile(r'Sent:.*?\b(\d{4})\b')
            email = f.read()
            dates = pattern.findall(email)
            if dates:
                print("Date:", ''.join(dates))

#cursor = con.cursor() 
#exported_data = [tuple(x) for x in df_Quota.values] 
#sql_query = ("INSERT INTO ROUGHTABLE(species, date_posted, stock_id, pounds, money, sector_name, ask)" "VALUES (:1, :2, :3, :4, :5, 'NEFS 2', '1')") 
#cursor.executemany(sql_query, exported_data) 
#con.commit() 

#cursor.close() 
#con.close() 

Just use `try` and `catch` to catch the `IndexError` and ignore it. –


@Sarathsp ...... if only there were a `catch`. – tdelaney


You can use an exception handler, or check the size of things before you try to index into them. This is a lot of code with no hint of where the error actually occurs. If you could boil it down to a simple example, it would help with a solution. – tdelaney
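The size-check alternative from the comment above can be sketched like this, with a hypothetical five-column table standing in for the email HTML (the question's code indexes `columns[0]` through `columns[4]`):

```python
from bs4 import BeautifulSoup

# Hypothetical 5-column table with a blank row in the middle.
html = """
<table>
  <tr><td>A1</td><td>B1</td><td>C1</td><td>D1</td><td>E1</td></tr>
  <tr></tr>
  <tr><td>A2</td><td>B2</td><td>C2</td><td>D2</td><td>E2</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
kept = []
for row in table.find_all("tr"):
    columns = row.find_all("td")
    if len(columns) < 5:   # blank or short row: nothing to parse
        continue
    kept.append([td.get_text() for td in columns])

print(len(kept))  # 2
```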

Answers


`continue` is the keyword to use to skip the empty/problem rows. The IndexError comes from trying to access `columns[0]` on an empty column list. So when the exception occurs, just skip to the next row:

for row in table.find_all('tr'):
    columns = row.find_all('td')
    try:
        if columns[0].get_text().strip() != 'ID':
            # Rest as above in original code.
            ...
    except IndexError:
        continue

So that sounds like it should work, but it doesn't... it produces the same error on the same line (`price = celltext(columns[3])`). – theprowler


That means we have to handle empty rows as well as rows with fewer columns (possibly fewer than 4). The obvious solution is still try..except: wrap the rest of the row handling in the try as well, with the **continue** in the `except IndexError` block. –
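In other words, the try needs to cover every line that indexes `columns`, not just the header check, so that a short row fails inside the try instead of crashing the program. A sketch, with plain lists standing in for the `row.find_all('td')` results:

```python
# Simulated rows: the second row is short, so columns[3] is the lookup
# that blows up, just like `price = celltext(columns[3])` does.
rows = [['1', 'q', 'w', 'p', '5/20'], ['2'], ['3', 'q', 'w', 'p', '5/20']]

parsed = []
for columns in rows:
    try:
        record = (columns[0], columns[3])   # all indexing inside the try
    except IndexError:
        continue                            # short/blank row: skip it
    parsed.append(record)

print(parsed)  # [('1', 'p'), ('3', 'p')]
```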


Use `try: ... except: ...`:

try:
    # extract data from table
    ...
except IndexError:
    # execute rest of program
    ...
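Since the question notes that the wanted rows always come before the first blank row, another option (not from the answers; it uses `itertools.takewhile`) is to stop at the first blank row instead of skipping past it:

```python
from itertools import takewhile
from bs4 import BeautifulSoup

# Hypothetical table where the wanted rows all come before the blank one.
html = """
<table>
  <tr><td>1</td><td>a</td></tr>
  <tr><td>2</td><td>b</td></tr>
  <tr></tr>
  <tr><td>3</td><td>c</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
# Keep rows only until the first one with no <td> cells.
rows = takewhile(lambda r: r.find_all("td"), table.find_all("tr"))
data = [[td.get_text() for td in r.find_all("td")] for r in rows]

print(data)  # [['1', 'a'], ['2', 'b']]
```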