這是我的情況:我的代碼從電子郵件中的HTML表中解析出數據。我遇到的障礙是這些表中的一些在桌子中間有空白的空行,如下圖所示。此空白區域導致我的代碼失敗(IndexError: list index out of range
),因爲它試圖從單元格中提取文本。如何繞過IndexError
是否可以對Python說:「好吧,如果遇到這些來自這些空白行的錯誤,就停在那裏,從目前爲止已經獲取文本的行並執行其餘代碼那些」...?
這聽起來像是一個愚蠢的解決方案,但我的項目涉及到我只從表中最近的日期獲取數據,總是在前幾行中,並且總是在這些空白空行之前。
因此,如果可以說「如果你遇到這個錯誤,就忽略它並繼續」,那麼我想學習如何做到這一點。如果不是那麼我就不得不找出解決這個問題的另一種方法。感謝任何和所有的幫助。
我的代碼:
from bs4 import BeautifulSoup, NavigableString, Tag
import pandas as pd
import numpy as np
import os
import re
import email
import cx_Oracle
dsnStr = cx_Oracle.makedsn("sole.nefsc.noaa.gov", "1526", "sole")
con = cx_Oracle.connect(user="user", password="password", dsn=dsnStr)
def celltext(cell):
'''
textlist=[]
for br in cell.findAll('br'):
next = br.nextSibling
if not (next and isinstance(next,NavigableString)):
continue
next2 = next.nextSibling
if next2 and isinstance(next2,Tag) and next2.name == 'br':
text = str(next).strip()
if text:
textlist.append(next)
return (textlist)
'''
textlist=[]
y = cell.find('span')
for a in y.childGenerator():
if isinstance(a, NavigableString):
textlist.append(str(a))
return (textlist)
path = 'Z:\\blub_2'
for filename in os.listdir(path):
file_path = os.path.join(path, filename)
if os.path.isfile(file_path):
html=open(file_path,'r').read()
soup = BeautifulSoup(html, 'lxml') # Parse the HTML as a string
table = soup.find_all('table')[1] # Grab the second table
df_Quota = pd.DataFrame()
for row in table.find_all('tr'):
columns = row.find_all('td')
if columns[0].get_text().strip()!='ID': # skip header
Quota = celltext(columns[1])
Weight = celltext(columns[2])
price = celltext(columns[3])
print(Quota)
Nrows= max([len(Quota),len(Weight),len(price)]) #get the max number of rows
IDList = [columns[0].get_text()] * Nrows
DateList = [columns[4].get_text()] * Nrows
if price[0].strip()=='Package':
price = [columns[3].get_text()] * Nrows
if len(Quota)<len(Weight):#if Quota has less itmes extend with NaN
lstnans= [np.nan]*(len(Weight)-len(Quota))
Quota.extend(lstnans)
if len(price) < len(Quota): #if price column has less items than quota column,
val = [columns[3].get_text()] * (len(Quota)-len(price)) #extend with
price.extend(val) #whatever is in
#price column
#if len(DateList) > len(Quota): #if DateList is longer than Quota,
#print("it's longer than")
#value = [columns[4].get_text()] * (len(DateList)-len(Quota))
#DateList = value * Nrows
if len(Quota) < len(DateList): #if Quota is less than DateList (due to gap),
stu = [np.nan]*(len(DateList)-len(Quota)) #extend with NaN
Quota.extend(stu)
if len(Weight) < len(DateList):
dru = [np.nan]*(len(DateList)-len(Weight))
Weight.extend(dru)
FinalDataframe = pd.DataFrame(
{
'ID':IDList,
'AvailableQuota': Quota,
'LiveWeightPounds': Weight,
'price':price,
'DatePosted':DateList
})
df_Quota = df_Quota.append(FinalDataframe, ignore_index=True)
#df_Quota = df_Quota.loc[df_Quota['DatePosted']=='5/20']
df_Q = df_Quota['DatePosted'].iloc[0]
df_Quota = df_Quota[df_Quota['DatePosted'] == df_Q]
print (df_Quota)
for filename in os.listdir(path):
file_path = os.path.join(path, filename)
if os.path.isfile(file_path):
with open(file_path, 'r') as f:
pattern = re.compile(r'Sent:.*?\b(\d{4})\b')
email = f.read()
dates = pattern.findall(email)
if dates:
print("Date:", ''.join(dates))
#cursor = con.cursor()
#exported_data = [tuple(x) for x in df_Quota.values]
#sql_query = ("INSERT INTO ROUGHTABLE(species, date_posted, stock_id, pounds, money, sector_name, ask)" "VALUES (:1, :2, :3, :4, :5, 'NEFS 2', '1')")
#cursor.executemany(sql_query, exported_data)
#con.commit()
#cursor.close()
#con.close()
只是使用''try'和catch'追趕上'IndexError'忽略 –
@Sarathsp ......如果有一個'catch'。 – tdelaney
在嘗試爲索引建立索引之前,可以使用異常處理程序或檢查事物的大小。這是很多代碼,並沒有暗示這個錯誤實際發生的地方。如果你可以將其歸結爲一個簡單的例子,它將有助於解決方案。 – tdelaney