2016-01-21

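Python & BS4: exclude blank and Grand Total rows

Below is my code, which extracts data from an HTML document and puts it into variables. I need to exclude the blank rows as well as the "Grand Total" row. I've added the HTML input for those rows below the code. I can't figure out how to make this work. I can't use len() because the length varies. Any help?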

from bs4 import BeautifulSoup 
import urllib 
import re 
import HTMLParser 
html = urllib.urlopen('RanpakAllocations.html').read() 
parser = HTMLParser.HTMLParser() 
#unescape doesn't seem to work 
output = parser.unescape(html) 

soup1 = BeautifulSoup(output, "html.parser") 
Customer_No = [] 
Serial_No = [] 
data = [] 
#for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): 
rows = soup1.find_all("tr") 
title = rows[0] 
headers = rows[1] 
datarows = rows[2:] 

fields = [] 

try:
    for row in datarows:
        find_data = row.find_all(attrs={'face': 'Arial,Helvetica,sans-serif'})
        count = 0
        for hit in find_data:
            data = hit.text
            count = count + 1
            if count == 3:
                CSNO = data
            if count == 9:
                ITNO = data
            else:
                continue

        print CSNO, ITNO
        print "new row"
except:
    pass

Here is the input. The first <tr> is my last row of data, but my loop also runs over the blank row and the Grand Total row below it.

<tr> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">12</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">F5684</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20182</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">VELOCITY SOLUTIONS INC.</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77717</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT 2</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">W/UNIVERSAL STAND S/N 51345563</font></td> 
     <td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">51345563</font></td> 
     </tr> 
     <tr> 
     <td nowrap="nowrap" align="left"><font size="1">&nbsp;</font></td> 
     <td nowrap="nowrap" align="left"><font size="1">&nbsp;</font></td> 
     <td nowrap="nowrap" align="left"><font size="1">&nbsp;</font></td> 
     <td nowrap="nowrap" align="left"><font size="1">&nbsp;</font></td> 
     <td align="left" colspan="5"><font size="1">&nbsp;</font></td> 
     </tr> 
     <tr> 
     <td align="left"><font size="3" face="Arial,Helvetica,sans-serif">&nbsp;</font></td> 
     <td align="left"><font size="3" face="Arial,Helvetica,sans-serif">Grand Total</font></td> 
     <td align="left" colspan="7"><font size="1">&nbsp;</font></td> 
     </tr> 
     <tr> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     </tr> 

I can't see the HTML, @AlliDeacon? – gtlambert


I've added the HTML, @lambo477. Thanks for any guidance. – AlliDeacon


OK, I've added `if len(find_data) > 0:`, which eliminates the blank rows, but I still have the Grand Total row to work around. I'll try putting a range on `datarows`. – AlliDeacon

Answer


I would do something like this:

from bs4 import BeautifulSoup 

content = ''' 
<root> 
    <tr> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">12</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">F5684</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20182</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">VELOCITY SOLUTIONS INC.</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77717</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT 2</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">W/UNIVERSAL STAND S/N 51345563</font></td> 
     <td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td> 
     <td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">51345563</font></td> 
    </tr> 
    <tr> 
     <td nowrap="nowrap" align="left"><font size="1">&nbsp;</font></td> 
     <td nowrap="nowrap" align="left"><font size="1">&nbsp;</font></td> 
     <td nowrap="nowrap" align="left"><font size="1">&nbsp;</font></td> 
     <td nowrap="nowrap" align="left"><font size="1">&nbsp;</font></td> 
     <td align="left" colspan="5"><font size="1">&nbsp;</font></td> 
    </tr> 
    <tr> 
     <td align="left"><font size="3" face="Arial,Helvetica,sans-serif">&nbsp;</font></td> 
     <td align="left"><font size="3" face="Arial,Helvetica,sans-serif">Grand Total</font></td> 
     <td align="left" colspan="7"><font size="1">&nbsp;</font></td> 
    </tr> 
    <tr> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
    </tr> 
</root>''' 

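# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(content, 'html.parser')

answer = []
rows = soup.find_all('tr')

for row in rows:
    # Skip rows that contain only whitespace (&nbsp; included)
    if not row.text.strip():
        continue

    row_text = []
    for cell in row.find_all('td'):
        # Keep only cells that have visible text
        if cell.text.strip():
            row_text.append(cell.text)

    answer.append(row_text)

print(answer)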

Output

[[u'12', u'F5684', u'20182', u'VELOCITY SOLUTIONS INC.', u'EQPRAN77717', u'RANPAK FILLPAK TT 2', u'W/UNIVERSAL STAND S/N 51345563', u'1', u'51345563'], [u'Grand Total']] 

You can skip entirely blank rows with `if not row.text.strip(): continue` (for a blank row, `row.text.strip()` returns an empty string, which evaluates to False).

For the rows you do iterate over, you can check whether each cell is empty with `if cell.text.strip()` before saving its text.
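Note that the output above still contains a `[u'Grand Total']` entry, which the question also wanted to drop. One possible way to filter it out is sketched below. The names `extract_data_rows`, `skip_label` and `SAMPLE` are made up for illustration, the HTML is a trimmed stand-in for the real report, and the check assumes the label appears as the exact text of one cell.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the report: one data row, a blank row,
# the "Grand Total" row and a trailing blank row (hypothetical sample).
SAMPLE = """
<table>
<tr><td>12</td><td>F5684</td><td>20182</td><td>VELOCITY SOLUTIONS INC.</td>
    <td>EQPRAN77717</td><td>RANPAK FILLPAK TT 2</td>
    <td>W/UNIVERSAL STAND S/N 51345563</td><td>1</td><td>51345563</td></tr>
<tr><td>&nbsp;</td><td>&nbsp;</td><td colspan="7">&nbsp;</td></tr>
<tr><td>&nbsp;</td><td>Grand Total</td><td colspan="7">&nbsp;</td></tr>
<tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
</table>
"""


def extract_data_rows(html, skip_label="Grand Total"):
    """Return the cell texts of every row that is neither blank nor a total row."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for row in soup.find_all("tr"):
        # get_text(strip=True) turns whitespace-only cells (&nbsp;) into ''
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if not any(cells):
            continue  # blank row: no cell has visible text
        if skip_label in cells:
            continue  # total row: one cell is exactly the label
        result.append(cells)
    return result


rows = extract_data_rows(SAMPLE)
for cells in rows:
    # Columns 3 and 9 are the two values the question prints (CSNO, ITNO)
    print(cells[2], cells[8])
```

Unlike the loop above, this keeps empty cells in place, so the column positions stay stable and `cells[2]` and `cells[8]` line up with the third and ninth columns the question reads.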


This is great! I ended up using `len()` and `datarows`: I updated the slice to `[2:(-3)]`, and that seems to do what I need. I'll keep this trick in my pocket! – AlliDeacon
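For reference, the slice mentioned in this comment could look roughly like the sketch below. The list of strings is a hypothetical stand-in for the `<tr>` elements returned by `find_all("tr")`, and it assumes the layout never changes: exactly two leading rows (title and headers) and three trailing rows (blank, Grand Total, blank).

```python
# Hypothetical stand-ins for the <tr> elements, in document order
rows = ["title", "headers", "data 1", "data 2", "blank", "Grand Total", "blank"]

# Drop the first two rows and the last three; a negative stop counts from the end
datarows = rows[2:-3]

print(datarows)  # ['data 1', 'data 2']
```

The slice is short, but it silently drops real data if the report ever ends with a different number of trailing rows, which is why filtering on the row content may be the safer choice.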


Good stuff - keep practising :) – gtlambert