2016-09-14
1

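Parsing a table with BeautifulSoup in Python

I want to iterate over each row and capture the value of td.text. The problem is that the table has no class, and all the td elements share the same class name. I want to loop through every row and get the following output: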

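Row 1) "AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" (new line)

Row 2) "AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C" (new line)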

<div style="overflow:auto; border:1px #cccccc solid;"> 
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%"> 
    <tbody> 
     <tr class="tblHeading"> 
      <td colspan="7">AMERICANS SOCCER CLUB</td> 
     </tr> 
     <tr bgcolor="#CCE4F1"> 
      <td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td> 
     </tr> 
     <tr bgcolor="#FFFFFF"> 
      <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Cameron Coya          </td> 
      <td width="19%" class="tdUnderLine"> 
       Rozel, Max 
      </td> 
      <td width="06%" class="tdUnderLine"> 
      09-11-2016 
      </td> 
      <td width="05%" class="tdUnderLine" align="center">   
       <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>  
      </td> 
      <td width="16%" class="tdUnderLine" align="center"> 
       09/10/16 02:15 PM 
      </td> 
      <td width="30%" class="tdUnderLine">    player persistently infringes the laws of the game </td> 
      <td class="tdUnderLine">    Cautioned </td> 
     </tr> 
     <tr class="tblHeading"> 
      <td colspan="7">AVIATORS SOCCER CLUB</td> 
     </tr> 
     <tr bgcolor="#CCE4F1"> 
      <td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td> 
     </tr> 
     <tr bgcolor="#FBFBFB"> 
      <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Saskia Reyes          </td> 
      <td width="19%" class="tdUnderLine"> 
       HollaenderNardelli, Eric 
      </td> 
      <td width="06%" class="tdUnderLine"> 
      09-11-2016 
      </td> 
      <td width="05%" class="tdUnderLine" align="center">   

       <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>  
      </td> 
      <td width="16%" class="tdUnderLine" align="center"> 
       09/11/16 06:45 PM 
      </td> 
      <td width="30%" class="tdUnderLine">    player/sub guilty of unsporting behavior  </td> 
      <td class="tdUnderLine">    Cautioned </td> 
     </tr> 
     <tr class="tblHeading"> 
      <td colspan="7">BERGENFIELD SOCCER CLUB</td> 
     </tr> 
     <tr bgcolor="#CCE4F1"> 
      <td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td> 
     </tr> 
     <tr bgcolor="#FFFFFF"> 
      <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Christian Latorre         </td> 
      <td width="19%" class="tdUnderLine"> 
       Coyle, Kevin 
      </td> 
      <td width="06%" class="tdUnderLine"> 
      09-10-2016 
      </td> 
      <td width="05%" class="tdUnderLine" align="center">   

       <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>  
      </td> 
      <td width="16%" class="tdUnderLine" align="center"> 

       09/10/16 11:00 AM 

      </td> 
      <td width="30%" class="tdUnderLine">    player persistently infringes the laws of the game </td> 
      <td class="tdUnderLine">    Cautioned </td> 
     </tr> 

I tried the following code.

import requests 
from bs4 import BeautifulSoup 
import re 
try: 
    import urllib.request as urllib2 
except ImportError: 
    import urllib2 

url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html" 
page = open(url, encoding="utf8") 
soup = BeautifulSoup(page.read(),"html.parser") 

#tableList = soup.findAll("table") 

for tr in soup.find_all("tr"): 
    for td in tr.find_all("td"): 
     print(td.text.strip()) 

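But obviously this returns the text of every td, so I cannot tell which column a value belongs to or where a new record starts. I would like to know:

1) How to identify each column (since the class names are all the same) and give each one a heading as well (I would appreciate it if you could provide code).

2) How to identify the start of a new record in a structure like this.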


Can you give an example of the output format you need? – Sandeep


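Please check the question; it is given there as the first and second rows. That is only a sample, and I will need 100 such rows. Basically I need all fields comma-separated and enclosed in double quotes. –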

Answers

0
from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup

with open("/tmp/a.html") as page:
    soup = BeautifulSoup(page.read(), "html.parser")

table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')

trs = table.find_all('tr')

game = ""
section = ""

for tr in trs:
    if tr.has_attr('class'):
        # Club heading row (class="tblHeading").
        game = tr.text.strip()
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            # Team row.
            section = tr.text.strip()
        else:
            # Data row: strip non-ASCII characters (the &nbsp; padding) and whitespace.
            tds = tr.find_all('td')
            extracted_text = [re.sub(r'([^\x00-\x7F])+', '', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            # In the sample the cells are now: player, referee, report date,
            # game id, game time, reason, action. Drop referee and report date.
            del extracted_text[1:3]
            extracted_text[1] = "Player " + extracted_text[1]
            extracted_text[2] = datetime.datetime.strptime(extracted_text[2], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text[4] = extracted_text[4][:1]  # "Cautioned" -> "C"
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))

When run:

$ python a.py 

"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" 
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C" 
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C" 

Following further conversation with the OP, the input is https://paste.fedoraproject.org/428111/87928814/raw/, and the output after running the code above is: https://paste.fedoraproject.org/428110/38792211/raw/
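For the first part of the question (telling the columns apart when every td has the same class), one option is to rely on cell position and zip the cell texts with a list of names. The sketch below is only an illustration: the column names are assumptions read off the sample, since the page itself has no header row, and `parse_records` and `SAMPLE` are made-up names.

```python
from bs4 import BeautifulSoup

# Assumed column names; the page has no header row, so these are guesses from the sample.
COLUMNS = ["player", "referee", "report_date", "game_id", "game_time", "reason", "action"]

# A cut-down copy of the markup from the question.
SAMPLE = """
<table>
 <tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
 <tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
 <tr bgcolor="#FFFFFF">
  <td class="tdUnderLine">&nbsp;&nbsp; Cameron Coya </td>
  <td class="tdUnderLine"> Rozel, Max </td>
  <td class="tdUnderLine"> 09-11-2016 </td>
  <td class="tdUnderLine"><a href="#">228004</a></td>
  <td class="tdUnderLine"> 09/10/16 02:15 PM </td>
  <td class="tdUnderLine"> player persistently infringes the laws of the game </td>
  <td class="tdUnderLine"> Cautioned </td>
 </tr>
</table>
"""


def parse_records(html):
    """Return one dict per data row, tagged with the club and team rows above it."""
    soup = BeautifulSoup(html, "html.parser")
    records, club, team = [], "", ""
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if "tblHeading" in tr.get("class", []):
            club = cells[0]          # club heading row
        elif len(cells) == 1:
            team = cells[0]          # single-cell team row
        elif len(cells) == len(COLUMNS):
            record = dict(zip(COLUMNS, cells))  # a 7-cell row starts a new record
            record.update(club=club, team=team)
            records.append(record)
    return records


records = parse_records(SAMPLE)
print(records[0]["player"], records[0]["game_id"])
```

Each 7-cell row becomes one record, which also covers the second part of the question: a new record starts wherever a row has the full set of data cells.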


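'trs in chunks(table.find_all('tr'),3):' How did you decide on 3 here? Is it based on the number of records? The number of records here is dynamic. Is there a way to find the number of such rows in the page? –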


@BhaveshGhodasara Based on the sample the OP gave, the records have a specific format that keeps repeating. –


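It is not what I want. How do I save the output to a file? I tried the following, 'saveFile.write(','.join(extracted_text))', but it puts all the values on a single line with no line breaks. :( –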
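On the file question in the comment above: `write()` does not add a line terminator, so consecutive records run together on one line. A minimal sketch using the standard `csv` module, which ends each record with a line terminator and can wrap every field in double quotes. The file name `cautions.csv` and the `rows` list are placeholders for illustration; in practice each row would be the list of unquoted fields for one record.

```python
import csv

# Placeholder records standing in for the fields extracted from each data row.
rows = [
    ["AMERICANS SOCCER CLUB", "B11EB - AMERICANS-B11EB-WARZALA", "Cameron Coya",
     "Player 228004", "2016-09-10",
     "player persistently infringes the laws of the game", "C"],
    ["AVIATORS SOCCER CLUB", "G12DB - AVIATORS-G12DB-REYNGOUDT", "Saskia Reyes",
     "Player 224463", "2016-09-11",
     "player/sub guilty of unsporting behavior", "C"],
]

# newline="" lets the csv module control line endings (Python 3).
with open("cautions.csv", "w", newline="") as save_file:
    writer = csv.writer(save_file, quoting=csv.QUOTE_ALL)
    for row in rows:
        writer.writerow(row)  # one record per line, every field in double quotes
```

If the fields are already quoted and joined by hand, appending `"\n"` to the string passed to `write()` should have the same effect.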

0

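There seems to be a pattern: after every 7 tr(s) there is a new line. So what you can do is keep a counter, and when it reaches 7, output a new line and reset the counter.

counter = 0
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        pass  # place code
    counter = counter + 1
    if counter == 7:
        print("\n")
        counter = 0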
1
count = 0
string = ""
for td in soup.find_all("td"):
    string += "\"" + td.text.strip() + "\","
    count += 1
    if count % 9 == 0:
        print(string[:-1] + "\n\n")  # string[:-1] removes the trailing ","
        string = ""

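Because the table is not well structured, we can simply iterate over the td elements directly instead of going into each row and then into each row's td, which would complicate things. I just use a string here; you could instead append the data to a list and process it later.
Hope this solves your problem.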
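The list-based variant mentioned above could look like the sketch below. It assumes, as in the sample, that every record spans exactly 9 td cells (1 club cell, 1 team cell, 7 data cells). The helper name `chunk_cells` is made up for illustration, and `cell_texts` stands in for `[td.text.strip() for td in soup.find_all("td")]`.

```python
def chunk_cells(cell_texts, size=9):
    """Split a flat list of cell texts into consecutive groups of `size` cells."""
    return [cell_texts[i:i + size] for i in range(0, len(cell_texts), size)]


# Stand-in for the stripped text of every td in the sample, in document order.
cell_texts = [
    "AMERICANS SOCCER CLUB", "B11EB - AMERICANS-B11EB-WARZALA", "Cameron Coya",
    "Rozel, Max", "09-11-2016", "228004", "09/10/16 02:15 PM",
    "player persistently infringes the laws of the game", "Cautioned",
    "AVIATORS SOCCER CLUB", "G12DB - AVIATORS-G12DB-REYNGOUDT", "Saskia Reyes",
    "HollaenderNardelli, Eric", "09-11-2016", "224463", "09/11/16 06:45 PM",
    "player/sub guilty of unsporting behavior", "Cautioned",
]

records = chunk_cells(cell_texts)
for record in records:
    # One record per line, every field in double quotes.
    print(",".join('"' + text + '"' for text in record))
```

If a club or team ever lists more than one player, the 9-cell count no longer holds and the last group may come out short, so checking `len(record)` before using a record is a reasonable safeguard.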