Parsing a table using BeautifulSoup in Python

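I want to iterate over each row and capture the value of td.text. The problem is that the table has no class, and all the td elements share the same class name. I want to loop over every row and get the following output:

1st row) "AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" (new line)

2nd row) "AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C" (new line)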

<div style="overflow:auto; border:1px #cccccc solid;"> 
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%"> 
    <tbody> 
     <tr class="tblHeading"> 
      <td colspan="7">AMERICANS SOCCER CLUB</td> 
     </tr> 
     <tr bgcolor="#CCE4F1"> 
      <td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td> 
     </tr> 
     <tr bgcolor="#FFFFFF"> 
      <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Cameron Coya          </td> 
      <td width="19%" class="tdUnderLine"> 
       Rozel, Max 
      </td> 
      <td width="06%" class="tdUnderLine"> 
      09-11-2016 
      </td> 
      <td width="05%" class="tdUnderLine" align="center">   
       <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>  
      </td> 
      <td width="16%" class="tdUnderLine" align="center"> 
       09/10/16 02:15 PM 
      </td> 
      <td width="30%" class="tdUnderLine">    player persistently infringes the laws of the game </td> 
      <td class="tdUnderLine">    Cautioned </td> 
     </tr> 
     <tr class="tblHeading"> 
      <td colspan="7">AVIATORS SOCCER CLUB</td> 
     </tr> 
     <tr bgcolor="#CCE4F1"> 
      <td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td> 
     </tr> 
     <tr bgcolor="#FBFBFB"> 
      <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Saskia Reyes          </td> 
      <td width="19%" class="tdUnderLine"> 
       HollaenderNardelli, Eric 
      </td> 
      <td width="06%" class="tdUnderLine"> 
      09-11-2016 
      </td> 
      <td width="05%" class="tdUnderLine" align="center">   

       <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>  
      </td> 
      <td width="16%" class="tdUnderLine" align="center"> 
       09/11/16 06:45 PM 
      </td> 
      <td width="30%" class="tdUnderLine">    player/sub guilty of unsporting behavior  </td> 
      <td class="tdUnderLine">    Cautioned </td> 
     </tr> 
     <tr class="tblHeading"> 
      <td colspan="7">BERGENFIELD SOCCER CLUB</td> 
     </tr> 
     <tr bgcolor="#CCE4F1"> 
      <td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td> 
     </tr> 
     <tr bgcolor="#FFFFFF"> 
      <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Christian Latorre         </td> 
      <td width="19%" class="tdUnderLine"> 
       Coyle, Kevin 
      </td> 
      <td width="06%" class="tdUnderLine"> 
      09-10-2016 
      </td> 
      <td width="05%" class="tdUnderLine" align="center">   

       <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>  
      </td> 
      <td width="16%" class="tdUnderLine" align="center"> 

       09/10/16 11:00 AM 

      </td> 
      <td width="30%" class="tdUnderLine">    player persistently infringes the laws of the game </td> 
      <td class="tdUnderLine">    Cautioned </td> 
     </tr>

I tried the following code.

import requests 
from bs4 import BeautifulSoup 
import re 
try: 
    import urllib.request as urllib2 
except ImportError: 
    import urllib2 

url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html" 
page = open(url, encoding="utf8") 
soup = BeautifulSoup(page.read(),"html.parser") 

#tableList = soup.findAll("table") 

for tr in soup.find_all("tr"): 
    for td in tr.find_all("td"): 
        print(td.text.strip())

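But obviously it returns the text of every td, so I cannot tell which column a value belongs to or where a new record starts. I would like to know:

1) how to identify each column (because the class names are the same, and there are headings as well); I would appreciate it if you could provide code

2) how to identify a new record in this kind of structure

Source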

2016-09-14 Bhavesh Ghodasara

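Can you give an example of the output format you need? – Sandeep

Please check the question; it is given there as the first and second rows. It is only a sample; I will need 100 such rows. But basically I need all fields comma-separated and enclosed in double quotes. –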

from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup

with open("/tmp/a.html") as page:
    soup = BeautifulSoup(page.read(), "html.parser")

# The table has no class or id, so locate it through its wrapping div.
table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')

trs = table.find_all('tr')

game = ""
section = ""

for tr in trs:
    if tr.has_attr('class'):
        # Row with a class attribute (class="tblHeading"): the club name.
        game = tr.text.strip()
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            # Row with this background colour: the team name.
            section = tr.text.strip()
        else:
            # Any other coloured row is a data row, i.e. one record.
            tds = tr.find_all('td')
            # Drop non-ASCII characters such as the &nbsp; padding.
            extracted_text = [re.sub(r'([^\x00-\x7F])+', '', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            # Discard cells of two characters or fewer.
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            # The indices below depend on which cells survive the filter.
            extracted_text.pop(1)
            extracted_text[2] = "Player " + extracted_text[2]
            extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            # Prepend club and team, then wrap every field in double quotes.
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))

And when run:

$ python a.py 

"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" 
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C" 
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"

Following further conversation with the OP, the input is https://paste.fedoraproject.org/428111/87928814/raw/, and the output after running the code above is: https://paste.fedoraproject.org/428110/38792211/raw/
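For the sample HTML exactly as shown in the question, picking cells by position rather than filtering them by length may be easier to follow. The following is only a sketch under assumptions: the names in `COLUMNS`, the helper `parse_records`, and the trimmed `SAMPLE` markup are illustrative, and reducing "Cautioned" to its first letter is a guess made to match the requested output. It treats a `tblHeading` row as the start of a new club, a single-cell row as the team, and a seven-cell row as one record.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical column names; the page itself has no header row.
COLUMNS = ["club", "team", "player", "game", "date", "reason", "action"]

# Trimmed copy of the first record from the question, for illustration.
SAMPLE = """
<table><tbody>
 <tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
 <tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
 <tr bgcolor="#FFFFFF">
  <td class="tdUnderLine"> &nbsp;&nbsp; Cameron Coya </td>
  <td class="tdUnderLine"> Rozel, Max </td>
  <td class="tdUnderLine"> 09-11-2016 </td>
  <td class="tdUnderLine"><a href="#">228004</a></td>
  <td class="tdUnderLine"> 09/10/16 02:15 PM </td>
  <td class="tdUnderLine"> player persistently infringes the laws of the game </td>
  <td class="tdUnderLine"> Cautioned </td>
 </tr>
</tbody></table>
"""


def parse_records(html):
    """Yield one dict per data row, carrying the club and team rows forward."""
    soup = BeautifulSoup(html, "html.parser")
    club = team = ""
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if "tblHeading" in tr.get("class", []):
            club = cells[0]        # a heading row starts a new club
        elif len(cells) == 1:
            team = cells[0]        # a single-cell row names the team
        elif len(cells) == 7:      # a seven-cell row is one record
            played = datetime.strptime(cells[4], "%m/%d/%y %I:%M %p")
            values = [club, team, cells[0], "Player " + cells[3],
                      played.strftime("%Y-%m-%d"), cells[5], cells[6][:1]]
            yield dict(zip(COLUMNS, values))


records = list(parse_records(SAMPLE))
for record in records:
    print(",".join('"' + record[name] + '"' for name in COLUMNS))
```

Keying each value by a column name addresses the first part of the question, and the row type (heading, team, or data) marks where a new record begins.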

Source

2016-09-14 06:37:41

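`trs in chunks(table.find_all('tr'), 3):` How did you determine the 3 here? Is it based on the number of records? The number of records here is dynamic. Is there a way to find the number of such rows in the page? –

@BhaveshGhodasara Going by the sample the OP gave, the records have a specific format that keeps repeating. –

It is not what I wanted. How do I save the output to a file? I tried `saveFile.write(','.join(extracted_text))`, but it gives all the values on a single line, with no line breaks. :( –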
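Unlike `print()`, `write()` does not append a line break, which is the likely reason everything ended up on one line; writing `','.join(extracted_text) + '\n'` would be the smallest fix. As an alternative sketch, the standard `csv` module can add both the quotes and the line endings. The `rows` data, the helper `write_rows`, and the file name `output.csv` mentioned in the comment are placeholders, not part of the original answer.

```python
import csv
import io

# Hypothetical rows standing in for the values collected per record.
rows = [
    ["AMERICANS SOCCER CLUB", "Cameron Coya", "C"],
    ["AVIATORS SOCCER CLUB", "Saskia Reyes", "C"],
]


def write_rows(handle, rows):
    """Write each row on its own line, every field wrapped in double quotes."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)


# StringIO stands in for a real file such as open("output.csv", "w", newline="").
buffer = io.StringIO()
write_rows(buffer, rows)
output = buffer.getvalue()
print(output, end="")
```

`csv.QUOTE_ALL` quotes every field, and it also escapes any double quote that happens to appear inside a value, which manual string concatenation would not do.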

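If the data is structured like a table, there is a good chance you can read it straight into pandas with pd.read_table() (or, for an HTML <table>, with pd.read_html(), since pd.read_table() parses delimited text). Note that it accepts a URL in the filepath_or_buffer parameter. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html

Source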

2016-09-14 05:02:18

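There seems to be a pattern: after every 7 tr(s) there is a new line. So what you can do is keep a counter, and when it reaches 7, append a new line and reset the counter to 0.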

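# Assumes `soup` is the BeautifulSoup object built in the question.
# 7 is the interval stated above; in the sample HTML in the question a
# record spans 3 <tr> rows, so adjust this to match the actual page.
ROWS_PER_RECORD = 7

counter = 0
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())  # place per-cell code here
    counter += 1
    if counter == ROWS_PER_RECORD:
        print()  # blank line marks the start of a new record
        counter = 0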

Source

2016-09-14 06:26:40

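# Assumes `soup` is the BeautifulSoup object built in the question.
count = 0
string = ""
for td in soup.find_all("td"):
    string += '"' + td.text.strip() + '",'
    count += 1
    if count % 9 == 0:  # 9 <td> cells per record: club, team, and 7 data cells
        print(string[:-1] + "\n\n")  # string[:-1] removes the trailing ","
        string = ""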

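Because the table is not well structured, we only need to work with the td elements instead of going into each row and then into each row's td, which complicates the job. I just used a string here; you could append the data to a list instead and process it later.
Hope this solves your problem.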
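The list-based variant mentioned above could look roughly like this. It is a sketch, not the answerer's code: `CELLS_PER_RECORD`, the `records` name, and the trimmed `SAMPLE` markup are illustrative, and it assumes every record has exactly nine cells (one club cell, one team cell, and seven data cells).

```python
from bs4 import BeautifulSoup

CELLS_PER_RECORD = 9  # club + team + 7 data cells, as counted in the answer

# Trimmed copy of the second record from the question, for illustration.
SAMPLE = (
    '<table>'
    '<tr class="tblHeading"><td colspan="7">AVIATORS SOCCER CLUB</td></tr>'
    '<tr><td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td></tr>'
    '<tr><td>Saskia Reyes</td><td>HollaenderNardelli, Eric</td>'
    '<td>09-11-2016</td><td><a href="#">224463</a></td>'
    '<td>09/11/16 06:45 PM</td>'
    '<td>player/sub guilty of unsporting behavior</td><td>Cautioned</td></tr>'
    '</table>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")

# Flatten every cell in document order, with surrounding whitespace removed.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]

# Slice the flat list into consecutive groups of CELLS_PER_RECORD.
records = [cells[i:i + CELLS_PER_RECORD]
           for i in range(0, len(cells), CELLS_PER_RECORD)]

for record in records:
    print(",".join('"' + value + '"' for value in record))
```

The fixed group size is the weak point of this approach: if a club lists more than one team, or a team more than one player, the cell count per record changes and the groups drift out of alignment.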

Source

2016-09-14 06:47:34 Sandeep
