Create a delimited text file from an HTML table using BeautifulSoup

I am trying to create a delimited text file containing the data from the "Actions" table on web pages like this one: http://stats.swehockey.se/Game/Events/300978

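I would like each line to include the game # (taken from the end of the URL), followed by the text of one row of the table. For example: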

300972 | 60:00 | GK Out | OHK | 33. Hudacek, Julius

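I haven't been able to get the rows to actually separate. I have tried parsing through each row and column, using a list of stripped strings, and searching by different tags, classes and styles.

Here is what I currently have: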

from bs4 import BeautifulSoup
import urllib.request

def createtext():
    gamestr = urlstr + "|"
    #Find all table lines. Create one pipe-delimited line for each.
    aptext = gamestr
    for el in soup.find_all('tr'):
        playrow = el.find_all('td', 'tdOdd')
        for td in playrow:
            if(td.find(text=True)) not in ("", None, "\n"):
                aptext = aptext + ''.join(td.text) + "|"
        aptext = aptext + "\n" + gamestr

    #Creates file with Game # as filename and writes the data to the file
    currentfile = urlstr + ".txt"
    with open(currentfile, "w") as f:
        f.write(str(aptext))

#Grabs the HTML file and creates the soup
urlno = 300978
urlstr = str(urlno)
url = ("http://stats.swehockey.se/Game/Events/" + urlstr)
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
pbpdoc = response.read().decode('utf-8')
soup = BeautifulSoup(pbpdoc)
createtext()

Thanks for any help or guidance!

Source

2017-05-27 Not Dave

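First of all, you don't have to construct the CSV data manually; Python provides a built-in csv module for that. Then, since you only need the "actions", I would identify the "Actions" table and find only the rows that hold events. This can be done with a filter function that checks that the first cell of a row is not empty: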

import csv

from bs4 import BeautifulSoup
import requests


def only_action_rows(tag):
    # Keep only <tr> rows whose first "tdOdd" cell contains non-empty text.
    if tag.name == 'tr':
        first_cell = tag.find('td', class_='tdOdd')
        return first_cell and first_cell.get_text(strip=True)


event_id = 300978
url = "http://stats.swehockey.se/Game/Events/{event_id}".format(event_id=event_id)
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

# Locate the "Actions" heading, then the table that contains it.
actions_table = soup.find("h2", text="Actions").find_parent("table")
data = [[event_id] + [td.get_text(strip=True) for td in row.find_all('td', class_='tdOdd')]
        for row in actions_table.find_all(only_action_rows)]

# newline="" keeps the csv module from writing blank lines between rows on Windows.
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(data)

Note that I am using requests here.
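The question asks for pipe-delimited lines rather than comma-separated ones. As a minimal sketch of how the csv module could produce that format, the snippet below passes `delimiter="|"` to `csv.writer`. The helper name `to_pipe_delimited` and the sample rows are hypothetical: the first row reuses the example values from the question with the game number from the URL, and the second is made up for illustration.

```python
import csv
import io

# Hypothetical rows: the game number followed by the cells of one table row.
rows = [
    [300978, "60:00", "GK Out", "OHK", "33. Hudacek, Julius"],
    [300978, "00:00", "GK In", "OHK", "1. Placeholder, Player"],  # made-up values
]


def to_pipe_delimited(rows):
    """Render rows as pipe-delimited text, one line per row."""
    buffer = io.StringIO()
    # delimiter="|" replaces the default comma; lineterminator="\n" avoids
    # the csv module's default "\r\n" line endings.
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


text = to_pipe_delimited(rows)
print(text)
```

To write straight to a file instead, the same `delimiter="|"` argument can be given to the `csv.writer` that wraps the opened file.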

Source

2017-05-27 01:36:37 alecxe

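Thanks for the reply and the code! When I try to use it, I get this error: 'NoneType' object has no attribute 'find_parent'

@NotDave OK, it looks like you have an older bs4 version; try using 'text' instead of 'string'. – alecxe

That did the trick! Thanks a lot!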
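The error in the comments appears when `find` returns `None` and `.find_parent` is then called on that result. One possible way to guard against it, sketched below, is to match the heading with a function (which does not rely on the `text` or `string` keyword) and to check for `None` before going further. The helper name `find_actions_table` and the sample markup are assumptions for illustration; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tr><th><h2>Actions</h2></th></tr>
  <tr><td class="tdOdd">60:00</td><td class="tdOdd">GK Out</td></tr>
</table>
"""


def is_actions_heading(tag):
    # True for an <h2> whose stripped text is exactly "Actions".
    return tag.name == "h2" and tag.get_text(strip=True) == "Actions"


def find_actions_table(soup):
    """Return the table containing the "Actions" heading, or None if absent."""
    heading = soup.find(is_actions_heading)
    if heading is None:
        # Heading not found: report that instead of raising AttributeError.
        return None
    return heading.find_parent("table")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_actions_table(soup)
cells = [td.get_text(strip=True) for td in table.find_all("td", class_="tdOdd")]
print(cells)

# A page without the heading yields None rather than an exception.
missing = find_actions_table(BeautifulSoup("<p>no table here</p>", "html.parser"))
print(missing)
```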
