Extracting tables that contain a specific string from HTML with BeautifulSoup

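I'm trying to use BeautifulSoup and requests in Python to extract play-by-play data, but this code just returns an empty list [] for the variable `table`. I'm fairly new to these libraries, but I have used similar syntax for similar tasks on similar sites (i.e. play-by-play data from other college games). The text I'm interested in is contained in the tables that begin with 'Top of 1st Inning', 'Bottom of 2nd Inning', and so on. Please comment if anything is unclear and I'll clarify. Thanks!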

from bs4 import BeautifulSoup
import requests

header = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}

url = requests.get("http://www.belmontbruins.com/sports/m-basebl/2016-17/boxscores/20170407_c6td.xml?view=plays", headers=header).text

soup = BeautifulSoup(url, 'html.parser')

with open('test.txt', 'w+') as myfile:
    table = soup.find_all('table', text=['Top', 'Bottom'])
    print(table)
    for eachtable in table:
        rows = eachtable.find_all('tr')
        for tr in rows:
            cols = tr.find_all('td')
            for td in cols:
                myfile.write(td.text + '\n')


2017-06-14 rahlf23

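I'm not clear on what you are trying to do. Do you want to extract all the text under headings that start with 'Top of 1st', 'Bottom of 2nd', and so on? And am I right that there is only one table? –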

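If you inspect the HTML of the given site, there are several separate tables, each containing the play data (strings) for one half inning (i.e. Top of 1st, Bottom of 1st, etc.). Essentially, I want to narrow the tables I extract down to only those containing the keywords 'Top' and 'Bottom', and then print their text, e.g. 'MCFARLAND, Daniel flied out to left center'. To answer your question: there are multiple tables, and I want to extract the text from each of them. – rahlf23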

OK, hold on, I'm updating the code... :-) –

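When you search for Top|Bottom, the search returns text nodes in the HTML tree, not tables. If you inspect the page in a browser, you can see that the structure is: table > caption > h3 > "Top of ...". So once you have found a matching text node, you have to go up three levels with element.parent.parent.parent to reach the table that contains it.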
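To make the mechanism concrete, here is a minimal offline sketch. The markup in `SAMPLE_HTML` is made up and only mimics the nesting described above (it is not the real page), and it uses `string=`, the current name of the older `text=` argument. It also suggests why the query in the question came back empty: when a tag name is combined with a string filter, BeautifulSoup compares the filter against the tag's own `.string`, which is `None` for a table with several children.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup that only mimics the described nesting:
# table > caption > h3 > text node.
SAMPLE_HTML = """
<table>
  <caption><h3>Top of 1st Inning</h3></caption>
  <tr><td>Play A</td></tr>
</table>
<table>
  <caption><h3>Bottom of 1st Inning</h3></caption>
  <tr><td>Play B</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Filtering <table> tags by string checks each table's own .string,
# which is None when the tag has more than one child, so nothing matches.
no_match = soup.find_all("table", string=["Top", "Bottom"])

# Searching for strings alone returns the matching text nodes instead.
nodes = soup.find_all(string=re.compile("Top|Bottom"))

# Three .parent steps: text node -> h3 -> caption -> table.
tables = [node.parent.parent.parent for node in nodes]

print(no_match)
print([str(node) for node in nodes])
print([t.name for t in tables])
```

On the real page the number of `.parent` steps depends on the actual markup, so it is worth confirming the nesting in the browser's inspector first.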

Here is the complete code:

import re

import requests
from bs4 import BeautifulSoup

header = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}

# Note: despite its name, `url` holds the downloaded HTML text.
url = requests.get("http://www.belmontbruins.com/sports/m-basebl/2016-17/boxscores/20170407_c6td.xml?view=plays", headers=header).text

soup = BeautifulSoup(url, 'html.parser')

# Find every text node containing "Top" or "Bottom" (the half-inning headings).
# `string=` is the current name of the older `text=` argument.
elements = soup.find_all(string=re.compile('Top|Bottom'))
with open('test.txt', 'w+') as myfile:
    for element in elements:
        # Climb text node -> h3 -> caption -> table, then walk the table's rows.
        rows = element.parent.parent.parent.find_all('tr')
        for tr in rows:
            for td in tr.find_all('td'):
                myfile.write(td.text + '\n')
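A possible variation, which is not part of the answer above: `find_parent('table')` walks up to the nearest enclosing table, so the code does not depend on the heading sitting exactly three levels below it. The sketch below is only an illustration under assumptions. The helper name `half_inning_plays`, the stricter pattern `HALF_INNING`, and the sample markup are all hypothetical, and the second sample table deliberately nests its heading differently from the first.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, stricter pattern: the heading must start with "Top of" or "Bottom of".
HALF_INNING = re.compile(r"^\s*(Top|Bottom) of")


def half_inning_plays(html):
    """Return (heading, [cell texts]) for each table headed by a half inning."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for node in soup.find_all(string=HALF_INNING):
        # Nearest enclosing <table>, however deeply the heading is nested.
        table = node.find_parent("table")
        if table is None:
            continue  # matching text that is not inside a table
        cells = [td.get_text(strip=True) for td in table.find_all("td")]
        result.append((node.strip(), cells))
    return result


# Made-up markup; the second table has no <h3> inside its caption.
SAMPLE_HTML = """
<table>
  <caption><h3>Top of 1st Inning</h3></caption>
  <tr><td>Play A</td></tr>
  <tr><td>Play B</td></tr>
</table>
<table>
  <caption>Bottom of 1st Inning</caption>
  <tr><td>Play C</td></tr>
</table>
<table>
  <caption>Scoring Summary</caption>
  <tr><td>Ignored</td></tr>
</table>
"""

plays = half_inning_plays(SAMPLE_HTML)
for heading, cells in plays:
    print(heading, cells)
```

Returning the heading alongside the cells keeps each play attached to its half inning, which a flat text file of cell contents loses.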


2017-06-14 07:49:21 fcs

-1

import requests
from bs4 import BeautifulSoup  # missing from the original snippet

header = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}

url = requests.get("http://www.belmontbruins.com/sports/m-basebl/2016-17/boxscores/20170407_c6td.xml?view=plays", headers=header).text

soup = BeautifulSoup(url, 'html.parser')

table = soup.find_all('table', {'class': 'striped'})

with open('test.txt', 'w') as thefile:
    for i in table:
        for j in i.find_all('td', {'class': 'text'}):
            txt = j.get_text()
            if txt.startswith('Top of 1st Inning') or txt.startswith('Bottom of 2nd Inning'):
                thefile.write("%s\n" % txt)  # the original wrote an undefined name, `item`


2017-06-14 06:54:26

Unfortunately this does not write anything to the test.txt file, because 'class': 'striped' does not correspond to the 'Top of 1st Inning', etc. tables. – rahlf23

Yes, we need to mention two different URLs: u1 = 'http://www.belmontbruins.com/sports/m-basebl/2016-17/boxscores/20170407_c6td.xml?view=plays&inning=1' u2 = 'http://www.belmontbruins.com/sports/m-basebl/2016-17/boxscores/20170407_c6td.xml?view=plays&inning=2' –
