2017-07-15 39 views
1

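Accessing commented-out HTML lines with BeautifulSoup

I am trying to scrape statistics from this particular web page: https://www.sports-reference.com/cfb/schools/louisville/2016/gamelog/

However, the "defensive log" table appears to be commented out when I look at the HTML source. Because of this, when I use BeautifulSoup4, the following code grabs only the offensive data, which is not commented out, and skips the defensive data, which is.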

from urllib.request import Request,urlopen 
from bs4 import BeautifulSoup 
import re 

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/' 
req = Request(accessurl) 
link = urlopen(req) 
soup = BeautifulSoup(link.read(), "lxml") 


tables = soup.find_all(['th', 'tr']) 
my_table = tables[0] 
rows = my_table.findChildren(['tr']) 
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print(value)

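I'm curious whether there is any solution, inside or outside BeautifulSoup4, that would let me add all of the defensive values to a list in the same way the offensive data is stored. Thanks!

Note: I added the following, adapted from [here](https://stackoverflow.com/questions/23377533/python-beautifulsoup-parsing-table), to the solution given in the answer below: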

data = [] 

table = defensive_log 
table_body = table.find('tbody') 

rows = table_body.find_all('tr') 
for row in rows: 
    cols = row.find_all('td') 
    cols = [ele.text.strip() for ele in cols] 
    data.append([ele for ele in cols if ele]) # Get rid of empty values 
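
To make the row-extraction loop above easier to try without hitting the site, here is a minimal sketch that runs it on a small inline table. The markup in `TABLE_HTML`, the helper name `table_to_rows`, and the use of the built-in `html.parser` are assumptions made for illustration; they do not come from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup, standing in for the parsed defensive_log table.
TABLE_HTML = """
<table>
  <tbody>
    <tr><td> 2016-09-03 </td><td>W</td><td>21</td></tr>
    <tr><td>2016-09-10</td><td></td><td>14</td></tr>
  </tbody>
</table>
"""

def table_to_rows(table):
    """Collect the stripped text of every <td> in each body row."""
    body = table.find("tbody") or table  # fall back if there is no <tbody>
    rows = []
    for row in body.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        rows.append([text for text in cells if text])  # drop empty cells
    return rows

table = BeautifulSoup(TABLE_HTML, "html.parser").find("table")
data = table_to_rows(table)
print(data)
```

One thing to keep in mind: because empty cells are dropped, the second row here ends up shorter than the first, so column positions no longer line up across rows. If the values need to be matched to column headers later, it may be safer to keep the empty strings.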

What do you mean by "commented out"? – snapcrack

Answer

2

The Comment object will give you what you want:

from urllib.request import Request,urlopen 
from bs4 import BeautifulSoup, Comment 

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/' 
req = Request(accessurl) 
link = urlopen(req) 
soup = BeautifulSoup(link, "lxml") 

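# Comment nodes hold the commented-out markup as plain text
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    # re-parse the comment text so its tags become searchable
    comment = BeautifulSoup(str(comment), 'lxml')
    defensive_log = comment.find('table')  # search as an ordinary tag
    if defensive_log:
        break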
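
For anyone who wants to check the idea offline, here is a minimal, self-contained sketch of the same technique run against an inline snippet instead of the live page. The HTML string, the table ids `offense` and `defense`, and the helper name `find_commented_table` are all hypothetical; the real page's ids may differ, so inspect the source before relying on them. It also uses the built-in `html.parser` rather than `lxml`, purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: one live table, one commented out.
HTML = """
<div id="all_offense">
  <table id="offense"><tbody><tr><td>35</td><td>420</td></tr></tbody></table>
</div>
<div id="all_defense">
  <!--
  <table id="defense"><tbody>
    <tr><td>21</td><td>310</td></tr>
    <tr><td>14</td><td>275</td></tr>
  </tbody></table>
  -->
</div>
"""

def find_commented_table(html, table_id):
    """Return the first <table id=table_id> found inside an HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Re-parse the comment's text as markup so its tags become searchable.
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None

defensive_log = find_commented_table(HTML, "defense")
live_soup = BeautifulSoup(HTML, "html.parser")
print(live_soup.find("table", id="defense"))  # not visible to a normal search
print(len(defensive_log.find_all("tr")))      # rows recovered from the comment
```

Filtering by id instead of taking the first table found in any comment is a small design choice: it avoids picking up an unrelated commented-out table if the page happens to contain more than one.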

@Storm, any feedback? Did my solution work? –


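Sorry it took so long to get back to you - I have been moving and only just returned to the project. I'm working through your solution now and trying to incorporate it. – Storm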


I added the following code from [here](https://stackoverflow.com/questions/23377533/python-beautifulsoup-parsing-table). It let me put the data into a table. I have put the final code in the question above. – Storm