在我以前的文章中,我想抓取HKJC上的一些賽馬數據。感謝Dmitriy Fialkovskiy的幫助,我通過稍微修改給定的代碼來實現它。然而,當我試圖瞭解背後的邏輯,有一個線無法解釋說:Beautifulsoup網頁抓取
from bs4 import BeautifulSoup as BS
import requests
import pandas as pd
url_list = ['http://www.hkjc.com/english/racing/horse.asp?HorseNo=S217','http://www.hkjc.com/english/racing/horse.asp?HorseNo=A093','http://www.hkjc.com/english/racing/horse.asp?HorseNo=V344','http://www.hkjc.com/english/racing/horse.asp?HorseNo=V077', 'http://www.hkjc.com/english/racing/horse.asp?HorseNo=P361', 'http://www.hkjc.com/english/racing/horse.asp?HorseNo=T103']
res=[] #placing res outside of loop
for link in url_list:
r = requests.get(link)
r.encoding = 'utf-8'
html_content = r.text
soup = BS(html_content, 'lxml')
table = soup.find('table', class_='bigborder')
if not table:
continue
trs = table.find_all('tr')
if not trs:
continue #if trs are not found, then starting next iteration with other link
headers = trs[0]
headers_list=[]
for td in headers.find_all('td'):
headers_list.append(td.text)
headers_list+=['Season']
headers_list.insert(19,'pseudocol1')
headers_list.insert(20,'pseudocol2')
headers_list.insert(21,'pseudocol3')
row = []
season = ''
for tr in trs[1:]:
if 'Season' in tr.text:
season = tr.text
else:
tds = tr.find_all('td')
for td in tds:
row.append(td.text.strip('\n').strip('\r').strip('\t').strip('"').strip())
row.append(season.strip())
res.append(row)
row=[]
res = [i for i in res if i[0]!=''] #outside of loop
df=pd.DataFrame(res, columns=headers_list) #outside of loop
del df['pseudocol1'],df['pseudocol2'],df['pseudocol3']
del df['VideoReplay']
我不知道什麼是在else
條件增加了重複row =[]
的目的,爲什麼會作品。謝謝。
作爲一個有趣的練習,用'row.clear()'替換'row = []'並觀察魔法。 –
res成爲:[[],[],[],...]這是什麼意思? –