BeautifulSoup web scraping

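In my previous post, I wanted to scrape some horse-racing data from HKJC. Thanks to Dmitriy Fialkovskiy's help, I got it working by slightly modifying the code he gave me. However, when I tried to understand the logic behind it, there was one line I could not explain: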

from bs4 import BeautifulSoup as BS 
import requests 
import pandas as pd 

url_list = [
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=S217',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=A093',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=V344',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=V077',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=P361',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=T103',
]


res = []  # rows collected from every page; defined outside the loop
for link in url_list:
    r = requests.get(link)
    r.encoding = 'utf-8'

    html_content = r.text
    soup = BS(html_content, 'lxml')

    table = soup.find('table', class_='bigborder')
    if not table:
        continue  # no matching table on this page; try the next link

    trs = table.find_all('tr')
    if not trs:
        continue  # no rows found; start the next iteration with another link


    headers = trs[0]
    headers_list = []
    for td in headers.find_all('td'):
        headers_list.append(td.text)
    headers_list += ['Season']
    headers_list.insert(19, 'pseudocol1')
    headers_list.insert(20, 'pseudocol2')
    headers_list.insert(21, 'pseudocol3')

    row = []
    season = ''
    for tr in trs[1:]:
        if 'Season' in tr.text:
            season = tr.text  # remember the season text for the rows that follow
        else:
            tds = tr.find_all('td')
            for td in tds:
                row.append(td.text.strip('\n').strip('\r').strip('\t').strip('"').strip())
            row.append(season.strip())
            res.append(row)
            row = []  # bind row to a new empty list for the next table row

res = [i for i in res if i[0] != '']  # outside the loop: drop rows whose first cell is empty

df = pd.DataFrame(res, columns=headers_list)  # outside the loop
del df['pseudocol1'], df['pseudocol2'], df['pseudocol3']
del df['VideoReplay']

I don't understand the purpose of the repeated row = [] inside the else branch, or why it works. Thanks.

2017-07-06 JAY.Y

As a fun exercise, replace row = [] with row.clear() and watch the magic.

res becomes [[], [], [], ...]. What does that mean?

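The row = [] inside the loop clears the list, making it empty again. Because the list is declared once before the for loop, it would otherwise carry over the elements appended during earlier iterations. Doing row = [] again resets it to an empty list.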
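
To make the reset concrete, here is a minimal sketch of the same loop shape without any scraping. The `collect_rows` function and the sample `table` values are made up for illustration and stand in for the `td` texts; they are not part of the original code.

```python
def collect_rows(table):
    """Collect each row's cells into its own list, resetting `row` each time."""
    res = []
    row = []
    for cells in table:
        for cell in cells:
            row.append(cell)
        res.append(row)  # res now holds a reference to this list object
        row = []         # bind the name `row` to a brand-new empty list
    return res


# Hypothetical stand-in for the cell texts of three table rows.
table = [["A", "1"], ["B", "2"], ["C", "3"]]
print(collect_rows(table))  # [['A', '1'], ['B', '2'], ['C', '3']]
```

Each entry in the result is a separate list object, which is why the rows already stored in res are not disturbed by later iterations.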

2017-07-06 13:20:15

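It should be added that you are assigning row to a new empty list, not just clearing it. You could clear it with del row[:], but then your res would be affected. – corn3lius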
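
A small sketch of what that comment describes, assuming the same loop shape as in the question. The `collect_with_clear` name and the sample data are hypothetical. Emptying the list in place (with `del row[:]`, or equivalently `row.clear()`) also empties every entry already stored in res, because they are all the same object.

```python
def collect_with_clear(table):
    """Same loop, but empty the list in place instead of rebinding the name."""
    res = []
    row = []
    for cells in table:
        row.extend(cells)
        res.append(row)  # appends a reference to the one shared list
        del row[:]       # empties that shared list in place (like row.clear())
    return res


out = collect_with_clear([["A"], ["B"], ["C"]])
print(out)               # [[], [], []]
print(out[0] is out[2])  # True: every entry is the same list object
```

This appears to be the [[], [], [], ...] result reported in the comment above about row.clear().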

Do you mean that without row = [], the result would be res = [['A'], ['A', 'B'], ['A', 'B', 'C']] instead of [['A', 'B', 'C']]?

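The way I see it, if you don't reset row, you keep storing the previous results again and again, with the list growing each time, because of the res.append(row) just above it.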
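
For what it's worth, a quick sketch of the no-reset case, again with made-up data and a hypothetical `collect_without_reset` helper. Since res.append(row) stores a reference rather than a copy, every entry ends up showing the full accumulated list, not progressively longer snapshots.

```python
def collect_without_reset(table):
    """Same loop with the reset removed, to show how the rows accumulate."""
    res = []
    row = []
    for cells in table:
        row.extend(cells)
        res.append(row)  # no reset: the same growing list is appended each time
    return res


out = collect_without_reset([["A"], ["B"], ["C"]])
print(out)  # [['A', 'B', 'C'], ['A', 'B', 'C'], ['A', 'B', 'C']]
```

If snapshots such as [['A'], ['A', 'B'], ...] were actually wanted, appending a copy with res.append(list(row)) would be one way to get them.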

2017-07-06 13:22:05 Fabien
