Getting a table from a web page using Python


I need to get a table from this page:

http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF

The table I'm interested in is this one (ignore the chart above the table):

Here is what I have so far:

from selenium import webdriver 
from bs4 import BeautifulSoup 

# load chrome driver 
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver') 

# load web page and get source html 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
html = driver.page_source 

# make soup and get all tables 
soup = BeautifulSoup(html, 'html.parser') 
tables = soup.find_all('table', {'class': 'r_table3'})
tbl = tables[1] # ideally we should select table by name

Where do I go from here?

Source

2017-10-17 Ledger Yu

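Is there a specific reason you are using both BeautifulSoup and Selenium? – Goralight

I was told that when a page has embedded JavaScript, you need to load it first and then parse it with BeautifulSoup? –

I'm not saying that is the problem, I'm asking why you need it. Do you need the whole table, or a specific cell? – Goralight

To get the data from that page, you could do something like this: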

from selenium import webdriver 
from bs4 import BeautifulSoup 
import time 

driver = webdriver.Chrome() 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
time.sleep(3) 

soup = BeautifulSoup(driver.page_source, 'lxml') 
driver.quit() 

tab_data = soup.select('table')[1] 
for items in tab_data.select('tr'): 
    item = [elem.text for elem in items.select('th,td')] 
    print(' '.join(item))

Partial results:

Total Return %  1-Day 1-Week 1-Month 3-Month YTD 1-Year 3-Year 5-Year 10-Year 15-Year 
IWF (Price) 0.13 0.83 2.68 5.67 23.07 26.60 15.52 15.39 8.97 10.14 
IWF (NAV) 0.20 0.86 2.66 5.70 23.17 26.63 15.52 15.40 8.98 10.14 
S&P 500 TR USD (Price) 0.18 0.52 2.42 4.52 16.07 22.40 13.51 14.34 7.52 9.76
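
The row extraction above can be tried without a browser. The following is a minimal sketch, assuming markup of roughly this shape: `SAMPLE_HTML` is a hypothetical stand-in for `driver.page_source` (values copied from the partial results), and `table_to_rows` is an illustrative helper name, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical static snippet standing in for driver.page_source.
# The real page's markup may differ; the values come from the output above.
SAMPLE_HTML = """
<table class="r_table3">
  <tr><th>Total Return %</th><th>1-Day</th><th>1-Week</th></tr>
  <tr><th>IWF (Price)</th><td>0.13</td><td>0.83</td></tr>
  <tr><th>IWF (NAV)</th><td>0.20</td><td>0.86</td></tr>
</table>
"""


def table_to_rows(html, index=0):
    """Return the index-th <table> in html as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select("table")[index]
    # 'th,td' matches header and data cells in document order,
    # so the row label in the leading <th> is kept with its values.
    return [
        [cell.get_text(strip=True) for cell in tr.select("th,td")]
        for tr in table.select("tr")
    ]


rows = table_to_rows(SAMPLE_HTML)
for row in rows:
    print(" ".join(row))
```

Returning a list of lists rather than printing directly keeps the data available for further processing, for example to build a DataFrame.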

Source

2017-10-17 10:17:25 SIM

Did you run the code? If so, what was the result? Did you not get the data from that table? – SIM

OK, so here is how I did it:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

# load chrome driver
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver')

# load web page and get source html
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF'
driver.get(link)
html = driver.page_source

# make soup and get table
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table', {'class': 'r_table3'})
tbl = tables[1]  # ideally we should select table by name

# column and row names
rows = tbl.find_all('tr')
column_names = [x.get_text() for x in rows[0].find_all('th')[1:]]
row_names = [x.find_all('th')[0].get_text() for x in rows[1:]]

# table content
df = pd.DataFrame(columns=column_names, index=row_names)
for row in rows[1:]:
    row_name = row.find_all('th')[0].get_text()
    # .ix was removed in pandas 1.0; .loc performs the label-based assignment
    df.loc[row_name] = [column.get_text() for column in row.find_all('td')]
print(df)

Is there a more elegant way, that is, a ready-made method I can call instead of looping over rows and columns?

Source
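
One ready-made option that may fit here is `pandas.read_html`, which parses `<table>` elements into DataFrames without an explicit loop. The following is a minimal sketch, not a tested solution for the Morningstar page: `SAMPLE_HTML` is a hypothetical stand-in for `driver.page_source`, and the call assumes a parser backend (`lxml`, or `bs4` with `html5lib`) is installed.

```python
import io

import pandas as pd

# Hypothetical static snippet standing in for driver.page_source.
# The real page's markup may differ.
SAMPLE_HTML = """
<table class="r_table3">
  <tr><th>Total Return %</th><th>1-Day</th><th>1-Week</th></tr>
  <tr><th>IWF (Price)</th><td>0.13</td><td>0.83</td></tr>
  <tr><th>IWF (NAV)</th><td>0.20</td><td>0.86</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per matching <table>.
# attrs narrows the match to tables carrying this class attribute;
# index_col=0 turns the first column (the row labels) into the index.
tables = pd.read_html(
    io.StringIO(SAMPLE_HTML),
    attrs={"class": "r_table3"},
    index_col=0,
)
df = tables[0]
print(df)
```

With Selenium, the same call would presumably take `io.StringIO(driver.page_source)`. Which element of the returned list holds the wanted table depends on the page, so the index needs checking against the real output.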

2017-10-17 10:03:01
