2017-10-17 147 views
0

Getting a table from a web page using Python

I need to get a table from this page:

http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF 

The table I am interested in is this one: [screenshot of the table] (ignore the chart above the table)

This is what I have so far:

from selenium import webdriver 
from bs4 import BeautifulSoup 

# load chrome driver 
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver') 

# load web page and get source html 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
html = driver.page_source 

# make soup and get all tables 
soup = BeautifulSoup(html, 'html.parser') 
tables = soup.find_all('table', {'class': 'r_table3'})
tbl = tables[1] # ideally we should select table by name 

Where do I go from here?

+0

Is there a specific reason you are using both BeautifulSoup and Selenium? – Goralight

+0

I was told that when a page has embedded JavaScript, you need to load it first and then parse it with BeautifulSoup? –

+0

I'm not saying that is the problem, I'm asking why you need it. Do you need the whole table, or one specific cell? – Goralight

Answers

1

To get the data from that page, you can do something like this:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF'
driver.get(link)
time.sleep(3)  # crude fixed wait so the JavaScript-rendered table can load

soup = BeautifulSoup(driver.page_source, 'lxml')  # needs the lxml package
driver.quit()

# take the second <table> on the page and print it row by row
tab_data = soup.select('table')[1]
for items in tab_data.select('tr'):
    # header cells (th) and data cells (td), in document order
    item = [elem.text for elem in items.select('th,td')]
    print(' '.join(item))

Partial output:

Total Return %  1-Day 1-Week 1-Month 3-Month YTD 1-Year 3-Year 5-Year 10-Year 15-Year 
IWF (Price) 0.13 0.83 2.68 5.67 23.07 26.60 15.52 15.39 8.97 10.14 
IWF (NAV) 0.20 0.86 2.66 5.70 23.17 26.63 15.52 15.40 8.98 10.14 
S&P 500 TR USD (Price) 0.18 0.52 2.42 4.52 16.07 22.40 13.51 14.34 7.52 9.76 
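
The row loop above can be tried without a browser. The following is only a sketch: it runs the same `select('tr')` / `select('th,td')` pattern on a small hand-written HTML snippet and collects the cells in lists instead of printing them directly. The snippet `SAMPLE_HTML` and the helper name `table_rows` are made up for illustration and are not taken from the Morningstar page. It uses `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; not the real page markup.
SAMPLE_HTML = """
<table class="r_table3">
  <tr><th>Total Return %</th><th>1-Day</th><th>1-Week</th></tr>
  <tr><th>IWF (Price)</th><td>0.13</td><td>0.83</td></tr>
  <tr><th>IWF (NAV)</th><td>0.20</td><td>0.86</td></tr>
</table>
"""


def table_rows(html, index=0):
    """Return the index-th <table> in html as a list of rows (lists of cell text)."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.select('table')[index]
    return [
        [cell.get_text(strip=True) for cell in row.select('th,td')]
        for row in table.select('tr')
    ]


rows = table_rows(SAMPLE_HTML)
for row in rows:
    print(' '.join(row))
```

With the real page, `table_rows(driver.page_source, index=1)` would presumably correspond to the `soup.select('table')[1]` lookup used in the answer, assuming the page layout has not changed.
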
+0

Did you run the code? If so, what is your feedback? Did you not get the data from that table? – SIM

0

OK, so here is how I did it:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

# load chrome driver 
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver') 

# load web page and get source html 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
html = driver.page_source 

# make soup and get table 
soup = BeautifulSoup(html, 'html.parser') 
tables = soup.find_all('table',{'class':'r_table3'}) 
tbl = tables[1] # ideally we should select table by name 

# column and row names 
rows = tbl.find_all('tr') 
column_names = [x.get_text() for x in rows[0].find_all('th')[1:]] 
row_names = [x.find_all('th')[0].get_text() for x in rows[1:]] 

# table content 
df = pd.DataFrame(columns=column_names, index=row_names) 
for row in rows[1:]: 
    row_name = row.find_all('th')[0].get_text() 
    # .loc replaces .ix, which was removed in pandas 1.0
    df.loc[row_name] = [column.get_text() for column in row.find_all('td')]
print(df) 

Is there a more elegant way, i.e. without looping over rows and columns, some ready-made, off-the-shelf method I can call?
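
One ready-made option that may fit is `pandas.read_html`, which parses `<table>` elements straight into DataFrames. The sketch below is an illustration under assumptions, not a verified answer for the Morningstar page: it feeds a small hand-written HTML snippet in place of `driver.page_source`, assumes the table carries the class `r_table3` as in the code above, and needs lxml (or bs4 together with html5lib) installed for the parsing step.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for driver.page_source; not the real page markup.
html = """
<table class="r_table3">
  <tr><th>Total Return %</th><th>1-Day</th><th>1-Week</th></tr>
  <tr><th>IWF (Price)</th><td>0.13</td><td>0.83</td></tr>
  <tr><th>IWF (NAV)</th><td>0.20</td><td>0.86</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per matching <table>;
# index_col=0 turns the first column (the row labels) into the index.
tables = pd.read_html(StringIO(html), attrs={'class': 'r_table3'}, index_col=0)
df = tables[0]
print(df)
```

In the Selenium script, `StringIO(driver.page_source)` would take the place of `StringIO(html)`. Wrapping the string in `StringIO` avoids the deprecation warning that recent pandas versions emit when a literal HTML string is passed.
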