2017-08-30 105 views

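Using rvest to scrape HTML data

I am trying to scrape Hockey Reference for a Data Science 101 project, and I am running into a problem with one particular table. The web page is: https://www.hockey-reference.com/boxscores/201611090BUF.html. The table I want is under "Advanced Stats Report (All Situations)". I have tried the following code: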

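library(rvest)

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
# select every node whose class attribute contains "right" and return its text
ret <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))]') %>%
    html_text()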

This code scrapes all the data from the upper tables but stops before the advanced table. I also tried to be more granular with:

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[(@id = "OTT_adv")]//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))]') %>%
    html_text()

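which produces a "character(0)" message. Any and all help would be greatly appreciated. In case it is not already clear, I am fairly new to R. Thanks!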

Answer


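The information you are trying to scrape is hidden inside a comment on the web page, which is why selectors on the parsed page never reach it. Here is a solution; the final result will still need some work to clean up: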

library(rvest)
url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"

page <- read_html(url) # parse html

commentedNodes <- page %>%
    html_nodes('div.section_wrapper') %>% # select nodes that contain comments
    html_nodes(xpath = 'comment()') # select comments within those nodes

# there are multiple (3) nodes containing comments;
# the 2nd one was chosen by trial and error
output <- commentedNodes[2] %>%
    html_text() %>%          # return the comment contents as text
    read_html() %>%          # parse that text as html
    html_nodes('table') %>%  # select the table nodes
    html_table()             # parse each table and return a data.frame

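The output will be a list of 2 elements, one per table. The player names and statistics are repeated multiple times, once for each available option, so you will need to clean this data for your final purpose.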
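
As a rough illustration of the same idea without hitting the site, here is a minimal sketch that runs on a small inline HTML string. The markup, the player names, and the numbers are made up for the example and are not the real page; only the table id "OTT_adv" is taken from the question. Instead of picking the comment by position, it looks for the comment whose text mentions that id, which may be less fragile than trial and error if the number of comments on the page changes.

```r
library(rvest)

# Hypothetical miniature page (NOT the real markup): one visible table and
# one table wrapped in an HTML comment, as described in the answer.
html <- paste0(
  '<html><body>',
  '<div class="section_wrapper">',
  '<table id="visible"><tr><th>Player</th><th>G</th></tr>',
  '<tr><td>A. Skater</td><td>1</td></tr></table>',
  '</div>',
  '<div class="section_wrapper"><!--',
  '<table id="OTT_adv"><tr><th>Player</th><th>iCF</th></tr>',
  '<tr><td>B. Skater</td><td>4</td></tr>',
  '<tr><td>C. Skater</td><td>2</td></tr></table>',
  '--></div>',
  '</body></html>'
)

page <- read_html(html)

# A normal selector only sees the table that is outside the comment
visible_ids <- page %>% html_nodes("table") %>% html_attr("id")

# Pull the text of every comment node in the document
comment_text <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text()

# Keep only the comment(s) that mention the table id we want
wanted <- comment_text[grepl('id="OTT_adv"', comment_text, fixed = TRUE)]

# Parse the comment text as HTML and read the table out of it
adv <- read_html(wanted[1]) %>%
  html_node("table#OTT_adv") %>%
  html_table()

print(visible_ids)
print(adv)
```

On the real page you would replace `html` with the URL and, if needed, swap "OTT_adv" for the id of the other team's table.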