Using rvest to scrape HTML data

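I'm trying to scrape Hockey Reference for a Data Science 101 project, and I'm running into a problem with one particular table. The page is: https://www.hockey-reference.com/boxscores/201611090BUF.html. The table I want is under "Advanced Stats Report (All Situations)". I have tried the following code: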

library(rvest)
url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>% 
    read_html()%>% 
    html_nodes(xpath='//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))]') %>% 
    html_text()

This code scrapes all the data from the upper tables, but stops before the advanced table. I also tried to get more granular with:

url="https://www.hockey-reference.com/boxscores/201611090BUF.html" 
ret <- url %>% 
    read_html()%>% 
    html_nodes(xpath='//*[(@id = "OTT_adv")]//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))]') %>% 
    html_text()

which produces a "character(0)" result. Any and all help would be greatly appreciated. If it isn't already clear, I'm fairly new to R. Thanks!

2017-08-30 Dan L

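The information you are trying to scrape is hidden in the page source as an HTML comment, which is why selectors that target elements never reach it. Here is a solution; the final result will need some work to clean up: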

library(rvest) 
url="https://www.hockey-reference.com/boxscores/201611090BUF.html" 

page<-read_html(url) # parse html 

commentedNodes<-page %>%     
    html_nodes('div.section_wrapper') %>% # select node with comment 
    html_nodes(xpath = 'comment()') # select comments within node 

# there are multiple (3) nodes containing comments
# the 2nd one was chosen via trial and error
output<-commentedNodes[2] %>% 
    html_text() %>%    # return contents as text 
    read_html() %>%    # parse text as html 
    html_nodes('table') %>%  # select table node 
    html_table()    # parse table and return data.frame

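The output will be a list of 2 elements, one per table. The player names and stats are repeated several times, once for each of the available options, so you will need to clean up this data to suit your final purpose.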
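To see why the original XPath returned character(0), and as a possible alternative to choosing the comment by position, here is a minimal offline sketch. The inline HTML is a made-up miniature page (the player names and numbers are invented), and it assumes that the commented-out markup on the real page contains the id "OTT_adv" used in the question. That has not been verified against the live site.

```r
library(rvest)

# Hypothetical miniature page: the table with id "OTT_adv" exists only
# inside an HTML comment, mimicking the structure described above.
html <- '<html><body>
<div class="section_wrapper">
<!--
<table id="OTT_adv">
  <tr><th>Player</th><th>SOG</th></tr>
  <tr><td>Player A</td><td>3</td></tr>
  <tr><td>Player B</td><td>5</td></tr>
</table>
-->
</div>
</body></html>'

page <- read_html(html)

# An element selector finds nothing, because a comment is not an element.
direct <- html_nodes(page, xpath = '//*[@id = "OTT_adv"]')
length(direct)

# Select the comment whose text mentions the table id, rather than
# picking a position by trial and error.
comment_node <- html_nodes(page, xpath = '//comment()[contains(., "OTT_adv")]')

# Re-parse the comment text as HTML and pull the table out by its id.
adv <- comment_node %>%
  html_text() %>%
  read_html() %>%
  html_nodes(xpath = '//table[@id = "OTT_adv"]') %>%
  html_table()

adv[[1]]
```

If the assumption holds, swapping `html` for the real URL should return just the one table for that team, which may reduce the amount of cleanup needed afterwards.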

2017-08-30 23:01:23 Dave2e
