How to get a table using rvest()

I want to use the rvest package to scrape some data from the Pro Football Reference website. First, let's grab the results of all games played in 2015 from this URL: http://www.pro-football-reference.com/years/2015/games.htm

library("rvest") 
library("dplyr") 

#grab table info 
url <- "http://www.pro-football-reference.com/years/2015/games.htm" 
urlHtml <- url %>% read_html() 
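# html_table() on a whole document returns a list of tables; keep the first one
# (as_tibble() replaces the deprecated as_data_frame())
dat <- urlHtml %>% html_table(header=TRUE) %>% .[[1]] %>% as_tibble()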

This gives the results of all games played in 2015.

dat could use a little cleaning. Two of the variables appear to have blank names. Also, the header row is repeated between each week.

colnames(dat) <- c("week", "day", "date", "winner", "at", "loser", 
        "box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL") 

dat2 <- dat %>% filter(!(box == "")) 
head(dat2)

Looks good!

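Now let's look at an individual game. On the page above, click "Boxscore" in the first row of the table: the September 10 game between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm.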

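I want to grab the individual snap counts for each player (about halfway down the page). I'm pretty sure these will be our first two lines of code: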

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" 
gameHtml <- gameUrl %>% read_html()

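But now I can't figure out how to grab the specific table I want. I used Selector Gadget to highlight the Patriots snap counts table. I did this by clicking the table in several places and then "unclicking" the other tables that were highlighted. I ended up with this path:

#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left

Each of these attempts returns {xml_nodeset (0)}: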

gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left") 
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left") 
gameHtml %>% html_nodes("#home_snap_counts .right") 
gameHtml %>% html_nodes("#home_snap_counts")

Maybe let's try xpath instead. All of these attempts also return {xml_nodeset (0)}:

gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "right", " "))] | //*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "left", " "))]//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "left", " "))]//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "tooltip", " "))]//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "), concat(" ", "left", " "))]') 
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat(" ", @class, " "))]') 
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')

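How can I scrape that table? I'll also point out that when I view the page source in Google Chrome, the tables I want almost seem to be commented out. That is, they are printed in green instead of the usual red/black/blue color scheme. That is not the case for the game results table we pulled first: its "View Page Source" shows the usual red/black/blue color scheme. Does the green indicate whatever is preventing me from grabbing this snap counts table?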

Thanks!

Source

2016-08-30 hossibley

url <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm#all_vis_snap_counts"
unit.count <- url %>% read_html() %>% html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "table_container", " "))]')

returns a list of one element (i.e. {xml_nodeset (1)}), but I can't seem to convert it to a table using html_table(fill = TRUE) –

'http://www.pro-football-reference.com/boxscores/201509100nwe.htm' %>% read_html() %>% html_nodes(xpath = '//comment()') %>% html_text() %>% paste(collapse = '') %>% read_html() %>% html_node('table#home_snap_counts') %>% html_table() %>% {setNames(.[-1, ], paste0(names(.), .[1, ]))} %>% readr::type_convert() – alistaire
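The comment-extraction idea in the one-liner above can be tried offline. The following is only a minimal sketch: the inline page, the player names, and the counts are made up for illustration and are not taken from the real site. It shows that a table wrapped in an HTML comment is invisible to a CSS selector, and that re-parsing the comment text makes it selectable.

```r
library(rvest)

# Hypothetical miniature page: the table sits inside an HTML comment,
# mimicking what the asker saw in "View Page Source"
page <- read_html('<html><body>
<div id="all_home_snap_counts">
<!--
<table id="home_snap_counts">
<tr><th>Player</th><th>Snaps</th></tr>
<tr><td>A</td><td>10</td></tr>
<tr><td>B</td><td>7</td></tr>
</table>
-->
</div>
</body></html>')

# The commented-out table is not an element node, so the selector finds nothing
direct_hits <- length(html_nodes(page, "#home_snap_counts"))

# Collect the text of every comment, re-parse it as HTML, then select the table
snaps <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_node("table#home_snap_counts") %>%
  html_table()

print(direct_hits)
print(snaps)
```

On the real page the table apparently has a two-row header, which is presumably why the one-liner also merges the first data row into the column names with setNames(); this toy table has a single header row, so that step is left out.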

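The information you are looking for is displayed programmatically at run time. One solution is to use RSelenium.
When you view the page source, the table information is present in the code but hidden, because the tables are stored as comments. Here is my solution, in which I remove the comment markers and reprocess the page normally.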

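I save the file to the working directory and then read it back with the readLines function. Next I search for the HTML opening and closing comment markers and remove them. I save the file again (minus the comment markers) so it can be re-read and the selected tables processed.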

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" 
gameHtml <- gameUrl %>% read_html() 
gameHtml %>% html_nodes("tbody") 

#Only save and work with the body 
body<-html_node(gameHtml,"body") 
write_xml(body, "nfl.xml") 

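#Find and remove the lines holding comment markers
#(negated grepl avoids dropping every line when nothing matches, which -grep would do)
lines<-readLines("nfl.xml")
lines<-lines[!grepl("<!--", lines, fixed=TRUE)]
lines<-lines[!grepl("-->", lines, fixed=TRUE)]
writeLines(lines, "nfl2.xml")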

#Read the file back in and process normally 
body<-read_html("nfl2.xml") 
html_table(html_nodes(body, "table")[29]) 

#extract each table's attributes
a<-html_attrs(html_nodes(body, "table"))

#find the tables of interest by id (looked up by name rather than by attribute position)
homesnap<-which(sapply(a, function(x){x["id"]})=="home_snap_counts")
html_table(html_nodes(body, "table")[homesnap])

visitsnap<-which(sapply(a, function(x){x["id"]})=="vis_snap_counts")
html_table(html_nodes(body, "table")[visitsnap])
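As a variation on the steps above, the comment markers can probably be stripped in memory, without writing intermediate files. This is only a sketch under assumptions: raw_html is a made-up stand-in for the page source, and for the real page the text would have to be fetched first (for example with readLines on the URL, which is not tested here).

```r
library(rvest)

# Hypothetical stand-in for the downloaded page source (not the real page)
raw_html <- '<html><body><!--
<table id="vis_snap_counts">
<tr><th>Player</th><th>Snaps</th></tr>
<tr><td>C</td><td>12</td></tr>
</table>
--></body></html>'

# Delete the opening and closing comment markers but keep what is between them
uncommented <- gsub("<!--|-->", "", raw_html)

# Parse the cleaned text and select the table by its id
vis <- read_html(uncommented) %>%
  html_node("table#vis_snap_counts") %>%
  html_table()

print(vis)
```

Unlike dropping whole lines, this keeps any markup that shares a line with a marker, but it also exposes every other commented-out fragment on the page, so selecting by id matters more.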

Source

2016-08-30 23:22:35 Dave2e

Thanks Dave! Great solution. – hossibley
