Scraping a complex HTML table into a data.frame in R

I am trying to load Wikipedia's data on United States Supreme Court justices into R:

library(rvest) 

html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States") 
judges = html_table(html_nodes(html, "table")[[2]]) 
head(judges[,2]) 

[1] "Wilson, JamesJames Wilson"  "Jay, JohnJohn Jay†"    
[3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."  
[5] "Rutledge, JohnJohn Rutledge"  "Iredell, JamesJames Iredell"

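The problem is that the data comes out malformed. Instead of the name appearing as it does in the actual HTML table ("James Wilson"), it appears twice: once as "Last name, First name" and then again as "First name Last name".

The reason is that each <td> actually contains an invisible <span>: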

<td style="text-align:left;" class=""> 
    <span style="display:none" class="">Wilson, James</span> 
    <a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a> 
</td>

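The same is true for the columns with numeric data. I am guessing this extra markup is needed for sorting the HTML table. However, I am not clear on how to remove these spans when creating a data.frame from the table in R.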

2015-01-08 Ari

Maybe something like this:

library(XML) 
library(rvest) 
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States") 
judges = html_table(html_nodes(html, "table")[[2]]) 
head(judges[,2]) 
# [1] "Wilson, JamesJames Wilson"  "Jay, JohnJohn Jay†"    "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."  
# [5] "Rutledge, JohnJohn Rutledge"  "Iredell, JamesJames Iredell"

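# Remove the hidden sort-key <span> nodes from the second cell of each table row (modifies the parsed document in place)
removeNodes(getNodeSet(html, "//table/tr/td[2]/span")) 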
judges = html_table(html_nodes(html, "table")[[2]]) 
head(judges[,2]) 
# [1] "James Wilson" "John Jay†"  "William Cushing" "John Blair, Jr." "John Rutledge" "James Iredell"
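
The same idea can be sketched against xml2, which newer rvest releases are built on and where html() has been superseded by read_html(). This is only a sketch under stated assumptions: the inline HTML is a made-up two-row stand-in for the Wikipedia table, the XPath filter on display:none is a guess at how the hidden sort keys are marked, and xml2::xml_remove() is swapped in for XML::removeNodes().

```r
library(rvest)
library(xml2)

# Hypothetical miniature of the Wikipedia table (not the real page)
page <- read_html('<table>
  <tr><th>Name</th><th>State</th></tr>
  <tr><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td><td>PA</td></tr>
  <tr><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td><td>NY</td></tr>
</table>')

# Find every span whose inline style hides it (assumed marker for sort keys)
hidden <- xml_find_all(page, "//span[contains(@style, 'display:none')]")

# Drop those nodes from the parsed document in place
xml_remove(hidden)

# Convert the cleaned table as before
judges <- html_table(html_node(page, "table"))
print(judges$Name)
```

Because the filter is on the style attribute rather than on td[2], it would also strip hidden spans from the numeric columns, which may or may not be what you want for other tables.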

2015-01-08 15:59:31 lukeA

You can use rvest:

library(rvest) 

html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")%>% 
    html_nodes("span+ a") %>% 
    html_text()

This is not perfect, so you may want to refine the CSS selector, but it gets you fairly close.
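
One possible refinement, shown here only as a sketch: restrict the match to links that directly follow a hidden span inside a table cell. The inline HTML and the stray "See also" paragraph are invented for illustration, and the selector assumes the sort keys carry an inline display:none style as in the question.

```r
library(rvest)

# Hypothetical page fragment: a small table plus an unrelated span + link pair
page <- read_html('<table>
  <tr><th>Name</th></tr>
  <tr><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td></tr>
  <tr><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td></tr>
</table>
<p><span>See also</span> <a href="/wiki/Other">Other page</a></p>')

# Loose selector: any link immediately preceded by a span, anywhere on the page
loose <- page %>% html_nodes("span + a") %>% html_text()

# Tighter selector: only links after a hidden span that is a direct child of a cell
strict <- page %>%
    html_nodes('td > span[style*="display:none"] + a') %>%
    html_text()

print(loose)
print(strict)
```

The tighter selector still returns a plain character vector of names rather than a data.frame, so it suits pulling out a single column more than rebuilding the whole table.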

2015-01-08 15:50:08
