Loading a table from Wikipedia into R

I want to load the table of Supreme Court justices from the following URL into R: https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States

I am using the following code:

scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States" 
scotusData <- getURL(scotusURL, ssl.verifypeer = FALSE) 
scotusDoc <- htmlParse(scotusData) 
scotusData <- scotusDoc['//table[@class="wikitable"]'] 
scotusTable <- readHTMLTable(scotusData[[1]], stringsAsFactors = FALSE)

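R returns scotusTable as NULL. The goal is to get a data.frame in R that I can use to build a ggplot of the SCOTUS justices' tenure on the Court. I previously had this script working and it produced a great plot, but after the recent decisions something changed on the page and now the script won't run. I went through the HTML on Wikipedia looking for changes, but I'm not a web developer, so whatever breaks my script isn't immediately apparent to me.

Also, is there a way in R to cache the data from this page so that I am not constantly hitting the URL? That seems like the ideal way to avoid this problem in the future. I appreciate the help.

In addition, SCOTUS is an ongoing hobby/side project of mine, so if there is a better data source than Wikipedia, I'm all ears.

Edit: Sorry, I should have listed my dependencies. I'm using the XML, plyr, RCurl, data.table and ggplot2 libraries.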

Source

2015-07-02 Benjamin Scott

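Which package does the 'getURL' function come from? – Frash

http://stackoverflow.com/questions/27843659/scraping-a-complex-html-table-into-a-data-frame-in-r – Khashaa

Regarding your data source question, you could consider asking on the Open Data Stack Exchange site. – Frank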

If you don't mind using a different package, you can try the "rvest" package.

library(rvest)  
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"

Option 1: Grab all the tables from the page and use the html_table function to extract the one you're interested in.

temp <- scotusURL %>% 
    read_html %>%  # older rvest versions use html() here
    html_nodes("table") 

html_table(temp[1]) ## Just the "legend" table 
html_table(temp[2]) ## The table you're interested in

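Option 2: Inspect the table element and copy its XPath to read that table directly (right-click, Inspect Element, scroll to the relevant "table" tag, right-click it and choose "Copy XPath").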
```
scotusURL %>% 
    read_html %>%  # older rvest versions use html() here
    html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>% 
    html_table 
```

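Another option I like is to load the data into a Google spreadsheet and read it with the "googlesheets" package.

In Google Drive, create a new spreadsheet named "Supreme Court". In the first worksheet, enter: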

=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)

This automatically pulls the table into the Google spreadsheet.

From there, in R, you can do:

library(googlesheets) 
SC <- gs_title("Supreme Court") 
gs_read(SC)

Source

2015-07-02 06:32:21 A5C1D2H2I1M1N2O1R2T1

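'temp = tempfile(); httr::GET(wurl, user_agent("Dogzilla"), write_disk(temp)); table <- XML::readHTMLTable(temp); table[[2]];' gives me the same table as the code above, but how do you clean up the years and so on? They are messy. For example, Born/Died in the first row shows up as 174512121745-1829 when it should be 1745-1829. I don't know where the extra characters come from. – Frash

@Frash, I don't know how that happens, but it looks like the exact date (12/12/1745) is being embedded in the cell ahead of the years. – A5C1D2H2I1M1N2O1R2T1

You're right, the wiki page shows that date in edit mode. – Frash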
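As a follow-up to the comments above, here is a minimal sketch of stripping the stray digits after the table has been read. It assumes the embedded date is always exactly eight digits (YYYYMMDD) immediately in front of a four-digit year, which is all the single example in the comment shows; `strip_sort_key` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: drop an 8-digit date prefix (assumed YYYYMMDD)
# when it is immediately followed by a 4-digit year.
strip_sort_key <- function(x) {
  # The lookahead keeps the year itself; values without the prefix are left alone
  sub("^\\d{8}(?=\\d{4})", "", x, perl = TRUE)
}

born_died <- c("174512121745-1829", "1739-1800")
cleaned <- strip_sort_key(born_died)
print(cleaned)
```

If other columns carry a differently shaped prefix, the pattern would need adjusting, so it is worth checking a few rows by eye before relying on it.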

You can try this:

url <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States" 
library(rvest) #v 0.2.0.9000 
the_table <- read_html(url) %>% html_node("table.wikitable:nth-child(11)") %>% html_table()

Source

2015-07-02 06:33:41 RHertel

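If you have an older version of the 'rvest' package, you may need to replace 'read_html(url)' with 'html(url)'. – RHertel

For some reason the googlesheets dependency wouldn't work, so I ran it through Google anyway.

I ran: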

=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)

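and then downloaded the file as a .csv.

Not sure why I didn't think of that before. I will have to rewrite my string-cleaning script, but this ended up being the best way to 1) solve the first problem I ran into and 2) download the file so that I don't have to keep hitting the URL.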
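For anyone who would rather keep the caching step inside R, a rough sketch of the same idea is below. `cached_read` and `fake_scrape` are made-up names for illustration; `fake_scrape` stands in for whichever scraping code you end up using, so the example runs without touching the network.

```r
# Hypothetical helper: call `fetch` only when no cached copy exists at `path`
cached_read <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

# Stand-in for the real scrape; counts how many times it is called
calls <- 0
fake_scrape <- function() {
  calls <<- calls + 1
  data.frame(Judge = c("John Jay", "John Rutledge"),
             State = c("NY", "SC"),
             stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_read(cache_file, fake_scrape)  # runs the scrape and writes the cache
second <- cached_read(cache_file, fake_scrape)  # read back from the cache file
```

In real use the cache path would be a file in the project directory rather than a tempfile, and deleting that file forces a fresh download.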

Thanks for the help.

Source

2015-07-02 08:47:01

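I would remove all <span style="display:none"> nodes and read the table from scotusDoc by position, rather than trying to select on the table class value, which has changed.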

scotusDoc <- htmlParse(scotusData, encoding="UTF-8") 
xpathSApply(scotusDoc, "//span[@style='display:none']", removeNodes) 
x <- readHTMLTable(scotusDoc, which=2,stringsAsFactors=FALSE) 

head(x) 
    #   Judge State Born/Died   Active service Chief Justice Retirement Appointed by Reason for\ntermination 
1 1  John Jay† NY 1745–1829 1789–1795(5–6 years)  1789–1795   — Washington    Resignation 
2 2 John Rutledge SC 1739–1800 1789–1791(1–2 years)    —   — Washington  Resignation[n 1] 
3 3 William Cushing MA 1732–1810 1789–1810(20–21 years)    —   — Washington     Death 
4 4 James Wilson PA 1742–1798 1789–1798(8–9 years)    —   — Washington     Death 
5 5 John Blair, Jr. VA 1732–1800 1789–1795(5–6 years)    —   — Washington    Resignation 
6 6 James Iredell NC 1751–1799 1790–1799(8–9 years)    —   — Washington     Death

Here are the table classes; the second table is now "wikitable sortable":

xpathSApply(scotusDoc, "//table", xmlGetAttr, "class") 
[1] "wikitable"           "wikitable sortable"        
[3] "navbox"           "nowraplinks collapsible autocollapse navbox-inner" 
[5] "navbox"           "nowraplinks collapsible collapsed navbox-inner"
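If you still want to select by class, one option is to match "wikitable" as a single token in the class list rather than as the whole attribute value. The sketch below uses a small made-up HTML snippet, not the real page, to show the difference between the two XPath expressions; it assumes the XML package is installed.

```r
library(XML)

# Made-up miniature of the page: a legend table, the data table, a navbox
html_text <- '<html><body>
<table class="wikitable"><tr><th>Key</th></tr><tr><td>legend</td></tr></table>
<table class="wikitable sortable"><tr><th>Judge</th><th>State</th></tr>
<tr><td>John Jay</td><td>NY</td></tr></table>
<table class="navbox"><tr><td>nav</td></tr></table>
</body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# Exact match on the attribute value: only finds class="wikitable"
exact <- getNodeSet(doc, '//table[@class="wikitable"]')

# Token match: finds any table whose class list includes "wikitable"
token <- getNodeSet(
  doc,
  "//table[contains(concat(' ', normalize-space(@class), ' '), ' wikitable ')]"
)

length(exact)
length(token)
```

On this snippet the exact match returns only the legend table, while the token match also returns the "wikitable sortable" table.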

Source

2015-07-02 16:26:32
