2016-09-11

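Find an HTML table by name and scrape it in R

I'm trying to scrape a table from a web page that contains several tables. I want the "FIPS Codes for the States and the District of Columbia" table from https://www.census.gov/geo/reference/ansi_statetables.html. I think XML::readHTMLTable() is the right way to go, but when I try the following I get an error: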

library(XML)
url <- "https://www.census.gov/geo/reference/ansi_statetables.html"
readHTMLTable(url, header = TRUE, stringsAsFactors = FALSE)

named list()
Warning message:
XML content does not seem to be XML: 'https://www.census.gov/geo/reference/ansi_statetables.html'

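That isn't surprising, of course, since I'm not giving the function any indication of which table I want to read. I've been digging around in "Inspect" for a good while, but I'm not connecting the dots on how to be more precise. There doesn't seem to be a table name or class like the ones in other examples I've found in the documentation or on SO. Thoughts?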

I got it to work with readHTMLTable(RCurl::getURL(url), ...).

Answers

3

Consider using readLines() to fetch the HTML page content, then pass the result to readHTMLTable():

library(XML)
url <- "https://www.census.gov/geo/reference/ansi_statetables.html"
webpage <- readLines(url)    # download the page as a character vector of lines

readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)    # LIST OF 3 TABLES

# $`NULL`
#                    Name FIPS State Numeric Code Official USPS Code
# 1               Alabama                      01                 AL
# 2                Alaska                      02                 AK
# 3               Arizona                      04                 AZ
# 4              Arkansas                      05                 AR
# 5            California                      06                 CA
# 6              Colorado                      08                 CO
# 7           Connecticut                      09                 CT
# 8              Delaware                      10                 DE
# 9  District of Columbia                      11                 DC
# 10              Florida                      12                 FL
# 11              Georgia                      13                 GA
# 12               Hawaii                      15                 HI
# 13                Idaho                      16                 ID
# 14             Illinois                      17                 IL
# ...

To return one specific data frame:

fipsdf <- readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)[[1]]    # first table on the page
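Picking `[[1]]` depends on the table's position on the page. Since the question notes that the tables have no name or class, one alternative is to select the table by something it contains, such as a header cell. Below is a minimal sketch of that idea. The inline HTML is a made-up stand-in for the downloaded page (the real markup may differ), and the names `html`, `nodes`, and `fips_by_header` are just for illustration.

```r
library(XML)

# Hypothetical stand-in for the downloaded page: two tables, no id or class.
html <- '
<html><body>
  <table>
    <tr><th>Name</th><th>FIPS State Numeric Code</th><th>Official USPS Code</th></tr>
    <tr><td>Alabama</td><td>01</td><td>AL</td></tr>
    <tr><td>Alaska</td><td>02</td><td>AK</td></tr>
  </table>
  <table>
    <tr><th>Area Name</th><th>FIPS State Numeric Code</th></tr>
    <tr><td>Guam</td><td>66</td></tr>
  </table>
</body></html>'

# Parse the HTML text into a document tree.
doc <- htmlParse(html, asText = TRUE)

# Keep only tables that have a header cell reading "Official USPS Code".
nodes <- getNodeSet(doc, "//table[.//th[normalize-space(.) = 'Official USPS Code']]")

# Convert the first (here: only) matching table node to a data frame.
fips_by_header <- readHTMLTable(nodes[[1]], header = TRUE, stringsAsFactors = FALSE)
```

With the real page you would build `doc` from the downloaded text instead of the inline string; whether this XPath matches there depends on how the page marks up its header cells.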
1

Another solution, using rvest instead of XML:

library(rvest)
read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
    html_table() %>% .[[1]]
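The same "select by content" idea can be sketched in rvest, assuming rvest 1.0.0 or later (for `html_element()`, `minimal_html()`, and the `convert` argument). The inline HTML is again a hypothetical stand-in for the real page, and `page` and `fips_tbl` are illustrative names. Passing `convert = FALSE` is worth considering here, because with the default type conversion a code like "01" would likely come back as the number 1.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page: two tables, no id or class.
page <- minimal_html('
  <table>
    <tr><th>Name</th><th>FIPS State Numeric Code</th><th>Official USPS Code</th></tr>
    <tr><td>Alabama</td><td>01</td><td>AL</td></tr>
  </table>
  <table>
    <tr><th>Area Name</th><th>FIPS State Numeric Code</th></tr>
    <tr><td>Guam</td><td>66</td></tr>
  </table>')

# Select the first table with a header cell reading "Official USPS Code",
# then read it without type conversion so codes stay as character strings.
fips_tbl <- page %>%
  html_element(xpath = "//table[.//th[normalize-space(.) = 'Official USPS Code']]") %>%
  html_table(convert = FALSE)
```

For the live page, `read_html(url)` would take the place of `minimal_html(...)`.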