2015-10-17
1

rvest | Web scraping data into long format

While web scraping I ran into the following problem, and I suspect there is a better solution than the one I have.

I have data like this:

dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany")) 

               query
1    Washington, USA
2 Frankfurt, Germany

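I want to query, for example, the Google Maps API and get back the formatted address(es). A single query may return several formatted addresses. The result should look like this: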

               query         formatted_address
1    Washington, USA       Washington, DC, USA
2    Washington, USA       Washington, UT, USA
3    Washington, USA Washington, VA 22747, USA
4    Washington, USA Washington, IA 52353, USA
5    Washington, USA Washington, GA 30673, USA
6    Washington, USA Washington, PA 15301, USA
7 Frankfurt, Germany        Frankfurt, Germany

This is what I currently do:

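require(RCurl) 
require(rvest) 
require(xml2)      # provides read_xml() and xml_text()
require(magrittr) 

# Build the geocoding request URL for one (URL-escaped) query string
build_url <- function(x, base_url = "https://maps.googleapis.com/maps/api/geocode/xml?address="){ 
    paste0(base_url, RCurl::curlEscape(x)) 
} 

l <- lapply(dat$query, function(q){ 
    # One request per query; a query can yield several <formatted_address> nodes
    formatted_address <- q %>% build_url %>% read_xml %>% html_nodes("formatted_address") %>% xml_text 
    data.frame(query = q, formatted_address) 
}) 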

do.call(rbind, l) # This can be done via data.table::rbindlist as well 

Is there a better solution? Perhaps something more in the data.table / dplyr style?

+1

Please include the 'library' / 'require' calls to make your code reproducible – jangorecki

+0

Sure. I just added the 'require' statements along with the data.frame creation – Rentrop

+2

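Other than 'stringsAsFactors = FALSE', you've optimized this perfectly IMO. I'd suggest adding a 'sleep' in the lapply and making sure you limit the number of calls to 2,500 or fewer IIRC ([usage limits](https://developers.google.com/maps/documentation/business/articles/usage_limits) info). – hrbrmstr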
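
A minimal sketch of what that comment suggests, under a few assumptions: `geocode_throttled`, `fetch`, `pause` and `max_calls` are hypothetical names introduced here, and `fake_fetch` is an offline stand-in for the real request so the example runs without network access. In practice `fetch` could wrap the rvest pipeline from the question.

```r
# Hypothetical helper: look up each query with a pause between requests.
# `fetch` is any function mapping one query string to a character vector
# of formatted addresses.
geocode_throttled <- function(queries, fetch, pause = 0.2, max_calls = 2500) {
  queries <- as.character(queries)
  if (length(queries) > max_calls) {
    stop("Too many queries: ", length(queries), " > ", max_calls)
  }
  results <- lapply(queries, function(q) {
    Sys.sleep(pause)                 # wait before each request
    addresses <- fetch(q)
    data.frame(query = rep(q, length(addresses)),
               formatted_address = addresses,
               stringsAsFactors = FALSE)  # keep strings as character, not factor
  })
  do.call(rbind, results)
}

# Offline stand-in for the real API call (illustration only)
fake_fetch <- function(q) {
  if (q == "Washington, USA") {
    c("Washington, DC, USA", "Washington, UT, USA")
  } else {
    "Frankfurt, Germany"
  }
}

out <- geocode_throttled(c("Washington, USA", "Frankfurt, Germany"),
                         fetch = fake_fetch, pause = 0)
```

Passing the lookup in as a function keeps the throttling and the row-binding separate from however the addresses are actually retrieved.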

Answers

0

I've written the package googleway to access the Google Maps API with a valid API key (so if your data has more than 2,500 items, you can pay for an API key).

To get the address details, use google_geocode():

library(googleway) 

key <- "your_api_key" 

dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany")) 

## To get all the data: 
res <- apply(dat, 1, function(x){ 
    google_geocode(address = x["query"], 
                   key = key) ## use simplify = FALSE to return JSON 
}) 

## to access the 'formatted address' part, see 
res[[1]]$results$formatted_address 
# [1] "Washington, DC, USA"       "Washington, UT, USA"       "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"

## so to get everything as a list 
lapply(res, function(x){ 
    x$results$formatted_address 
}) 

# [[1]] 
# [1] "Washington, DC, USA"       "Washington, UT, USA"       "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"
# 
# [[2]] 
# [1] "Frankfurt, Germany" 

## and to put back onto your original data.frame: 
lst <- lapply(seq_along(res), function(x){ 
    ## one row per formatted address; the query is recycled across them
    data.frame(query = dat[x, "query"], 
               formatted_address = res[[x]]$results$formatted_address) 
}) 

data.table::rbindlist(lst) 
#                 query         formatted_address
# 1:    Washington, USA       Washington, DC, USA
# 2:    Washington, USA       Washington, UT, USA
# 3:    Washington, USA Washington, VA 22747, USA
# 4:    Washington, USA Washington, IA 52353, USA
# 5:    Washington, USA Washington, GA 30673, USA
# 6:    Washington, USA Washington, PA 15301, USA
# 7: Frankfurt, Germany        Frankfurt, Germany
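
Since the question asks for something closer to the data.table style, the last step can perhaps be shortened by naming the list and letting rbindlist() fill in the query column through its idcol argument. This is only a sketch: `addresses` is a hypothetical stand-in holding a few of the addresses shown above, so that it runs without an API key; with real results it would be built from res[[i]]$results$formatted_address.

```r
library(data.table)

# Stand-in for the per-query addresses (illustration only); the list names
# are the original queries
addresses <- list(
  "Washington, USA"    = c("Washington, DC, USA", "Washington, UT, USA"),
  "Frankfurt, Germany" = "Frankfurt, Germany"
)

# rbindlist() stacks the per-query tables and, via idcol, writes each
# list element's name into a new "query" column
long <- rbindlist(
  lapply(addresses, function(a) data.table(formatted_address = a)),
  idcol = "query"
)
```

This avoids indexing back into the original data.frame, at the cost of requiring the queries to be attached to the results as list names.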