2015-08-24 75 views

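rvest: getting the hyperlinks in tables

I want to scrape the geocodes embedded in the hyperlinks and combine all the tables, together with those geocodes, into a single table.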

What I have done so far is use the code below:

library(rvest) 

url<-"http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html" 

citidata <- read_html(url)  # html() is deprecated in newer rvest releases
ta<- citidata %>% 
html_nodes("table") %>% 
.[1:29] %>% 
html_table() 

dat<-do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE)) 

citystate <- citidata %>% 
html_node("h1 span") %>% 
html_text() 

citystate <- gsub("Fatal car crashes and road traffic accidents in ", 
        "", citystate) 

loc<-data.frame(matrix(unlist(strsplit(citystate, ",", fixed = TRUE)), ncol=2, byrow=TRUE)) 
dat$City<-loc$X1 
dat$State<-loc$X2 

This gives me a table like the following:

Date,Location,Vehicles,Drunken.persons,Fatalites,Persons,Pedestrians,City,State 
1 Jun 26, 2013 87:99 PM, Temple Street, 1, -, 1, 1, -, Nashua, New Hampshire 

I then tried to add the geocodes to the data frame, but I don't know how to do it.

Below is the code that scrapes the geocodes from the hyperlinks.

pg <- read_html("http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html") 
geo <- data.frame(gsub("javascript:showGoogleSView","",pg %>% html_nodes("a") %>% html_attr("href") %>% .[31:60])) 

One problem (initially) is that `dat` has 98 rows and `geo` has 30. – hrbrmstr


Yes, not all of the records come with a geolocation. – Jen

Answer


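Not all incidents have an associated lat/lon pair. The code below relies on the fact that the incident dates are (apparently) unique, and uses them to merge the coordinates into the main `dat` you built earlier: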

library(rvest) 
library(stringr) 
library(dplyr) 

url <- "http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html" 

# Get all incident tables ------------------------------------------------- 

citidata <- read_html(url)  # html() is deprecated in newer rvest releases

ta <- citidata %>% 
    html_nodes("table") %>% 
    .[1:29] %>% 
    html_table() 

# rbind them together ----------------------------------------------------- 

dat <- do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE)) 

citystate <- citidata %>% 
    html_node("h1 span") %>% 
    html_text() 

# Get city/state and add it to the data.frame ------------------------------- 

citystate <- gsub("Fatal car crashes and road traffic accidents in ", 
        "", citystate) 

loc <- data.frame(matrix(unlist(strsplit(citystate, ",", fixed=TRUE)), 
         ncol=2, byrow=TRUE)) 

dat$City <- loc$X1 
dat$State <- loc$X2 

# Get GPS coords where available ------------------------------------------ 

coords <- citidata %>% 
    html_nodes(xpath="//a[@class='showStreetViewLink']") %>% 
    html_attr("href") %>% 
    str_extract("([[:digit:]-,\\.]+)") %>% 
    str_split(",") %>% 
    unlist() %>% 
    matrix(ncol=2, byrow=TRUE) %>% 
    data.frame(stringsAsFactors=FALSE) %>% 
    rename(lat=X1, lon=X2) %>% 
    mutate(lat=as.numeric(lat), lon=as.numeric(lon)) 

# Get GPS coordinates associated incident time for merge ------------------ 

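# (uses citidata, the page parsed above; `pg` is not defined in this script)
coord_time <- citidata %>% 
    html_nodes(xpath="//a[@class='showStreetViewLink']/../preceding-sibling::td[1]") %>% 
    html_text() %>% 
    data.frame(Date=., stringsAsFactors=FALSE) 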

# Merge the coordinates with the data.frame we built earlier -------------- 

# Join on Date explicitly; incidents without coordinates get NA for lat/lon
left_join(dat, bind_cols(coords, coord_time), by="Date") 
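
The coordinate-extraction step can be tried offline on a couple of strings. This is only a sketch: the `href` values below are hypothetical and merely assumed to resemble the Street View links on the page, and it uses base R (`regmatches`, `strsplit`) rather than the stringr calls above.

```r
# Hypothetical href values (assumed format, not copied from the page)
hrefs <- c("javascript:showGoogleSView(42.7575,-71.4644)",
           "javascript:showGoogleSView(42.7654,-71.4676)")

# Pull the first "number,number" run out of each href.
# Note: regmatches() silently drops elements that do not match.
pairs <- regmatches(hrefs, regexpr("-?[0-9.]+,-?[0-9.]+", hrefs))

# Split each pair on the comma and stack the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(pairs, ",", fixed = TRUE))

# Convert to numeric latitude/longitude columns
coords <- data.frame(lat = as.numeric(parts[, 1]),
                     lon = as.numeric(parts[, 2]))
coords
```

If the real links use a different format, the regular expression would need adjusting accordingly.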

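Yes, some rows have no link available; I figured I could separate those out before merging everything together. But another problem is: if the rows are not in order (some are missing in the middle), how do I match them up with the times? – Jen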
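
On the ordering question, a join keyed on `Date` matches rows by value rather than by position, so gaps and out-of-order rows should not matter as long as the dates really are unique. Here is a minimal sketch on made-up data (every date, street name other than Temple Street, and coordinate below is invented for illustration):

```r
library(dplyr)

# Toy stand-in for `dat`: four incidents, dates assumed unique
dat <- data.frame(
  Date = c("Jun 26, 2013 07:10 PM", "May 02, 2013 01:15 AM",
           "Jan 14, 2012 11:30 PM", "Sep 09, 2011 05:45 PM"),
  Location = c("Temple Street", "Main Street", "Broad Street", "Canal Street"),
  stringsAsFactors = FALSE
)

# Toy coordinates: only two incidents have them, listed in a different order
geo <- data.frame(
  Date = c("Jan 14, 2012 11:30 PM", "Jun 26, 2013 07:10 PM"),
  lat  = c(42.76, 42.75),
  lon  = c(-71.47, -71.46),
  stringsAsFactors = FALSE
)

# Keep every row of dat; rows with no match in geo get NA for lat/lon
merged <- left_join(dat, geo, by = "Date")
merged
```

The result keeps the four rows of `dat` in their original order, with coordinates filled in for the two matching dates and `NA` elsewhere. If two incidents ever shared the same date string, the join would duplicate rows, so the uniqueness assumption is worth checking first.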