Looping through URLs and storing information in R

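I want to write a for loop that goes through a number of websites, extracts a few elements from each, and stores the results in a table in R. This is what I have so far; I just don't know how to start the for loop, or how to copy all the results into one variable to export later.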

library("dplyr") 
library("rvest") 
library("leaflet") 
library("ggmap") 


url <- read_html("http://www.webiste_name.com/")

agent <- html_nodes(url,"h1 span") 
fnames<-html_nodes(url, "#offNumber_mainLocContent span") 
address <- html_nodes(url,"#locStreetContent_mainLocContent") 

scrape<-t(c(html_text(agent),html_text(fnames),html_text(address))) 


View(scrape)

Source

2016-07-28 CHopp

I would go with lapply.

The code would look something like this:

library("rvest") 
library("dplyr") 

#a vector of urls you want to scrape 
URLs <- c("http://...1", "http://...2") # ... and so on

df <- lapply(URLs, function(u){ 

    # parse the page once, then pull out each element as text
    html.obj <- read_html(u)
    agent <- html_nodes(html.obj, "h1 span") %>% html_text()
    fnames <- html_nodes(html.obj, "#offNumber_mainLocContent span") %>% html_text()
    address <- html_nodes(html.obj, "#locStreetContent_mainLocContent") %>% html_text()

    data.frame(Agent=agent, Fnames=fnames, Address=address)
}) 

df <- do.call(rbind, df)

View(df)
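To show what the final `do.call(rbind, ...)` step does, here is a small sketch that needs no network access. The two data frames are made-up stand-ins for what each `lapply` call would return; their values are not from any real site.

```r
# Toy per-page results, standing in for what each lapply() call returns.
pieces <- list(
  data.frame(Agent = "A", Address = "1 Main St", stringsAsFactors = FALSE),
  data.frame(Agent = c("B", "C"), Address = c("2 Oak Ave", "3 Elm Rd"),
             stringsAsFactors = FALSE)
)

# do.call() passes the list elements to rbind() as separate arguments,
# which is the same as calling rbind(pieces[[1]], pieces[[2]]).
combined <- do.call(rbind, pieces)

print(combined)
```

Each page contributes as many rows as it has matches, so the combined table has one row per scraped record rather than one per URL.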

Source

2016-07-28 22:32:50

Works great! How can I adjust it so that the data from each scrape is stored in its own row? Right now it stores them all next to each other – CHopp

I'm not sure I understand your question. In the data.frame inside `lapply`, you can use `data.frame(Agent = agent, Fnames = fnames, Address = address, URL = u)` to record the corresponding URL on each row –
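A minimal sketch of that suggestion, using inline HTML in place of real pages so it runs offline. The `example.com` URLs and the page contents are invented for illustration; only the selectors come from the question.

```r
library(rvest)

# Hypothetical stand-ins for downloaded pages: inline HTML keyed by a made-up URL.
pages <- c(
  "http://example.com/a" = "<h1><span>Agent A</span></h1><div id='locStreetContent_mainLocContent'>1 Main St</div>",
  "http://example.com/b" = "<h1><span>Agent B</span></h1><div id='locStreetContent_mainLocContent'>2 Oak Ave</div>"
)

rows <- lapply(names(pages), function(u) {
  # read_html() also accepts a string of HTML, which keeps this example offline
  html.obj <- read_html(pages[[u]])
  agent <- html.obj %>% html_nodes("h1 span") %>% html_text()
  address <- html.obj %>% html_nodes("#locStreetContent_mainLocContent") %>% html_text()
  # URL = u tags every row with the page it came from
  data.frame(Agent = agent, Address = address, URL = u, stringsAsFactors = FALSE)
})

df <- do.call(rbind, rows)
print(df)
```

With the URL column in place, each row can be traced back to its source page after the results are combined.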

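I figured it out, but another question: why would I get an error like "Error: 'www.website.com' does not exist in current working directory" when trying to scrape a website? – CHopp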
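One likely cause, assuming the string was passed to `read_html()` exactly as shown in the error: without a leading `http://` or `https://`, the string is treated as a local file path rather than a URL. The helper below, `add_scheme`, is a hypothetical name introduced here for illustration; it prepends `http://` only when no scheme is present.

```r
# Hypothetical helper: prepend "http://" when a URL has no scheme.
add_scheme <- function(x) {
  # a scheme looks like "http://", "https://", "ftp://", ...
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", x)
  ifelse(has_scheme, x, paste0("http://", x))
}

fixed <- add_scheme(c("www.website.com", "https://www.website.com"))
print(fixed)
```

The cleaned strings can then be passed to `read_html()` in place of the originals.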

Given that your question is not fully reproducible, here is a toy example that loops through three URLs (Red Sox, Blue Jays, and Yankees):

library(rvest) 

# teams 
teams <- c("BOS", "TOR", "NYY") 

# init 
df <- NULL 

# loop 
for(i in teams){ 
    # find url 
    url <- paste0("http://www.baseball-reference.com/teams/", i, "/") 
    page <- read_html(url) 
    # grab table 
    table <- page %>% 
     html_nodes(css = "#franchise_years") %>% 
     html_table() %>% 
     as.data.frame() 
    # bind to dataframe 
    df <- rbind(df, table) 
} 

# view captured data 
View(df)

The loop works because, on each pass, paste0 substitutes the next team abbreviation for i, so the teams are visited in order.
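The substitution can be checked without touching the network. In the sketch below, `fetch_table` is a hypothetical stand-in for the `read_html()` and `html_table()` steps, and results are collected in a preallocated list instead of growing `df` with `rbind` on every pass; that is a design choice of this sketch, not part of the answer above.

```r
teams <- c("BOS", "TOR", "NYY")

# paste0() is vectorized, so all three URLs can be built in one call.
urls <- paste0("http://www.baseball-reference.com/teams/", teams, "/")
print(urls)

# Hypothetical stand-in for the scraping step, so the example runs offline.
fetch_table <- function(url) data.frame(url = url, stringsAsFactors = FALSE)

# Preallocate one slot per URL, fill it in the loop, and bind once at the end.
results <- vector("list", length(urls))
for (k in seq_along(urls)) {
  results[[k]] <- fetch_table(urls[k])
}
df <- do.call(rbind, results)
```

Binding once at the end avoids copying the accumulated data frame on every iteration, which can matter when the list of URLs is long.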

Source

2016-07-28 21:05:25 emehex
