2015-01-08 85 views
4

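I want to use a loop to scrape data from multiple web pages in R. I can scrape the data from a single page, but when I try to loop over several pages I get a frustrating error. I have spent hours tinkering, to no avail. Any help would be greatly appreciated! How can I use a loop to scrape website data from multiple web pages in R?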

This works:

########################### 
# GET COUNTRY DATA 
########################### 

library("rvest") 

site <- paste("http://www.countryreports.org/country/","Norway",".htm", sep="") 
site <- html(site) 

stats<- 
    data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
     facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
     stringsAsFactors=FALSE) 

stats$country <- "Norway" 
stats$names <- gsub('[\r\n\t]', '', stats$names) 
stats$facts <- gsub('[\r\n\t]', '', stats$facts) 
View(stats) 

However, when I try to write this as a loop, I get an error:

########################### 
# ATTEMPT IN A LOOP 
########################### 

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain") 

for(i in country){ 

site <- paste("http://www.countryreports.org/country/",country,".htm", sep="") 
site <- html(site) 

stats<- 
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
     facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
     stringsAsFactors=FALSE) 

stats$country <- country 
stats$names <- gsub('[\r\n\t]', '', stats$names) 
stats$facts <- gsub('[\r\n\t]', '', stats$facts) 

stats<-rbind(stats,stats) 
stats<-stats[!duplicated(stats),] 
} 

Error:

Error: length(url) == 1 is not TRUE 
In addition: Warning message: 
In if (grepl("^http", x)) { : 
    the condition has length > 1 and only the first element will be used 
+0

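Same result here. I tried this code and got the same error message, even though the non-loop version works! > length(site) [1] 7 > stopifnot(length(site) == 1) Error: length(site) == 1 is not TRUE – lawyeR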

+1

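On this line: 'site <- paste("http://www.countryreports.org/country/",country,".htm", sep="")' you are using 'country', which in the loop version is the character vector holding all of your countries. You probably want 'i', which is a single element of your country vector. – zelite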
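To make the comment above concrete, here is a small sketch of how paste() vectorizes over its arguments. It only builds URL strings and fetches nothing; the two-country vector is a shortened stand-in for the full list in the question.

```r
country <- c("Norway", "Sweden")

# Passing the whole vector yields one URL per country. html() expects a
# single URL, which is why it fails with "length(url) == 1 is not TRUE".
urls_all <- paste("http://www.countryreports.org/country/", country, ".htm", sep = "")
length(urls_all)

for (i in country) {
  # Passing the loop variable yields exactly one URL per iteration.
  url_one <- paste("http://www.countryreports.org/country/", i, ".htm", sep = "")
  stopifnot(length(url_one) == 1)
  print(url_one)
}
```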

+0

zelite - that got me much closer - thank you. –

Answers

5

The code that finally worked:

########################### 
# THIS WORKS!!!! 
########################### 

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain") 

for(i in country){ 

site <- paste("http://www.countryreports.org/country/",i,".htm", sep="") 
site <- html(site) 

stats<- 
data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
    facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
     stringsAsFactors=FALSE) 

stats$nm <- i 
stats$names <- gsub('[\r\n\t]', '', stats$names) 
stats$facts <- gsub('[\r\n\t]', '', stats$facts) 
#stats<-stats[!duplicated(stats),] 
all<-rbind(all,stats) 

} 
View(all) 
+1

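Does this really work for you? I am trying to do something similar, so I ran your code and got the following error: Error in rep(xi, length.out = nvar) : attempt to replicate an object of type 'builtin'. Did you initialize 'all' somewhere earlier? –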
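The error in this comment is consistent with 'all' never having been assigned: in a fresh session, all is the base R function, so rbind(all, stats) tries to bind a function to a data frame. One way to avoid relying on a pre-existing object is to collect the per-country tables in a list and bind them once after the loop. The sketch below assumes a hypothetical fake_stats() helper that stands in for the scraped table, so it runs without network access.

```r
# `all` is a base R function unless something else has been assigned to it.
is.function(base::all)

# Hypothetical stand-in for the scraped table; the real loop would build
# `stats` from the downloaded page instead.
fake_stats <- function(nm) {
  data.frame(names = c("Capital", "Population"),
             facts = c("(placeholder)", "(placeholder)"),
             nm = nm,
             stringsAsFactors = FALSE)
}

country <- c("Norway", "Sweden", "Finland")
pieces <- vector("list", length(country))  # one slot per country
names(pieces) <- country

for (i in country) {
  pieces[[i]] <- fake_stats(i)
}

# Bind all per-country tables once, after the loop.
result <- do.call(rbind, pieces)
rownames(result) <- NULL
nrow(result)
```

Binding once at the end also avoids growing a data frame on every iteration, which matters more as the number of pages increases.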

0

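This is what I did. It is not the best solution, but you will get an output. It is only a workaround. I don't recommend writing the table output to a file while the loop is running. Here you go. After the output is generated from stats,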

output<-rbind(stats,i) 

then write the table,

write.table(output, file = "D:\\Documents\\HTML\\Test of loop.csv", row.names = FALSE, append = TRUE, sep = ",") 

#then close the loop 
} 

Good luck
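If the table is written inside the loop as above, note that write.table() with append = TRUE and the default col.names repeats the header row for every country and warns about it. A minimal sketch of one way around that follows. It assumes a temporary file in place of the D:\\ path and a placeholder data frame in place of the scraped stats.

```r
out_file <- tempfile(fileext = ".csv")  # hypothetical output path

for (i in c("Norway", "Sweden")) {
  # Placeholder for the scraped table built inside the real loop.
  stats <- data.frame(names = "Capital", facts = "(placeholder)", nm = i,
                      stringsAsFactors = FALSE)

  # Write the header only when the file does not exist yet,
  # and append on every later iteration.
  first_write <- !file.exists(out_file)
  write.table(stats, file = out_file, sep = ",", row.names = FALSE,
              col.names = first_write, append = !first_write)
}

written <- read.csv(out_file, stringsAsFactors = FALSE)
```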

1

Just initialize an empty data frame before the loop. I ran into this same problem, and the code below works for me.

country<-c("Norway","Sweden","Finland","France","Greece","Italy","Spain") 
df <- data.frame(names = character(0),facts = character(0),nm = character(0)) 

for(i in country){ 

    site <- paste("http://www.countryreports.org/country/",i,".htm", sep="") 
    site <- html(site) 

    stats<- 
    data.frame(names =site %>% html_nodes(xpath="//*/td[1]") %>% html_text() , 
       facts =site %>% html_nodes(xpath="//*/td[2]") %>% html_text() , 
       stringsAsFactors=FALSE) 

    stats$nm <- i 
    stats$names <- gsub('[\r\n\t]', '', stats$names) 
    stats$facts <- gsub('[\r\n\t]', '', stats$facts) 
    #stats<-stats[!duplicated(stats),] 
    #all<-rbind(all,stats) 
    df <- rbind(df, stats) 
    #all <- merge(Output,stats) 

} 
View(df)
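For readers on a newer rvest release, where html() was deprecated in favor of read_html(), the same approach could be sketched roughly as follows. The helper name scrape_stats() and the inline demo HTML are assumptions made for illustration; the XPath expressions and the gsub() cleanup are taken from the code above. The live loop is left commented out because it depends on the site still serving those pages.

```r
library(rvest)

# Hypothetical helper: turn one parsed page into a data frame of
# name/fact pairs, tagged with the country it came from.
scrape_stats <- function(page, nm) {
  cells <- function(xp) {
    gsub("[\r\n\t]", "", html_text(html_nodes(page, xpath = xp)))
  }
  data.frame(names = cells("//*/td[1]"),
             facts = cells("//*/td[2]"),
             nm = nm,
             stringsAsFactors = FALSE)
}

# Offline check on made-up HTML, so the helper can be tried without network.
demo_page <- read_html(
  "<table><tr><td>Capital\n</td><td>Oslo</td></tr><tr><td>Currency</td><td>\tKrone</td></tr></table>"
)
demo <- scrape_stats(demo_page, "Norway")

# Live version (not run here):
# country <- c("Norway","Sweden","Finland","France","Greece","Italy","Spain")
# pages <- lapply(country, function(i) {
#   url <- paste0("http://www.countryreports.org/country/", i, ".htm")
#   scrape_stats(read_html(url), i)
# })
# df <- do.call(rbind, pages)
```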