2014-02-05

How can I avoid a 'for' loop in R to process CSV files faster?

My CSV files look like this:

Data1.csv

BusinessNeedParent,BusinessNeedChild,Identifier 
a1,b1,45 
a2,b2,60 
a3,b3,56 

Data2.csv

AdvertiserName,BusinessNeedNumber,State,City 
worker,45,Calif,Los angeles 
workplace,45,Calif,San Diego 
platoon,60,Connec,Bridgeport 
teracota,56,New York,Albany 

The output I want:

AdvertiserName,BusinessNeedParent,BusinessNeedChild,State,City 
worker,a1,b1,Calif,Los angeles 
workplace,a1,b1,Calif,San Diego 
platoon,a2,b2,Connec,Bridgeport 
teracota,a3,b3,New York,Albany 

So it has to match Identifier against BusinessNeedNumber and generate the CSV data shown above. My code so far is:

# Data2.csv carries BusinessNeedNumber; Data1.csv carries Identifier and the parent/child names
record <- read.csv("Data2.csv",header=TRUE)
businessneedinformation <- read.csv("Data1.csv",header=TRUE)

List <- NULL # accumulator; must exist before the first rbind
for(i in record$BusinessNeedNumber){
    if(i %in% businessneedinformation$Identifier){
        keyword <- "NA"
        busparent <- businessneedinformation$BusinessNeedParent[which(businessneedinformation$Identifier==i)]
        buschild <- businessneedinformation$BusinessNeedChild[which(businessneedinformation$Identifier==i)]
        # strip commas before building the campaign name
        replacementbusparent <- gsub(pattern=",",replacement="",x=busparent)
        replacementbuschild <- gsub(pattern=",",replacement="",x=buschild)
        campname <- paste("cat","|","bus","|","en-us","|",(tolower(as.character(replacementbusparent[1]))),"|",(tolower(as.character(replacementbuschild[1]))),sep="")
        thislist <- data.frame(Keyword = keyword,BusinessNeedParent = busparent,BusinessNeedChild = buschild,Campaign=campname)
        # append only when a match was found, so no stale row is reused
        List <- rbind(List, thislist)
    }
}

Because I am using a for loop, this is very slow: with almost 100,000 entries it takes a long time. What is a faster way to do this in R using indexing?

+0

If speed is an issue, use 'fread' instead of 'read.csv'. Or at least specify the data types with the 'colClasses' argument. –
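To illustrate the 'colClasses' suggestion, here is a minimal sketch in base R. It assumes the column layout of Data2.csv shown in the question and reads an inline copy of that file through the `text` argument so it runs on its own; with a real file you would pass the file name instead. The name `info` is made up for this example.

```r
# Inline copy of Data2.csv from the question, so the sketch is self-contained
csv_text <- "AdvertiserName,BusinessNeedNumber,State,City
worker,45,Calif,Los angeles
workplace,45,Calif,San Diego
platoon,60,Connec,Bridgeport
teracota,56,New York,Albany"

# Declaring the column types up front means read.csv does not have to
# guess them from the data
info <- read.csv(text = csv_text, header = TRUE,
                 colClasses = c("character", "integer", "character", "character"))

str(info)
```

'fread' comes from the data.table package, which has to be installed separately, so it is left out of this sketch.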

+1

Added another approach that uses 'Reduce' – RUser

Answers

1
> zz <- "BusinessNeedParent,BusinessNeedChild,Identifier 
a1,b1,45 
a2,b2,60 
a3,b3,56" 
> Data <- read.table(text=zz, header = TRUE,sep=',') 
> Data 
  BusinessNeedParent BusinessNeedChild Identifier
1                 a1                b1         45
2                 a2                b2         60
3                 a3                b3         56
> zz1 <- "AdvertiserName,BusinessNeedNumber,State,City 
worker,45,Calif,Los angeles 
workplace,45,Calif,San Diego 
platoon,60,Connec,Bridgeport 
teracota,56,New York,Albany" 
> Data1 <- read.table(text=zz1, header = TRUE,sep=',') 
> Data1 
  AdvertiserName BusinessNeedNumber    State        City
1         worker                 45    Calif Los angeles
2      workplace                 45    Calif   San Diego
3        platoon                 60   Connec  Bridgeport
4       teracota                 56 New York      Albany
> m <- merge(Data,Data1,by.x="Identifier",by.y="BusinessNeedNumber") 
> m[,c(4,2,3,5,6)] 
  AdvertiserName BusinessNeedParent BusinessNeedChild    State        City
1         worker                 a1                b1    Calif Los angeles
2      workplace                 a1                b1    Calif   San Diego
3       teracota                 a3                b3 New York      Albany
4        platoon                 a2                b2   Connec  Bridgeport
write.csv(m, file = "demoMerge.csv") 

Alternatively, you can use Reduce, where list_of_files is a list of the data frames to merge:

m1 <- Reduce(function(old, new) { merge(old, new, by.x='Identifier', by.y='BusinessNeedNumber') }, list_of_files) 
> m1 
  Identifier BusinessNeedParent BusinessNeedChild AdvertiserName    State        City
1         45                 a1                b1         worker    Calif Los angeles
2         45                 a1                b1      workplace    Calif   San Diego
3         56                 a3                b3       teracota New York      Albany
4         60                 a2                b2        platoon   Connec  Bridgeport
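The Reduce call above relies on a `list_of_files` object that the answer never defines. The following sketch assumes it is simply a list holding the two data frames from the question, in the order (Data, Data1); that assumption is mine, not the answerer's.

```r
# The two tables from the question, read from inline text
Data <- read.csv(text = "BusinessNeedParent,BusinessNeedChild,Identifier
a1,b1,45
a2,b2,60
a3,b3,56")

Data1 <- read.csv(text = "AdvertiserName,BusinessNeedNumber,State,City
worker,45,Calif,Los angeles
workplace,45,Calif,San Diego
platoon,60,Connec,Bridgeport
teracota,56,New York,Albany")

# Assumption: list_of_files is a list of the data frames to be merged
list_of_files <- list(Data, Data1)

# Reduce folds merge() over the list from left to right; with two
# elements this amounts to a single merge(Data, Data1, ...) call
m1 <- Reduce(function(old, new) {
  merge(old, new, by.x = "Identifier", by.y = "BusinessNeedNumber")
}, list_of_files)

m1
```

With only two data frames this gives the same result as calling merge directly. For longer lists, every later element would also need a BusinessNeedNumber column, because the same by.x and by.y names are reused at each step.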
+0

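This merge works fine for me, but the 'm' it returns has a different length from Data1, whereas they should have the same length; I don't understand what is happening. Also, what exactly does 'merge' do? – user3188390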

+0

'?merge' should guide you – RUser
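One possible reason for the length difference, sketched here with made-up data: by default merge keeps only rows whose key appears in both data frames, so a row whose BusinessNeedNumber has no matching Identifier is dropped. The "orphan" row and the names `lookup` and `ads` are hypothetical, added purely to show the effect; whether this is what happened with the asker's real data is not known from the thread.

```r
lookup <- data.frame(
  BusinessNeedParent = c("a1", "a2", "a3"),
  BusinessNeedChild  = c("b1", "b2", "b3"),
  Identifier         = c(45, 60, 56)
)

# Hypothetical extra row "orphan" with a key (99) that is not in lookup
ads <- data.frame(
  AdvertiserName     = c("worker", "workplace", "platoon", "teracota", "orphan"),
  BusinessNeedNumber = c(45, 45, 60, 56, 99)
)

# Default behaviour: only keys present in both tables survive
inner <- merge(lookup, ads, by.x = "Identifier", by.y = "BusinessNeedNumber")
nrow(inner)     # 4, one fewer than nrow(ads)

# all.y = TRUE keeps every row of the second table, filling the
# unmatched columns with NA
keep_all <- merge(lookup, ads, by.x = "Identifier", by.y = "BusinessNeedNumber",
                  all.y = TRUE)
nrow(keep_all)  # 5
```

The opposite can also occur: if an Identifier appears more than once in the lookup table, each matching row is repeated once per duplicate, so the result ends up longer than the input.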