Manipulation of large files in R

I have 15 data files, each about 4.5 GB. Each file holds one month of data for roughly 17,000 customers; together they represent 17,000 customers over 15 months. I want to reformat the data so that, instead of 15 files each representing a month, there are 17,000 files each representing a customer. I wrote a script to do this:
#the variable 'files' is a vector of paths to the 15 month files
library(data.table) #fread comes from data.table

exists = NULL #keeps track of customers who already have a file created for them
for (w in 1:15) { #for each of the 15 month files
  month = fread(files[w], select = c(2, 3, 6, 16)) #read in only the columns I want
  custlist = unique(month$CustomerID) #a list of all customers in this month file
  for (i in 1:length(custlist)) { #for each customer in this month file
    curcust = custlist[i] #the current customer
    newchunk = subset(month, CustomerID == curcust) #all the data for this customer
    filename = sprintf("C:/custFiles/cust%s.csv", curcust) #this customer's file path
    if (curcust %in% exists) { #a file exists for this customer: read it, append, write it back
      custfile = fread(filename) #read in the existing file
      custfile = rbind(custfile, newchunk) #combine the read-in data with the new data
      write.csv(custfile, file = filename, row.names = FALSE)
    } else { #no file yet: write newchunk to a new csv
      #row.names = FALSE avoids the extra column that otherwise has to be stripped on re-read
      write.csv(newchunk, file = filename, row.names = FALSE)
      exists = c(exists, curcust) #record that this customer now has a file
    }
  }
}
The script works (at least, I'm pretty sure it does). The problem is that it is extremely slow: at the rate it is going, it will take a week or more to finish, and I don't have that kind of time. Is there a better, faster way to do this in R? Should I try doing it in something like SQL? I have never really used SQL; could anyone show me how something like that would be done? Any input is greatly appreciated.
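For the R route, one possibility is sketched below; it assumes data.table is at least version 1.9.8 (which added fwrite). Appending to each customer file with fwrite(append = TRUE) removes the read-rbind-rewrite cycle entirely, so each customer file is only ever written to, never read back in:

# A sketch of a faster variant, assuming data.table >= 1.9.8 (for fwrite).
# Each customer file is appended to in place instead of being re-read and rewritten.
library(data.table)

for (w in 1:15) { #for each of the 15 month files
  month = fread(files[w], select = c(2, 3, 6, 16)) #read in the data I want
  for (curcust in unique(month$CustomerID)) { #for each customer in this month file
    filename = sprintf("C:/custFiles/cust%s.csv", curcust)
    #append if the file already exists (fwrite then skips the header row);
    #otherwise create it with a header
    fwrite(month[CustomerID == curcust], file = filename,
           append = file.exists(filename))
  }
}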
Not entirely sure why the post was downvoted. I certainly don't mind, but please let me know if I've broken some kind of etiquette. Or is the question perhaps too simple? –
Sometimes people downvote because they feel the asker hasn't put enough research/effort into solving the problem. I don't entirely agree with that, and you've got my upvote to offset it. ;) –
Do you really want 17,000 files? …R's capabilities, once the files are… –
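On the SQL question (and the last comment's point that 17,000 files may not be what you want): below is a minimal sketch using DBI and RSQLite, which loads all 15 month files into one indexed SQLite table so any customer's data can be pulled on demand. The table name "transactions" and the example customer ID are made up for illustration.

# A sketch of the SQL route via RSQLite: one database instead of 17,000 CSVs.
library(data.table)
library(DBI)

con = dbConnect(RSQLite::SQLite(), "C:/custFiles/customers.sqlite")

for (w in 1:15) { #load each month file into the same table
  month = fread(files[w], select = c(2, 3, 6, 16))
  dbWriteTable(con, "transactions", month, append = TRUE)
}

#index once so per-customer lookups are fast
dbExecute(con, "CREATE INDEX idx_cust ON transactions (CustomerID)")

#pull a single customer's 15 months of data on demand (12345 is a made-up ID)
onecust = dbGetQuery(con,
  "SELECT * FROM transactions WHERE CustomerID = ?",
  params = list(12345))

dbDisconnect(con)

SQLite handles databases of tens of GB comfortably, and with the index each query touches only the requested customer's rows, so there is no need to materialize 17,000 separate files at all.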