Manipulating large files in R

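I have 15 data files, each about 4.5GB. Each file holds one month of data for about 17,000 customers. Altogether, the data represent information on 17,000 customers over 15 months. I want to reformat this data so that, instead of 15 files each representing a month, I have 17,000 files, one per customer, each holding all of that customer's data. I wrote a script to do this: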

#the variable 'files' is a vector of locations of the 15 month files
exists = NULL #This vector keeps track of customers who have a file created for them
for (w in 1:15){ #for each of the 15 month files
    month = fread(files[w],select = c(2,3,6,16)) #read in the data I want
    custlist = unique(month$CustomerID) #a list of all customers in this month file
    for (i in 1:length(custlist)){ #for each customer in this month file
        curcust = custlist[i] #the current customer
        newchunk = subset(month,CustomerID == curcust) #all the data for this customer
        filename = sprintf("cust%s",curcust) #what the filename for this customer will be, or is
        if ((curcust %in% exists) == TRUE){ #check if a file has been created for this customer. If a file has been created, open it, add to it, and write it back
            custfile = fread(strwrap(sprintf("C:/custFiles/%s.csv",filename))) #read in file
            custfile$V1 = NULL #remove an extra column the fread adds
            custfile = rbind(custfile,newchunk) #combine read in data with our new data
            write.csv(custfile,file = strwrap(sprintf("C:/custFiles/%s.csv",filename)))
        } else { #if it has not been created, write newchunk to a csv
            write.csv(newchunk,file = strwrap(sprintf("C:/custFiles/%s.csv",filename)))
            exists = rbind(exists,curcust,deparse.level = 0) #add customer to list of existing files
        }
    }
}

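The script works (at least, I'm pretty sure it does). The problem is that it is incredibly slow. At the rate I'm going, it will take a week or more to finish, and I don't have that time. Is there a better, faster way to do this in R? Should I try to do this in something like SQL? I've never really used SQL; could any of you show me how something like this would be done? Any input is greatly appreciated.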

Source

2015-04-12 Ore M

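Not entirely sure why the post was downvoted. I certainly don't mind, but please let me know if I've broken some kind of etiquette. Or maybe the question is too simple?

Sometimes people downvote because they feel the asker hasn't done enough research or put in enough effort to solve the problem. I don't entirely agree, so you get my upvote to offset it. ;)

Do you really want 17,000 files? R's capabilities, once the files are ...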

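Like @Dominic Comtois, I would also recommend using SQL.
R can handle quite big data (there is a nice benchmark on more than 200 million rows in which it beats Python), but because R runs mostly in memory you need a good machine to make it work. Still, your case doesn't require loading more than one 4.5GB file at a time, so it should be well doable on a personal computer; see the second approach below for a fast non-database solution.
You can use R to load the data into a SQL database and later query it from the database. If you don't know SQL, you may want to use a simple database. The simplest way from R is to use RSQLite (unfortunately, since v1.1 it is not so lite any more). You don't need to install or manage any external dependency: the RSQLite package contains an embedded database engine.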

library(RSQLite) 
library(data.table) 
conn <- dbConnect(dbDriver("SQLite"), dbname="mydbfile.db") 
monthfiles <- c("month1","month2") # ... 
# write data 
for(monthfile in monthfiles){ 
    dbWriteTable(conn, "mytablename", fread(monthfile), append=TRUE) 
    cat("data for",monthfile,"loaded to db\n") 
} 
# query data 
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1") 
# when working with bigger sets of data I would recommend to do below 
setDT(df) 
dbDisconnect(conn)

That's all. You get to use SQL without much of the overhead usually associated with databases.
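To make the load-then-query pattern concrete, here is a minimal sketch on toy data. The in-memory database, the made-up month tables, the `Usage` column and the index name `idx_customer` are all assumptions for illustration; the index is an optional extra not mentioned above, which would typically help per-customer lookups on a large table.

```r
library(DBI)
library(RSQLite)

# In-memory database so the sketch leaves no file behind (assumption: the real
# data would go to a file such as "mydbfile.db")
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# Two toy "month" tables standing in for the real month files (made-up values)
month1 <- data.frame(CustomerID = c(1, 2), Month = 1, Usage = c(10.5, 3.2))
month2 <- data.frame(CustomerID = c(1, 2), Month = 2, Usage = c(11.0, 2.9))

# Append each month to the same table, as in the loop above
for (m in list(month1, month2)) {
  dbWriteTable(conn, "mytablename", m, append = TRUE)
}

# Optional: index the lookup column so per-customer queries can avoid a full scan
dbExecute(conn, "CREATE INDEX idx_customer ON mytablename (CustomerID)")

# Parameterised query returning all rows for one customer
cust1 <- dbGetQuery(conn,
                    "SELECT * FROM mytablename WHERE CustomerID = ?",
                    params = list(1))
print(cust1)

dbDisconnect(conn)
```

With the data in one table, there may be no need to materialise 17,000 files at all: each customer's rows can be pulled on demand with a query like the last one.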

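If you prefer to stick with the approach from your post, I think you can speed it up dramatically by doing the write.csv by group while aggregating in data.table.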

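library(data.table)
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
    # write.csv() ignores append=TRUE (with a warning), so write.table() is used
    # here to append each month's rows to the customer's existing file
    fread(monthfile)[, {
        f <- paste0(CustomerID,".csv")
        is_new <- !file.exists(f) # write the header only when creating the file
        write.table(.SD, file=f, sep=",", row.names=FALSE, col.names=is_new, append=!is_new)
    }, by=CustomerID]
    cat("data for",monthfile,"written to csv\n")
}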

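This way you take advantage of data.table's fast unique and do the subsetting while grouping, which is also very fast. Below is a working example of the approach.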

library(data.table) 
data.table(a=1:4,b=5:6)[,write.csv(.SD,file=paste0(b,".csv")),b]

Update 2016-12-05:
Starting with data.table 1.9.8+ you can replace write.csv with fwrite; see the example in this answer.
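As a rough sketch of what the fwrite variant could look like (not the code from the linked answer), the following runs on two made-up month tables and writes to a temporary directory. The `cust` file prefix follows the question; the `Usage` column and `out_dir` are assumptions. `append` and `col.names` are set explicitly because fwrite's default header behaviour when appending has varied between data.table versions.

```r
library(data.table)

# Fresh temporary output directory (hypothetical stand-in for C:/custFiles)
out_dir <- tempfile("custFiles")
dir.create(out_dir)

# Two toy month tables standing in for fread(monthfile) (made-up values)
months <- list(
  data.table(CustomerID = c(1L, 2L, 1L), Month = 1L, Usage = c(10.5, 3.2, 4.0)),
  data.table(CustomerID = c(1L, 2L),     Month = 2L, Usage = c(11.0, 2.9))
)

for (month in months) {
  month[, {
    f <- file.path(out_dir, paste0("cust", CustomerID, ".csv"))
    is_new <- !file.exists(f)
    # Header only on first write; later months are appended below it.
    # .SD holds every column except the grouping column CustomerID.
    fwrite(.SD, f, append = !is_new, col.names = is_new)
  }, by = CustomerID]
}

# Customer 1 now has rows from both months in a single file
cust1 <- fread(file.path(out_dir, "cust1.csv"))
print(cust1)
```

Each month file is read once and each customer file is only ever appended to, so nothing is re-read or rewritten, which is where the original script spent most of its time.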

Source

2015-04-12 20:23:45 jangorecki

Great solutions, thank you. The second one is very clever; I hadn't thought of doing it that way. I'll try to figure out the RSQLite package. Thanks

I think you already have your answer. But to reinforce it, see the official documentation

R Data Import/Export

which states:

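In general, statistical systems like R are not particularly well suited to manipulations of large-scale data. Some other systems are better than R at this, and part of the thrust of this manual is to suggest that rather than duplicating functionality in R we can make another system do the work! (For example Therneau & Grambsch (2000) commented that they preferred to do data manipulation in SAS and then use package survival in S for the analysis.) Database manipulation systems are often very suitable for manipulating and extracting data: several packages to interact with DBMSs are discussed here.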

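So storage of massive data is clearly not R's primary strength, but it provides interfaces to several tools specialized for it. In my own work, the lightweight SQLite solution is enough, even if it is to some extent a matter of preference. Search for "drawbacks of using SQLite" and you probably won't find much to dissuade you.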

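You should find SQLite's documentation pretty easy to follow. If you have enough programming experience, doing a tutorial or two should get you going quickly on the SQL front. I don't see anything overly complicated going on in your code, so the most common and basic statements such as CREATE TABLE and SELECT ... WHERE will likely meet all your needs.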

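Edit

Another advantage of using a DBMS that I didn't mention is that you can have views, which make other data organization schemas, if one may say so, easily accessible. By creating views, you can get back to the "view by month" without having to rewrite any table or duplicate any data.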
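As an illustration of the idea, here is a small sketch that defines a per-month view over a combined table. The table contents, the `Usage` column and the view name `month1` are made up for the example; only the table name `mytablename` is borrowed from the other answer.

```r
library(DBI)
library(RSQLite)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy table holding all months together (made-up values)
all_months <- data.frame(
  CustomerID = c(1, 2, 1, 2),
  Month      = c(1, 1, 2, 2),
  Usage      = c(10.5, 3.2, 11.0, 2.9)
)
dbWriteTable(conn, "mytablename", all_months)

# A view is a stored query: no rows are copied, the table stays as it is
dbExecute(conn, "CREATE VIEW month1 AS SELECT * FROM mytablename WHERE Month = 1")

# The view can then be queried like an ordinary table
month1 <- dbGetQuery(conn, "SELECT * FROM month1")
print(month1)

dbDisconnect(conn)
```

The same table can back both a per-customer and a per-month layout, depending on which view or WHERE clause is used.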

Source

2015-04-12 19:06:04

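Great. I wasn't sure whether SQL was the right way to go, so the confirmation helps. Thanks for the reference.

Sure, np! I added some information about views that might interest you.

Very interesting. Given the work I'm doing, this looks like a very useful skill to have. Thanks!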
