2017-06-06 63 views

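Reading multiple files into multiple lists

I have just been handed a project with several large text files (each close to a gigabyte) that we want to load into tables and analyse. Each text file holds one year of data, and each data point belongs to one of three categories. The end result we want is one list per category, with a column for each of that category's observations.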

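The current approach is to read each file into a list, split that list by category into three new lists per year, and then combine all the lists for a given category across years into the final list. The R file I came up with is below (anonymised):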

Year1 <- read.table(YearOneFilePath) 
table(Year1$category) 
Year1A <- Year1[Year1$category == "A",] 
Year1B <- Year1[Year1$category == "B",] 
Year1C <- Year1[Year1$category == "C",] 
rm(Year1) 
Year2 <- read.table(YearTwoFilePath)
table(Year2$category) 
Year2A <- Year2[Year2$category == "A",] 
Year2B <- Year2[Year2$category == "B",] 
Year2C <- Year2[Year2$category == "C",] 
rm(Year2) 
Year3 <- read.table(YearThreeFilePath) 
table(Year3$category) 
Year3A <- Year3[Year3$category == "A",] 
Year3B <- Year3[Year3$category == "B",] 
Year3C <- Year3[Year3$category == "C",] 
rm(Year3) 


A <- rbind(Year1A, Year2A, Year3A) 
B <- rbind(Year1B, Year2B, Year3B) 
C <- rbind(Year1C, Year2C, Year3C) 

rm(Year1A) 
rm(Year2A) 
rm(Year3A) 
rm(Year1B) 
rm(Year2B) 
rm(Year3B) 
rm(Year1C) 
rm(Year2C) 
rm(Year3C) 

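It seems to me that this reads all the data from each file and then copies it twice while moving it around, which takes a lot of time and memory with this much data. Obviously I could avoid the YearXY lists by putting `YearX[YearX$category == "Y",]` directly into the `rbind` call, but that still means I hold two full copies at some point during execution. Is there a way to get from the files to the final A, B and C lists while reading each file only once and without making extra copies of all the data?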


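Are the actual objects really that big in R? Each file may be 1 GB on disk but take much less once loaded into R, especially if the values in each field are not very distinct. You can check the size in the Environment panel in RStudio, or with the `object.size` function.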
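A minimal sketch of that size check. The data frame here is a made-up stand-in for one year of data (the names `year1`, `value` and the row count are assumptions, not the asker's real data); only `object.size` comes from the comment above.

```r
# Hypothetical stand-in for one year's data: 1e5 rows, three categories.
set.seed(1)
year1 <- data.frame(
  category = sample(c("A", "B", "C"), 1e5, replace = TRUE),
  value    = rnorm(1e5),
  stringsAsFactors = FALSE
)

# In-memory footprint of the object, in bytes.
size_bytes <- as.numeric(object.size(year1))

# Same figure, printed in megabytes for readability.
print(object.size(year1), units = "MB")
```

Running the same two lines on the real `Year1` right after `read.table` would show whether the in-memory size is anywhere near the size on disk.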


Consider using `rbindlist` in your last step, it is faster: [link](https://stackoverflow.com/questions/15673550/why-is-rbindlist-better-than-rbind)
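For illustration, a small sketch of `rbindlist` from the data.table package used in place of `rbind`. The three tiny tables are invented placeholders for the per-year category "A" pieces; only the table names echo the question.

```r
library(data.table)

# Hypothetical per-year pieces for category "A" (made-up values).
Year1A <- data.table(year = 1L, category = "A", value = c(0.1, 0.2))
Year2A <- data.table(year = 2L, category = "A", value = 0.3)
Year3A <- data.table(year = 3L, category = "A", value = c(0.4, 0.5))

# rbindlist takes a list of tables and stacks them row-wise.
A <- rbindlist(list(Year1A, Year2A, Year3A))

# Drop the pieces once they have been combined.
rm(Year1A, Year2A, Year3A)
```

Unlike `rbind`, `rbindlist` takes a single list argument, so the pieces can also be collected in a list as they are created and bound at the end.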


Maybe [this](https://stackoverflow.com/questions/9573055/r-selecting-subset-without-copying) helps, if loading the table a few times is acceptable as long as you don't create copies.

Answer

library(data.table)

# fread is data.table's fast reader for large delimited files
Year1 <- fread(YearOneFilePath)
Year1[, .N, by = category]   # row count per category
Year1A <- Year1[category == "A"]
Year1B <- Year1[category == "B"]
Year1C <- Year1[category == "C"]
rm(Year1)
gc()
# YES, garbage collection may help ;)
# Repeat the steps above for Year2 and Year3, then combine each category:
A <- rbind(Year1A, Year2A, Year3A)
rm(Year1A, Year2A, Year3A)
gc()

Here is one more way to do the splitting:

# split() returns a list with one table per category, named by category value
split_list1 <- split(Year1, Year1$category)
Year1A <- split_list1[["A"]]
Year1B <- split_list1[["B"]]
Year1C <- split_list1[["C"]]

See also: split data table to small tables R
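Putting the pieces together, the whole job might be sketched roughly as follows. This is only an illustration under assumptions: the three small CSV files written to a temporary directory stand in for the real yearly files, and the names `paths`, `all_years` and `by_category` are made up for the example. It relies on `fread`, `rbindlist` (with its `idcol` argument) and the `by` argument of `split` for data.tables.

```r
library(data.table)

# Hypothetical stand-ins for the yearly files: three small CSVs in a temp dir.
paths <- file.path(tempdir(), paste0("year", 1:3, ".csv"))
for (i in seq_along(paths)) {
  fwrite(data.table(category = c("A", "B", "C", "A"), value = i * (1:4)), paths[i])
}

# Read each file once and stack the results; idcol records the source file index.
all_years <- rbindlist(lapply(paths, fread), idcol = "year")

# Split the combined table into one data.table per category in a single step.
by_category <- split(all_years, by = "category")
rm(all_years)
gc()

A <- by_category[["A"]]
B <- by_category[["B"]]
C <- by_category[["C"]]
```

Each file is read exactly once and no per-year, per-category intermediates are kept. Note that the combined table and the split pieces do coexist briefly until `rm(all_years)` runs, so this reduces copying rather than eliminating it.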