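ffbase provides the function ffdfdply to split and aggregate rows of data. This answer (https://stackoverflow.com/a/20954315/336311) explains how this works in principle. I still cannot figure out how to split by more than one column. How can I split/aggregate a large data frame (ffdf) by multiple columns?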
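My problem is that ffdfdply requires a single split variable, and it has to be unique for each combination of the two variables I want to split by. In my 4-column data frame (about 50M rows), however, building that variable as a character vector via paste() needs a huge amount of memory.
This is where I am stuck...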
require("ff")
require("ffbase")
load.ffdf(dir="ffdf.shares.02")
# Aggregation by articleID/measure
levels(ffshares$measure) # "comments", "likes", "shares", "totals", "tw"
# One key per articleID/measure combination (this is the memory-hungry step)
splitBy = paste(as.character(ffshares$articleID), ffshares$measure, sep="")
tmp = ffdfdply(ffshares, split=splitBy, FUN=function(x) {
  return(list(
    "articleID" = x[1,"articleID"],
    "measure" = x[1,"measure"],
    # I need vectors for each entry
    "sx" = unlist(x$value),
    "st" = unlist(x$time)
  ))
})
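Of course, I could use shorter levels for ffshares$measure or simply use numeric codes, but splitBy would still grow very large, so that does not solve the underlying problem.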
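To make the numeric-code idea concrete, here is a minimal sketch in base R on a small in-memory data frame. The toy data frame `shares` and the names `n_levels` and `split_key` are made up for illustration; nothing here touches ff or ffbase, and whether such a key is small enough for 50M rows is exactly the open question.

```r
# Toy stand-in for the real ffdf (hypothetical data, held in memory)
shares <- data.frame(
  articleID = c(41L, 41L, 41L, 42L, 42L, 41L),
  measure   = factor(c("shares", "tw", "likes", "shares", "tw", "shares"),
                     levels = c("comments", "likes", "shares", "totals", "tw")),
  value     = c(4, 24, 2, 0, 24, 4)
)

# One integer per (articleID, measure) pair: scale articleID by the number
# of factor levels and add the level code (1..5). No strings are created.
# Assumption: articleID * n_levels stays within the integer range.
n_levels  <- nlevels(shares$measure)
split_key <- shares$articleID * n_levels + as.integer(shares$measure)

# Distinct pairs map to distinct keys, identical pairs share a key
n_pairs <- nrow(unique(shares[, c("articleID", "measure")]))
n_keys  <- length(unique(split_key))
```

The key is an integer vector (4 bytes per row) rather than a character vector, so it is much smaller than the paste() result, but it is still one element per row.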
Sample data
articleID measure time value
100 41 shares 2015-01-03 23:20:34 4
101 41 tw 2015-01-03 23:30:30 24
102 41 totals 2015-01-03 23:30:38 6
103 41 likes 2015-01-03 23:30:38 2
104 41 comments 2015-01-03 23:30:38 0
105 41 shares 2015-01-03 23:30:38 4
106 41 tw 2015-01-03 23:40:24 24
107 41 totals 2015-01-03 23:40:35 6
108 41 likes 2015-01-03 23:40:35 2
...
1000 42 shares 2015-01-04 20:10:50 0
1001 42 tw 2015-01-04 21:10:45 24
1002 42 totals 2015-01-04 21:10:35 0
1003 42 likes 2015-01-04 21:10:35 0
1004 42 comments 2015-01-04 21:10:35 0
1005 42 shares 2015-01-04 21:10:35 0
1006 42 tw 2015-01-04 22:10:45 24
1007 42 totals 2015-01-04 22:10:43 0
1008 42 likes 2015-01-04 22:10:43 0
...
Can you provide sample data? –
You're welcome. It is very simple data - there is just a lot of it :) – BurninLeo