2016-03-03

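ffbase provides the function ffdfdply to split and aggregate data rows. This answer (https://stackoverflow.com/a/20954315/336311) explains how this basically works. I still cannot figure out how to split by multiple columns. How can I split/aggregate a large data frame (ffdf) by multiple columns?

My challenge is that a split variable is required. It must be unique for each combination of the two variables I want to split by. Yet in my 4-column data frame (about 50M rows), creating a character vector via paste() takes a lot of memory.

This is where I am stuck...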

require("ff")
require("ffbase")
load.ffdf(dir = "ffdf.shares.02")

# Aggregation by articleID/measure
levels(ffshares$measure)  # "comments", "likes", "shares", "totals", "tw"

# Builds the full character key in RAM; this is the memory-hungry step
splitBy <- paste(as.character(ffshares$articleID), ffshares$measure, sep = "")

tmp <- ffdfdply(ffshares, split = splitBy, FUN = function(x) {
  return(list(
    "articleID" = x[1, "articleID"],
    "measure" = x[1, "measure"],
    # I need vectors for each entry
    "sx" = unlist(x$value),
    "st" = unlist(x$time)
  ))
})

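Of course, I could use shorter levels for ffshares$measure or simply use numeric codes, but that would not solve the underlying problem that splitBy grows very large.

Sample data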
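
A minimal sketch of the numeric-code idea, using small in-RAM vectors rather than the real ffdf. The names `n_levels` and `split_key` are made up for illustration, and the sketch assumes articleID is a non-negative integer small enough that the product stays within R's 32-bit integer range. It only shows that a unique key per articleID/measure pair can be computed arithmetically, without paste(); whether ffdfdply accepts such a key directly as `split` is not verified here.

```r
# Toy in-RAM vectors standing in for the ffdf columns (values are illustrative).
articleID <- c(41L, 41L, 41L, 42L, 42L)
measure <- factor(c("shares", "tw", "shares", "shares", "likes"),
                  levels = c("comments", "likes", "shares", "totals", "tw"))

# Factor codes run from 1 to n_levels, so shifting them to 0..(n_levels - 1)
# and adding articleID * n_levels gives one distinct integer per pair.
n_levels <- nlevels(measure)
split_key <- articleID * n_levels + as.integer(measure) - 1L

# The key can be decoded again, which shows no two pairs collide.
decoded_article <- split_key %/% n_levels
decoded_measure <- levels(measure)[split_key %% n_levels + 1L]
```

An integer key needs 4 bytes per row, whereas a character vector built with paste() holds one string per row, which is where the memory pressure in the question comes from.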

      articleID   measure                 time  value
100          41    shares  2015-01-03 23:20:34      4
101          41        tw  2015-01-03 23:30:30     24
102          41    totals  2015-01-03 23:30:38      6
103          41     likes  2015-01-03 23:30:38      2
104          41  comments  2015-01-03 23:30:38      0
105          41    shares  2015-01-03 23:30:38      4
106          41        tw  2015-01-03 23:40:24     24
107          41    totals  2015-01-03 23:40:35      6
108          41     likes  2015-01-03 23:40:35      2
...
1000         42    shares  2015-01-04 20:10:50      0
1001         42        tw  2015-01-04 21:10:45     24
1002         42    totals  2015-01-04 21:10:35      0
1003         42     likes  2015-01-04 21:10:35      0
1004         42  comments  2015-01-04 21:10:35      0
1005         42    shares  2015-01-04 21:10:35      0
1006         42        tw  2015-01-04 22:10:45     24
1007         42    totals  2015-01-04 22:10:43      0
1008         42     likes  2015-01-04 22:10:43      0
...
Can you provide sample data? –

You're welcome. It is very simple data, just a lot of it :) – BurninLeo

Answer

# Build the split key in chunks of 100000 records, so the data never has to
# fit into RAM all at once
ffshares$splitBy <- with(ffshares[c("articleID", "measure")],
                         paste(articleID, measure, sep = ""),
                         by = 100000)
length(levels(ffshares$splitBy))  # how many levels there are (not stated in the question)

tmp <- ffdfdply(ffshares, split = ffshares$splitBy, FUN = function(x) {
  ## x is a data.frame in RAM holding all records of possibly several
  ## articleID/measure combinations.
  ## Write a function that returns a data.frame. For example, the following
  ## returns the mean value and the first and last timepoint by articleID/measure.
  x <- data.table::setDT(x)
  xagg <- x[, list(value = mean(value),
                   first.timepoint = min(time),
                   last.timepoint = max(time)),
            by = list(articleID, measure)]
  ## FUN should return a data.frame, as indicated in the help of ffdfdply, not a list
  data.table::setDF(xagg)
})
## tmp is an ffdf
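
Because FUN only ever sees an ordinary data.frame, it can be tried out on a few rows before running it on the full ffdf. The following is a sketch of that idea in base R; it swaps the data.table aggregation for split() and lapply() so that it runs without extra packages. The names `chunk` and `summarise_chunk` are hypothetical, and the rows are taken from the sample data in the question.

```r
# Small in-RAM stand-in for one chunk that ffdfdply() would hand to FUN.
chunk <- data.frame(
  articleID = c(41L, 41L, 41L, 42L),
  measure   = factor(c("shares", "tw", "shares", "shares")),
  time      = as.POSIXct(c("2015-01-03 23:20:34", "2015-01-03 23:30:30",
                           "2015-01-03 23:30:38", "2015-01-04 20:10:50"),
                         tz = "UTC"),
  value     = c(4, 24, 4, 0)
)

# Returns one row per articleID/measure combination found in the chunk.
summarise_chunk <- function(x) {
  # drop = TRUE skips combinations that do not occur in this chunk
  groups <- split(x, list(x$articleID, x$measure), drop = TRUE)
  rows <- lapply(groups, function(g) {
    data.frame(
      articleID       = g$articleID[1],
      measure         = g$measure[1],
      value           = mean(g$value),
      first.timepoint = min(g$time),
      last.timepoint  = max(g$time)
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out  # a plain data.frame, not a list
}

result <- summarise_chunk(chunk)
```

Here `result` has three rows, one for each combination present in `chunk`. A function that passes this kind of check could then be supplied as FUN, though performance on 50M rows is a separate question.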
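
Well, both the paste() and the ffdfdply() commands kept R busy for a while. That is probably due to the 400,000 levels in my data. Nevertheless, your solution works, thank you very much! – BurninLeo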