R data.frame flow data preprocessing for aggregated time statistics

What is the most efficient way to process a flow data.frame such as this one

> df <- data.frame(amount=c(4,3,1,1,4,5,9,13,1,1), size=c(164,124,131,315,1128,331,1135,13589,164,68), tot=1, first=c(1,1,3,3,2,2,2,2,4,4), secs=c(2,2,0,0,1,1,1,1,0,0)) 
> df 
   amount  size tot first secs
1       4   164   1     1    2
2       3   124   1     1    2
3       1   131   1     3    0
4       1   315   1     3    0
5       4  1128   1     2    1
6       5   331   1     2    1
7       9  1135   1     2    1
8      13 13589   1     2    1
9       1   164   1     4    0
10      1    68   1     4    0

into per-time aggregated data like this

> df2 
  time tot amount  size
1    1   2    3.5   144
2    2   6   34.5 16327
3    3   8   36.5 16773
4    4   2    2.0   232

..using R, when the actual data set can exceed 100 million rows or even tens of gigabytes?

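The column first marks the start of a flow of duration secs, with the metrics amount, size and tot. In the aggregated result, size and amount are split evenly (as doubles) across the time range, while tot is summed as an integer into each time slot. The duration secs tells how many time slots the flow spans in addition to first: if secs is 1 and first is 5, the flow occupies time slots 5 and 6.

My current implementation uses an ugly and dead-slow for loop, which is not an option. You could probably optimize it a lot and get decent performance with loops, but I bet better algorithms exist. Perhaps you could somehow expand/duplicate the rows with secs > 0, incrementing the first (timestamp) value of the expanded rows and adjusting the amount, size and tot metrics on the fly: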

Now the original data..

  amount size tot first secs
1      4  164   1     1    0
2      4  164   1     1    1
3      3  124   1     1    2


magically becomes

  amount  size tot first
1      4   164   1     1
2      2    82   1     1
3      2    82   1     2
4      1 41.33   1     1
5      1 41.33   1     2
6      1 41.33   1     3
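
For illustration, the expansion in this small example can be sketched in base R roughly as follows. This is only a sketch under the assumption that each metric is divided by the number of slots covered (secs + 1), which is what the toy table above shows; the names orig, n, idx and expanded are made up for the example.

```r
# Toy input from the example above
orig <- data.frame(amount = c(4, 4, 3),
                   size   = c(164, 164, 124),
                   tot    = 1,
                   first  = c(1, 1, 1),
                   secs   = c(0, 1, 2))

n   <- orig$secs + 1                         # slots covered by each flow
idx <- rep(seq_len(nrow(orig)), times = n)   # repeat row i n[i] times

expanded <- data.frame(
  amount = orig$amount[idx] / n[idx],        # split evenly over the slots
  size   = orig$size[idx]   / n[idx],
  tot    = orig$tot[idx],                    # tot is copied, not split
  first  = orig$first[idx] + sequence(n) - 1 # offsets 0..secs within each flow
)

print(expanded)
```

Here rep() duplicates the rows and sequence() numbers the copies within each flow, so no explicit loop is needed. The expanded table is still held fully in memory.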

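After this preprocessing step, aggregating with plyr ddply would be trivial, in efficient parallel mode of course.

All the examples of ddply, apply and similar functions that I was able to find operate on a per-row or per-column basis, which makes it hard to modify other rows. Hopefully I don't have to rely on awk magic.

Update: When the expansion is done "as is", the algorithm above can easily exhaust your memory. Some kind of "on the fly" calculation is therefore preferred, one that does not map everything into memory. Mattrition's answer is nevertheless correct and helped a lot, so it is marked as the accepted answer.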
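
One possible shape for such an "on the fly" calculation, given here only as a rough sketch and not as something tested at the 100-million-row scale: loop over the slot offset k = 0..max(secs) instead of over rows, and add the contribution of every flow that is still running at offset k straight into a small accumulator. The expanded table is never built. The sketch assumes the accepted answer's convention of dividing by secs with 0 replaced by 1, since that is what reproduces df2; the names div, vals, slots and acc are made up for the example.

```r
df <- data.frame(amount = c(4, 3, 1, 1, 4, 5, 9, 13, 1, 1),
                 size   = c(164, 124, 131, 315, 1128, 331, 1135, 13589, 164, 68),
                 tot    = 1,
                 first  = c(1, 1, 3, 3, 2, 2, 2, 2, 4, 4),
                 secs   = c(2, 2, 0, 0, 1, 1, 1, 1, 0, 0))

# Per-slot contribution of each flow (assumed convention: divide by secs, 0 -> 1)
div  <- pmax(df$secs, 1)
vals <- cbind(tot = df$tot, amount = df$amount / div, size = df$size / div)

# Accumulator with one row per time slot
slots <- seq(min(df$first), max(df$first + df$secs))
acc   <- matrix(0, nrow = length(slots), ncol = 3,
                dimnames = list(as.character(slots), colnames(vals)))

for (k in 0:max(df$secs)) {
  live <- df$secs >= k                        # flows still running at offset k
  part <- rowsum(vals[live, , drop = FALSE],  # sum contributions per time slot
                 group = df$first[live] + k)
  rows <- rownames(part)
  acc[rows, ] <- acc[rows, , drop = FALSE] + part
}

df2 <- data.frame(time = slots, acc, row.names = NULL)
print(df2)
```

The loop runs max(secs) + 1 times regardless of the number of rows, and the extra memory is a few vectors of the input's length plus the accumulator. If the input itself does not fit in RAM, it would still have to be read in pieces.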

Source

2014-05-15 ylijumala

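You need to explain in simple terms how to get from the input to the output shown in the question. – Roland

Your current implementation also has some syntax and other errors. As it stands, it doesn't produce any output. I can't even see how it would produce your suggested output, since you never assign anything to a column called "time". – MattLBeck

Yes. Added a brief explanation and the "time" assignment. – ylijumala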

Here is an implementation using data.table. I chose data.table for its aggregation capabilities, but it is also a nice and efficient class.

library(data.table) 

dt <- as.data.table(df) 

# Using the "expand" solution linked in the Q. 
# +1 to secs to allow room for 0-values 
dtr <- dt[rep(seq.int(1, nrow(dt)), secs+1)] 

# Create a new seci column that enumerates sec for each row of dt 
dtr[,seci := dt[,seq(0,secs),by=1:nrow(dt)][,V1]] 

# All secs that equal 0 are changed to 1 for later division 
dtr[secs==0, secs := 1] 

# Create time (first+seci) and adjusted amount and size columns 
dtr[,c("time", "amount2", "size2") := list(first+seci, amount/secs, size/secs)] 

# Aggregate selected columns (tot, amount2, and size2) by time 
dtr.a <- dtr[,list(tot=sum(tot), amount=sum(amount2), size=sum(size2)), by=time] 


dtr.a 
   time tot amount  size
1:    1   2    3.5   144
2:    2   6   34.5 16327
3:    3   8   36.5 16773
4:    4   2    2.0   232

Source

2014-05-15 12:13:05 MattLBeck

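Thanks for your response, this is a great introduction to data.table basics! Unfortunately, in my case the first line 'dtr <- dt[rep(seq.int(1, nrow(dt)), secs+1)]' is too resource-intensive: the whole data set has about 100 000 000 rows, so I ran out of memory: 'Error: cannot allocate vector of size 6.8 Gb'. I'll try splitting the set into smaller chunks later. – ylijumala

Sounds like you need something more sophisticated than generating the full-size data set first... – MattLBeck

Yes. I worked around it by using a server with more RAM. You could easily write a loop that saves the expanded values to separate files and then aggregates the statistics from there. – ylijumala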
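
The chunking idea mentioned in these comments could look roughly like the base-R sketch below: expand and aggregate one chunk at a time, keep only the small per-chunk aggregates, and sum those at the end. It is an illustration only. The helper process_chunk and the value of chunk_size are hypothetical, the chunks are cut from an in-memory df rather than read from separate files, and the division follows the answer's convention (secs, with 0 replaced by 1).

```r
df <- data.frame(amount = c(4, 3, 1, 1, 4, 5, 9, 13, 1, 1),
                 size   = c(164, 124, 131, 315, 1128, 331, 1135, 13589, 164, 68),
                 tot    = 1,
                 first  = c(1, 1, 3, 3, 2, 2, 2, 2, 4, 4),
                 secs   = c(2, 2, 0, 0, 1, 1, 1, 1, 0, 0))

# Hypothetical helper: expand one chunk and aggregate it per time slot
process_chunk <- function(chunk) {
  n   <- chunk$secs + 1                          # slots covered by each flow
  idx <- rep(seq_len(nrow(chunk)), times = n)    # row index of each expanded row
  div <- pmax(chunk$secs, 1)[idx]                # divisor as in the answer
  expanded <- data.frame(
    time   = chunk$first[idx] + sequence(n) - 1,
    tot    = chunk$tot[idx],
    amount = chunk$amount[idx] / div,
    size   = chunk$size[idx] / div
  )
  aggregate(cbind(tot, amount, size) ~ time, data = expanded, FUN = sum)
}

chunk_size <- 4                                  # illustrative value only
groups     <- ceiling(seq_len(nrow(df)) / chunk_size)
partials   <- lapply(split(df, groups), process_chunk)

# Sums are additive, so per-chunk aggregates can simply be summed again
result <- aggregate(cbind(tot, amount, size) ~ time,
                    data = do.call(rbind, partials), FUN = sum)
print(result)
```

Only one expanded chunk exists at any time, so peak memory is governed by chunk_size rather than by the full data set.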
