2014-11-24 46 views
4

I can't find a way in R (using data.table) to group data by custom ranges (for example, under 18, 18-25, ..., 65+) instead of by single values.

I currently use:

DT[, list(M_Savings = mean(Savings), M_Term = mean(Term)), by = Age][order(Age)]

This gives me the following result:

Age  M_Savings M_Term 
1: 18  6500  5.5 
2: 19  7000  6.2 
3: 20  7200  5.8 
... 
50: 68  4000  4.2 

Desired result:

Age  M_Savings M_Term 
1: 18-25 7450  5.5 
2: 25-30 8320  6.2 
... 
50: 65+  3862  4.3 

I hope my explanation is clear enough. Any help would be greatly appreciated.

+2

Have you tried using 'cut'? – jdharrison 2014-11-24 14:46:01

+0

Or 'findInterval' (should be faster). – mnel 2014-11-24 22:54:28
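For reference, a minimal base R sketch of the 'findInterval' approach mentioned in the comment above. The break points, the label names and the sample ages are assumptions made up for illustration; the speed claim is not measured here.

```r
# Assumed bin edges and labels (not from the question's data)
breaks <- c(0, 18, 25, 35, 45, 65, Inf)
labels <- c("-18", "18-25", "25-35", "35-45", "45-65", "65+")

# A few illustrative ages, including values that sit exactly on a break
age <- c(15, 18, 24, 25, 44, 65, 70)

# findInterval returns, for each age, the index i such that
# breaks[i] <= age < breaks[i + 1]
idx <- findInterval(age, breaks)

# Turn the indices into an ordered set of group labels
age_group <- factor(labels[idx], levels = labels)

# cut(..., right = FALSE) uses the same left-closed intervals
same <- identical(as.integer(cut(age, breaks, right = FALSE)), idx)

print(data.frame(age, age_group))
```

In data.table the same expression can be used for grouping, e.g. by = list(Age = labels[findInterval(Age, breaks)]).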

Answers

7

@jdharrison is right: cut(...) is the way to go.

library(data.table) 
# create sample - you have this already 
set.seed(1) # for reproducibility 
DT <- data.table(age=sample(15:70,1000,replace=TRUE), 
       value=rpois(1000,10)) 

# you start here... 
breaks <- c(0,18,25,35,45,65,Inf) 
DT[,list(mean=mean(value)),by=list(age=cut(age,breaks=breaks))][order(age)] 
#   age  mean 
# 1: (0,18] 10.000000 
# 2: (18,25] 9.579365 
# 3: (25,35] 10.158192 
# 4: (35,45] 9.775510 
# 5: (45,65] 9.969697 
# 6: (65,Inf] 10.141414 
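If the group names should look like the ones in the question ("18-25", "65+") rather than the default interval notation, cut() also accepts a labels argument. The sketch below assumes left-closed bins, so that an age of exactly 18 lands in "18-25"; the label strings and the extra count column n are my own choices, not part of the original answer.

```r
library(data.table)

set.seed(1)  # for reproducibility
DT <- data.table(age   = sample(15:70, 1000, replace = TRUE),
                 value = rpois(1000, 10))

# Assumed bins: [0,18), [18,25), [25,35), [35,45), [45,65), [65,Inf)
breaks <- c(0, 18, 25, 35, 45, 65, Inf)
labels <- c("-18", "18-25", "25-35", "35-45", "45-65", "65+")

# right = FALSE closes each interval on the left instead of the right
res <- DT[, list(mean = mean(value), n = .N),
          by = list(age = cut(age, breaks = breaks, labels = labels,
                              right = FALSE))][order(age)]
print(res)
```

Note that with the default right = TRUE, as in the code above, an age of 18 falls into (0,18] and not into the 18-25 group, so pick whichever boundary convention matches your definition of the ranges.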
+0

I can't thank you enough for your answer - cut() is exactly what I was looking for! Thanks! – Itanium 2014-11-25 07:45:05

0

Here is an example with a numeric variable, which produces the following:

  test   BucketName 
1 615.59148  01. 0 - 5,000 
2 1135.42357  01. 0 - 5,000 
3 5302.24208 02. 5,000 - 10,000 
4 3794.23109  01. 0 - 5,000 
5 2773.70667  01. 0 - 5,000 
... 

And the code is:

generateLabelsForPivot = function(breaksVector) 
{ 

    startValue = min(breaksVector) 
    lastValue = max(breaksVector) 

    lengthOfBreaks = length(breaksVector) 
    orders   = seq(1, lengthOfBreaks-1, 1) 
    startingPoints = c(breaksVector[-length(breaksVector)]) 
    finishPoints = c(breaksVector[-1]) 

    addingZeros = function(X) 
    { 
     prefix = "" 

     if(nchar(X) == 1) 
     { 
      prefix = "0" 
     } else { 
      prefix = "" 
     } 

     return(paste(prefix, X, ". ", sep = "")) 
    } 

    orderPrefixes = sapply(orders, addingZeros) 
    startingPoints.pretty = prettyNum(startingPoints, scientific=FALSE, big.mark=",", preserve.width = "none") 
    finishPoints.pretty = prettyNum(finishPoints, scientific=FALSE, big.mark=",", preserve.width = "none") 
    labels = paste(orderPrefixes, startingPoints.pretty, " - ", finishPoints.pretty, sep = "") 
    return(labels) 
} 


dataFrame = data.frame(test = runif(100, 0, 100*100)) 

GCV_breaks = c(0, 5000, 10000, 20000, 30000, 1000000) 
GCV_labels = generateLabelsForPivot(GCV_breaks) 
GCV_labels 
GCV_buckets = cut(dataFrame$test, breaks = GCV_breaks, labels = GCV_labels) 

dataFrame$BucketName = GCV_buckets 
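The same label format can probably be produced more compactly with sprintf() and formatC(). This is only a sketch of an alternative: make_bucket_labels is a hypothetical helper name, and it assumes the breaks are whole numbers sorted in increasing order.

```r
# Hypothetical helper: builds labels such as "01. 0 - 5,000"
# from a sorted vector of whole-number break points
make_bucket_labels <- function(breaks) {
  n     <- length(breaks) - 1
  lower <- formatC(breaks[-length(breaks)], format = "d", big.mark = ",")
  upper <- formatC(breaks[-1],              format = "d", big.mark = ",")
  # %02d zero-pads the running bucket number to two digits
  sprintf("%02d. %s - %s", seq_len(n), lower, upper)
}

GCV_breaks <- c(0, 5000, 10000, 20000, 30000, 1000000)
GCV_labels <- make_bucket_labels(GCV_breaks)

set.seed(1)
dataFrame <- data.frame(test = runif(100, 0, 100 * 100))
dataFrame$BucketName <- cut(dataFrame$test, breaks = GCV_breaks,
                            labels = GCV_labels)

# Number of observations per bucket
print(table(dataFrame$BucketName))
```

The numeric prefix keeps the buckets in the intended order when they are later sorted as plain text, for example in a pivot table.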