ddply for sum by group in R

I have a sample data frame "data" as follows:

X   Y Month Year income 
2281205 228120 3 2011 1000 
2281212 228121 9 2010 1100 
2281213 228121 12 2010 900 
2281214 228121 3 2011 9000 
2281222 228122 6 2010 1111 
2281223 228122 9 2010 3000 
2281224 228122 12 2010 1889 
2281225 228122 3 2011 778 
2281243 228124 12 2010 1111 
2281244 228124 3 2011 200 
2281282 228128 9 2010 7889 
2281283 228128 12 2010 2900 
2281284 228128 3 2011 3400 
2281302 228130 9 2010 1200 
2281303 228130 12 2010 2000 
2281304 228130 3 2011 1900 
2281352 228135 9 2010 2300 
2281353 228135 12 2010 1333 
2281354 228135 3 2011 2340

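I want to use ddply to compute the sum of income for each Y (not X), but only if I have four observations for that Y (for example, Y = 228122, with months 6, 9 and 12 of 2010 and month 3 of 2011). If I have fewer than four observations (for example, Y = 228130), I want to simply ignore it. I used the following commands in R for this purpose: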

require(plyr)
# the data are in the data csv file
data <- read.csv("data.csv")
# convert Y (integers) into factors
data$Y <- as.factor(data$Y)
# get the count of each unique Y
count <- ddply(data, .(Y), summarize, freq = length(Y))
# get the sum of each unique Y
sum <- ddply(data, .(Y), summarize, tot = sum(income))
# show the sum only if the number of observations for each Y is at least 4
colbind <- cbind(count, sum)
finalsum <- subset(colbind, freq > 3)

My output is as follows:

>colbind 
     Y freq  Y tot 
1 228120 1 228120 1000 
2 228121 3 228121 11000 
3 228122 4 228122 6778 
4 228124 2 228124 1311 
5 228128 3 228128 14189 
6 228130 3 228130 5100 
7 228135 3 228135 5973 
>finalsum 
     Y freq Y.1 tot 
3 228122 4 228122 6778

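The code above works, but it takes a lot of steps. So I would like to know whether there is a simpler way to perform this task (using the plyr package).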

2012-12-26 Metrics

You can create both the 'freq' and 'tot' variables in one go with 'summarise', and you probably don't need to convert Y to a factor. – baptiste

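As the comment points out, you can do multiple operations inside summarize.

This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator: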

x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income)) 
x[x$freq > 3, ] 

     Y freq tot 
3 228122 4 6778
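
If you want to try this without the CSV file, here is a minimal self-contained sketch. It assumes only the Y and income columns matter for the calculation, so it rebuilds just those two from the sample in the question; the name `complete` is mine, not from the original code. Y is left as a plain number, in line with the comment that the factor conversion is probably unnecessary.

```r
library(plyr)

# Y and income from the question's sample, rebuilt inline
# (X, Month and Year are left out because the calculation does not use them)
data <- data.frame(
  Y = c(228120, 228121, 228121, 228121, 228122, 228122, 228122, 228122,
        228124, 228124, 228128, 228128, 228128, 228130, 228130, 228130,
        228135, 228135, 228135),
  income = c(1000, 1100, 900, 9000, 1111, 3000, 1889, 778, 1111, 200,
             7889, 2900, 3400, 1200, 2000, 1900, 2300, 1333, 2340)
)

# One ddply() call: count and total per Y
x <- ddply(data, .(Y), summarize, freq = length(Y), tot = sum(income))

# Keep only the groups with more than three observations
complete <- x[x$freq > 3, ]
print(complete)
```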

This is also very easy with the data.table package:

library(data.table) 
data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3] 
     Y freq tot 
1: 228122 4 6778

In fact, counting the rows in each group has its own shortcut in data.table: use .N:

data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3] 
     Y freq tot 
1: 228122 4 6778

2012-12-26 04:16:39 Andrie

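Thanks. I ran both my code and yours on my extended sample, where N (the number of observations) is around 35,000. Either version takes about 200 seconds to run. Is this normal for the ddply function? – Metrics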

Yes. 'plyr' is very convenient but can be slow, especially compared with 'data.table'. – Andrie
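
To see the speed difference from the comments above on your own machine, something like the following rough sketch may help. The data are synthetic and entirely hypothetical (35,000 rows, with group sizes chosen arbitrarily), so the timings will not match the 200 seconds reported here and will vary with hardware and package versions.

```r
library(plyr)
library(data.table)

set.seed(1)
n <- 35000

# Hypothetical synthetic data: n rows spread over roughly n/4 groups
sim <- data.frame(
  Y = sample(seq_len(n %/% 4), n, replace = TRUE),
  income = round(runif(n, 100, 10000))
)

# Time the plyr version
t_plyr <- system.time(
  res_plyr <- ddply(sim, .(Y), summarize, freq = length(Y), tot = sum(income))
)

# Time the data.table version
t_dt <- system.time(
  res_dt <- as.data.table(sim)[, list(freq = .N, tot = sum(income)), by = Y]
)

# Elapsed seconds for each approach
print(c(plyr = t_plyr[["elapsed"]], data.table = t_dt[["elapsed"]]))
```

Both calls compute the same per-group counts and totals, so any gap in the elapsed times comes from the packages rather than from the task.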

I find the dplyr package faster than plyr::ddply, and more elegant.

testData <- read.table(file = "clipboard",header = TRUE) 
require(dplyr) 
testData %>% 
    group_by(Y) %>% 
    summarise(total = sum(income),freq = n()) %>% 
    filter(freq > 3)
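
Reading from "clipboard" only works if the table has been copied first (and, as far as I know, only on some platforms), so here is a self-contained variant of the same pipeline. It is a sketch that embeds just two of the question's groups as text, one with four observations and one with three; the name `result` is my own.

```r
library(dplyr)

# A subset of the question's sample, embedded as text instead of the clipboard
testData <- read.table(header = TRUE, text = "
X Y Month Year income
2281222 228122 6 2010 1111
2281223 228122 9 2010 3000
2281224 228122 12 2010 1889
2281225 228122 3 2011 778
2281302 228130 9 2010 1200
2281303 228130 12 2010 2000
2281304 228130 3 2011 1900
")

# Total and count per Y, then keep groups with more than three observations
result <- testData %>%
  group_by(Y) %>%
  summarise(total = sum(income), freq = n()) %>%
  filter(freq > 3)
print(result)
```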

2014-08-12 09:33:16 HatMatrix
