2013-03-21 74 views
2

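(Sorry for the odd title, I just couldn't think of a shorter way to put this) Get the number of rows with a positive binary value in one column and the same value in another column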

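Since I managed to oversimplify my problem when asking my last question, this time I'm presenting the actual problem.

The data frame provided contains the columns "usr", "usrMsgCnt" and "isRefound", where usr is a name, usrMsgCnt is a number and isRefound is binary.

A new column is to be added, whose value is calculated as follows:

usrMsgCnt / number of rows where usr equals that row's usr and isRefound equals 1

For the first row of the example data, the new value would be:

9/5, with the 5 resulting from length(data$usr[data$usr == "Jan.Schrader" & data$isRefound == 1])

Looping through this is not an option, given the size of the original data set.

Here is the dput of a small chunk of the data: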

structure(list(usr = structure(c(21L, 21L, 21L, 21L, 6L, 5L, 
6L, 6L, 6L, 21L, 20L, 21L, 6L, 20L, 21L, 21L, 21L, 6L, 6L, 6L 
), .Label = c("alsmith", "Amanda.Coles", "Andrew.Coles", "babsimieth", 
"Bernd.Ludwig", "Bernhard.Schiemann", "bfueck", "Bram.Ridder", 
"brian.tripney", "carlosgardeazabal", "christine.elsweiler", 
"cmfinner", "daniel.goncalves", "david", "de56", "eko.ma", "freundlu", 
"gmcphail", "ian.ferguson", "Ian.Ruthven", "Jan.Schrader", "jearmour", 
"jyang", "Laura.Schnall", "Marc.Roper", "marek.maleika", "Martin.Hacker", 
"martin.scholz", "maziminke", "mclanger", "Michael.Cashmore", 
"morgan.harvey", "mrussell", "msherrif", "murray.wood", "Nadine.Mahrholz", 
"noam.ascher", "pburns", "Peter.Gregory", "raina", "robertnm", 
"ronald.teijeira", "ronaldtf", "sbenus", "starmstr", "steve.neely", 
"Sven.Friedemann", "tinchen"), class = "factor"), usrMsgCnt = c(9L, 
9L, 9L, 9L, 5L, 0L, 5L, 5L, 5L, 9L, 0L, 9L, 5L, 0L, 9L, 9L, 9L, 
37L, 37L, 37L), isRefound = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 
1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L)), .Names = c("usr", 
"usrMsgCnt", "isRefound"), row.names = c(NA, 20L), class = "data.frame") 
+1

Perhaps, to clear up any ambiguity, you could post what you would expect as output for the subset of data you've shared here. – A5C1D2H2I1M1N2O1R2T1 2013-03-21 19:32:31

+0

Yes, you're right, give me a minute – Rickyfox 2013-03-21 19:33:27

Answers

6

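Assuming isRefound is actually binary (only 0 and 1), the per-user sum of isRefound equals the number of rows with isRefound == 1: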

library(data.table) 
DT <- data.table(DF,key="usr") 

DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr] 

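Edit: If the row order is essential, you should not set a key (which sorts the data.table) and you should create an index variable (to be on the safe side).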

DT <- data.table(DF) 
DT[,id:=.I] 
DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr] 
print(DT) 

#     usr usrMsgCnt isRefound id newvar 
# 1:  Jan.Schrader   9   0 1 1.8 
# 2:  Jan.Schrader   9   1 2 1.8 
# 3:  Jan.Schrader   9   1 3 1.8 
# 4:  Jan.Schrader   9   1 4 1.8 
# 5: Bernhard.Schiemann   5   1 5 1.0 
# 6:  Bernd.Ludwig   0   0 6 NaN 
# 7: Bernhard.Schiemann   5   0 7 1.0 
# 8: Bernhard.Schiemann   5   1 8 1.0 
# 9: Bernhard.Schiemann   5   1 9 1.0 
# 10:  Jan.Schrader   9   1 10 1.8 
# 11:  Ian.Ruthven   0   0 11 NaN 
# 12:  Jan.Schrader   9   0 12 1.8 
# 13: Bernhard.Schiemann   5   1 13 1.0 
# 14:  Ian.Ruthven   0   0 14 NaN 
# 15:  Jan.Schrader   9   0 15 1.8 
# 16:  Jan.Schrader   9   0 16 1.8 
# 17:  Jan.Schrader   9   1 17 1.8 
# 18: Bernhard.Schiemann  37   0 18 7.4 
# 19: Bernhard.Schiemann  37   1 19 7.4 
# 20: Bernhard.Schiemann  37   0 20 7.4 
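To illustrate why the key matters for row order, here is a minimal sketch on a small toy data frame (the values in `toy` are made up for illustration and are not the asker's data). It assumes the data.table package is installed, and shows that passing `key = "usr"` sorts the rows by `usr`, while the unkeyed table keeps the input order.

```r
library(data.table)

# Hypothetical toy data, deliberately not sorted by usr.
toy <- data.frame(
  usr       = c("b", "a", "b", "a"),
  usrMsgCnt = c(4L, 6L, 4L, 6L),
  isRefound = c(1L, 1L, 0L, 1L)
)

# Setting a key at creation physically sorts the rows by the key column.
keyed <- data.table(toy, key = "usr")

# Without a key, rows stay in their original order.
unkeyed <- data.table(toy)
unkeyed[, id := .I]                                        # remember the original row number
unkeyed[, newvar := usrMsgCnt / sum(isRefound), by = usr]  # grouped assignment by reference

print(keyed)
print(unkeyed)

# If the table is reordered later, the id column restores the original order.
restored <- unkeyed[order(id)]
```

The `id` column is only a safeguard: the grouped `:=` assignment itself does not reorder the rows of an unkeyed table.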

The same conceptual approach can be used with the base R and plyr approaches demonstrated at your previous question:

within(DF, { 
    newvar <- usrMsgCnt/ave(isRefound, usr, FUN = sum) 
}) 

library(plyr) 
ddply(DF, .(usr), transform, 
     newvar = usrMsgCnt/sum(isRefound)) 

However, the data.table package will perform better on huge data sets.
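For the base R variant, a small self-contained sketch may help show what `ave()` returns. The toy data frame below is hypothetical (not the asker's data). It checks that the per-user sum comes back once per row, in the original row order, and that a user with no refound rows ends up with `NaN` from dividing 0 by 0.

```r
# Hypothetical toy data; user "c" has no rows with isRefound == 1.
toy <- data.frame(
  usr       = c("a", "b", "a", "c", "a"),
  usrMsgCnt = c(6L, 4L, 6L, 0L, 6L),
  isRefound = c(1L, 1L, 0L, 0L, 1L),
  stringsAsFactors = TRUE
)

# ave() applies FUN within each usr group and returns one value per row,
# aligned with the original rows. For a 0/1 column, the group sum is the
# number of rows with isRefound == 1.
refoundPerUsr <- ave(toy$isRefound, toy$usr, FUN = sum)

# Division in R always yields doubles; 0 / 0 gives NaN.
toy$newvar <- toy$usrMsgCnt / refoundPerUsr
print(toy)
```

Note that a user with a non-zero usrMsgCnt but no refound rows would get `Inf` rather than `NaN`, so it may be worth deciding how such rows should be treated before using the new column.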

+0

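+1. I was thinking along the same lines. If necessary, just don't add the key in the data.table creation, so that the original row order is retained. – A5C1D2H2I1M1N2O1R2T1 2013-03-21 19:40:45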

+0

The original row order is essential, but I don't understand what you mean by 'not add the key in the data.table creation' - care to elaborate? – Rickyfox 2013-03-21 20:00:33
