2013-05-20 36 views
2

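Hello, I have the following data.frame (appended below). I would like to add an extra column of normalised counts, N.norm = N/sum(N). With an earlier data.frame that had no Date column, I was able to do this using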

oo[, N.norm := N/sum(N), by=Operator]

I tried to add Date to the grouping with

oo[, N.norm := N/sum(N), by=Operator,Date] 

but received an error message:

Error in `[.data.frame`(oo, , `:=`(N.norm, N/sum(N)), by = Operator, Date) : 
    unused argument(s) (by = Operator) 

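For example, for operator "A" in the month "Jan 2013", I have a count N for each score in c("Good", "OK", "Poor", "Crap"). I want to sum N over that combination (A and Jan 2013) and divide each count N by sum(N).

On another note, can anyone point me to a decent introduction to manipulating data.frames in R?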

structure(list(Operator = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("A", 
"D", "J", "L", "M"), class = "factor"), ROI_Score = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L, 4L), .Label = c("Crap", "Good", "OK", "Poor"), class = "factor"), 
    Date = c("Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013"), N = c(0, 0, 0, 0, 0, 1, 2, 15, 1, 5, 3, 2, 3, 
    1, 0, 3, 0, 5, 5, 1, 0, 0, 0, 1, 0, 14, 17, 16, 8, 7, 5, 
    10, 6, 1, 5, 24, 27, 31, 16, 15, 0, 0, 0, 0, 0, 26, 24, 20, 
    11, 18, 3, 4, 17, 3, 2, 20, 36, 12, 21, 9, 0, 0, 0, 0, 0, 
    3, 12, 5, 12, 4, 0, 0, 3, 4, 0, 29, 37, 41, 25, 10, 0, 0, 
    0, 0, 0, 9, 9, 15, 17, 3, 6, 4, 5, 4, 1, 14, 13, 9, 15, 9 
    )), .Names = c("Operator", "ROI_Score", "Date", "N"), row.names = c(NA, 
100L), class = "data.frame") 

I am not sure whether the data is in data.frame or data.table format. Here is my code, adapted from a solution given by Arun (reshape/remould data frame to create normalized bar chart and pie chart):

library(data.table)

df <- read.csv("/misc/jaguar_data/report/system/db_fs/roi_scores.csv")
# Get Date into a "Mon YYYY" form for faceting
df$Date <- strftime(strptime(df$Date, format = "%d/%m/%Y"), "%b %Y")
dt <- data.table(df)
ops <- as.character(unique(dt$Operator)) 
scr <- as.character(unique(dt$ROI_Score)) 
dts <- unique(dt$Date) 

oo <- setkey(dt[, .N, by="Operator,ROI_Score,Date"], Operator, 
ROI_Score,Date)[CJ(ops, scr,dts)][is.na(N), N:= 0L] 

oo[, N.norm := N/sum(N), by=Operator] 
+2

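About this additional column: should N.norm for row i be N[i]/sum(N[1...i]), but aggregated by Operator and Date? Do you really mean `data.table` rather than `data.frame`? The `:=` operator only works on a `data.table`. Please clarify which structure you are using: you gave us a data frame. –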

+0

@BryanHanson - I am not sure. I have updated my question to explain how I arrived at the data structure oo. It started out as a data.frame, but I think it is now a data.table. – moadeep

+0

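You are definitely using a `data.table`; your own code makes that clear (you start with a `data.frame` but convert it to a `data.table`). These are typically used when the data set is very large and speed is critical. Otherwise a `data.frame` is usually fine. What are you trying to compute? –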

Answers

4

Your code is (almost) perfect. There are two minor issues.

1: You are using data.table syntax, so oo should be a data.table rather than a data.frame. Simply use:

library(data.table) 
oo <- data.table(oo) 

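2: When using by with more than one column, be sure to wrap the columns in list(..) or pass them as a single comma-separated string. For example: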

oo[, N.norm := N/sum(N), by=list(Operator,Date)] 

# - or - # 
oo[, N.norm := N/sum(N), by="Operator,Date"] 

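Edit: If you want each count divided by the total for its Operator-Date group, then your code should be as above. If instead you want to divide by the total over the whole data set, then use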

oo[, N.norm := N/sum(oo$N), by=list(Operator,Date)]

Fixing those two things and using everything else exactly as you had it gives:

     Operator ROI_Score     Date  N    N.norm
  1:        A      Crap Apr 2013  0 0.0000000
  2:        A      Crap Feb 2013  0 0.0000000
  3:        A      Crap Jan 2013  0 0.0000000
  4:        A      Crap Mar 2013  0 0.0000000
  5:        A      Crap May 2013  0 0.0000000
 ---
 96:        M      Poor Apr 2013 14 0.4827586
 97:        M      Poor Feb 2013 13 0.5000000
 98:        M      Poor Jan 2013  9 0.3103448
 99:        M      Poor Mar 2013 15 0.4166667
100:        M      Poor May 2013  9 0.6923077
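
As a sanity check on output like the above, the N.norm values within each Operator-Date group should sum to 1. The following is a minimal sketch on a small made-up table; the name `toy`, its values, and the extra column `N.share` are illustrative assumptions, not the asker's data.

```r
library(data.table)

# Hypothetical toy data: two operators, two months, two scores each
toy <- data.table(
  Operator  = rep(c("A", "D"), each = 4),
  ROI_Score = rep(c("Good", "Poor"), times = 4),
  Date      = rep(rep(c("Jan 2013", "Feb 2013"), each = 2), times = 2),
  N         = c(15, 5, 2, 0, 16, 31, 17, 27)
)

# Normalise within each Operator-Date group
toy[, N.norm := N / sum(N), by = list(Operator, Date)]

# Normalise against the grand total instead (no grouping needed)
toy[, N.share := N / sum(N)]

# Per-group totals of N.norm; each should be 1
check <- toy[, list(total = sum(N.norm)), by = list(Operator, Date)]
print(check)
```

Note that a group whose counts are all zero would give 0/0, which R reports as NaN, so such groups need separate handling if they can occur.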

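Edit 2:

Just a note. In general, if you are using expressions inside [brackets], especially the assign-by-reference operator :=, then your object should be a data.table.

If you see an error such as

Error in `[.data.frame`(_<your object name>_, ... 

then it is probably because either (a) your object is not a data.table or (b) you forgot to load the data.table package.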
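
One way to rule out cause (a) is to inspect the object before using `:=`. This is a minimal sketch; `oo_df` and `oo_dt` are hypothetical stand-ins for the asker's `oo`, with made-up values.

```r
library(data.table)

# Hypothetical data.frame standing in for `oo`
oo_df <- data.frame(Operator = c("A", "A", "D"),
                    Date     = c("Jan 2013", "Jan 2013", "Jan 2013"),
                    N        = c(3, 1, 4))

class(oo_df)           # a plain data.frame
is.data.table(oo_df)   # FALSE, so `:=` inside [ ] would fail here

# Convert to a data.table (this makes a copy)
oo_dt <- as.data.table(oo_df)
is.data.table(oo_dt)   # TRUE

# Grouping columns can also be given as a character vector
oo_dt[, N.norm := N / sum(N), by = c("Operator", "Date")]
print(oo_dt)
```

data.table also provides setDT(), which converts an existing data.frame by reference instead of copying it.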

+0

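Thank you very much. I knew it had to be a simple tweak to the code I already had – moadeep

+1

@moadeep, no problem. Please see the edit note at the bottom of the answer –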

1

I don't think you can do what you want with this data set. Here is why:

install.packages("plyr") 
library("plyr") 
str(tmp) # this is your data 
count(tmp, vars = c("Operator", "ROI_Score")) 

which gives:

Operator ROI_Score freq 
1   A  Crap 5 
2   A  Good 5 
3   A  OK 5 
4   A  Poor 5 
5   D  Crap 5 
6   D  Good 5 
7   D  OK 5 
8   D  Poor 5 
9   J  Crap 5 
10  J  Good 5 
11  J  OK 5 
12  J  Poor 5 
13  L  Crap 5 
14  L  Good 5 
15  L  OK 5 
16  L  Poor 5 
17  M  Crap 5 
18  M  Good 5 
19  M  OK 5 
20  M  Poor 5 

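And including Date makes every combination unique, so each one has a count of 1.

Using the counts already in the data.frame, you should in principle be able to get what you want with: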

ans <- aggregate(N ~ Operator + ROI_Score + Date, data = tmp, FUN = sum) 

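Then change the function to do what you want (divide by 100, the number of entries?). But I am not sure this is what you are after.

Edit

Since you want the percentage of each rating category by operator and date, I would first subset and then aggregate: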

tmp2 <- subset(tmp, Operator == "A") 
ans2 <- aggregate(N ~ ROI_Score, data = tmp2, FUN = sum) 
ans2$N.norm <- ans2$N/sum(ans2$N) 

which gives:

ROI_Score N N.norm 
1  Crap 0 0.0000000 
2  Good 24 0.5106383 
3  OK 9 0.1914894 
4  Poor 14 0.2978723 
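
The subset-and-aggregate code above handles one operator at a time and pools all months together. For a plain data.frame, one possible way to get the shares for every Operator-Date group in a single pass is base R's ave(). This is only a sketch on made-up data; `tmp_toy` is a hypothetical stand-in for `tmp` with the same column names.

```r
# Hypothetical stand-in for `tmp` (illustrative values only)
tmp_toy <- data.frame(
  Operator  = c("A", "A", "A", "A", "D", "D"),
  ROI_Score = c("Good", "OK", "Poor", "Crap", "Good", "Poor"),
  Date      = c("Jan 2013", "Jan 2013", "Jan 2013", "Jan 2013",
                "Jan 2013", "Jan 2013"),
  N         = c(15, 3, 5, 0, 16, 31),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each Operator-Date group and returns a
# vector aligned with the original rows
tmp_toy$N.norm <- ave(tmp_toy$N, tmp_toy$Operator, tmp_toy$Date,
                      FUN = function(x) x / sum(x))
print(tmp_toy)
```

Because ave() keeps the original row order, the result can be assigned straight back as a new column without any merge step.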
+0

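It is not quite what I need, but I appreciate your help. In the example above, there are 4 possible scores for each operator and month. If each had a frequency of 5, the sum would be 5 + 5 + 5 + 5 = 20, and the percentage for that operator and month would be 25% for each of the four scores – moadeep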

+0

Take a look at my edit, which uses a different approach. –

+0

Excellent. Thank you very much for your time and patience – moadeep