2015-08-22 16 views
5

data.table: mark rows before/after a symbol occurs within a group

Feel free to edit this title to make it easier to understand or more general.

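I have 3 columns that define the groups of a data.table object (id, id2, pol_loc). Within each group are rows of observations, and exactly one row of each group holds an asterisk (or the group is a single NA row). I want to efficiently create an indicator column that marks each row relative to the asterisk (before = 1, after = 0). Here is what the data table looks like: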

id id2 pol_loc non_pol cluster_tag 
1: 1 1  3  do   NA 
2: 1 1  3  you   NA 
3: 1 1  3  *   NA 
4: 1 1  3  it   NA 
------------------------------------- 
5: 1 2  3  but   4 
6: 1 2  3  i   NA 
7: 1 2  3  *   NA 
8: 1 2  3 really   2 
9: 1 2  3  bad   NA 
------------------------------------- 
10: 1 2  5  but   4 
11: 1 2  5  i   NA 
12: 1 2  5 hate   NA 
13: 1 2  5 really   2 
14: 1 2  5  *   NA 
15: 1 2  5 dogs   NA 
------------------------------------- 
16: 2 1  4  i   NA 
17: 2 1  4  am   NA 
18: 2 1  4  the   NA 
19: 2 1  4  *   NA 
20: 2 1  4 friend   NA 
------------------------------------- 
21: 3 1  4  do   NA 
22: 3 1  4  you   NA 
23: 3 1  4 really   2 
24: 3 1  4  *   NA 
------------------------------------- 
25: 3 2  NA  NA   NA 
    id id2 pol_loc non_pol cluster_tag 

Desired output

Here is the desired output:

id id2 pol_loc non_pol cluster_tag before 
1: 1 1  3  do   NA  1 
2: 1 1  3  you   NA  1 
3: 1 1  3  *   NA  NA 
4: 1 1  3  it   NA  0 
---------------------------------------------- 
5: 1 2  3  but   4  1 
6: 1 2  3  i   NA  1 
7: 1 2  3  *   NA  NA 
8: 1 2  3 really   2  0 
9: 1 2  3  bad   NA  0 
---------------------------------------------- 
10: 1 2  5  but   4  1 
11: 1 2  5  i   NA  1 
12: 1 2  5 hate   NA  1 
13: 1 2  5 really   2  1 
14: 1 2  5  *   NA  NA 
15: 1 2  5 dogs   NA  0 
---------------------------------------------- 
16: 2 1  4  i   NA  1 
17: 2 1  4  am   NA  1 
18: 2 1  4  the   NA  1 
19: 2 1  4  *   NA  NA 
20: 2 1  4 friend   NA  0 
---------------------------------------------- 
21: 3 1  4  do   NA  1 
22: 3 1  4  you   NA  1 
23: 3 1  4 really   2  1 
24: 3 1  4  *   NA  NA 
---------------------------------------------- 
25: 3 2  NA  NA   NA  NA 
    id id2 pol_loc non_pol cluster_tag before 

MWE

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), 
    id2 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), pol_loc = c(3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 4L, 
    4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, NA), non_pol = c("do", "you", 
    "*", "it", "but", "i", "*", "really", "bad", "but", "i", 
    "hate", "really", "*", "dogs", "i", "am", "the", "*", "friend", 
    "do", "you", "really", "*", NA), cluster_tag = c(NA, NA, 
    NA, NA, "4", NA, NA, "2", NA, "4", NA, NA, "2", NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, "2", NA, NA)), row.names = c(NA, 
-25L), class = "data.frame", .Names = c("id", "id2", "pol_loc", 
"non_pol", "cluster_tag")) 

library(data.table) 

setDT(dat) 

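Edit: If it makes things easier or more efficient, the NA on the asterisk row can be either 0 or 1 instead; it makes no difference to me, and I'm guessing that would be more efficient.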

Answers

5

Try

dat[, before:=1-cumsum(non_pol=="*"), by=.(id, id2, pol_loc)][non_pol=="*", before:=NA,] 
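To see why the one-liner above works, it may help to trace it on a single group. The sketch below uses plain base R on the tokens from rows 5 to 9 of the question's data; the intermediate names `is_star` and `seen` are illustrative only and do not appear in the answer.

```r
# One group's tokens, copied from rows 5-9 of the question's data
non_pol <- c("but", "i", "*", "really", "bad")

is_star <- non_pol == "*"   # TRUE only on the asterisk row
seen    <- cumsum(is_star)  # 0 until the asterisk, 1 from the asterisk onward
before  <- 1 - seen         # flips that to 1 before the asterisk, 0 afterwards
before[is_star] <- NA       # the asterisk row itself is neither before nor after

print(before)
```

Within `dat[...]`, `by = .(id, id2, pol_loc)` restarts the running sum in every group, and a group whose only row is NA stays NA because `cumsum()` propagates the missing value. Note that this relies on each group having at most one asterisk; a second asterisk would push the result below 0.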

This one is much better. – akrun

Nice and simple, but I wouldn't have thought to go this route. Awesome. –

`1-cumsum` looks odd to me as a way to create a 0/1 variable. I would use `before := +(.I <= .I[which(non_pol == "*")])` or `1:.N <= which(non_pol == "*")` – Frank
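The position-based idea in the last comment can be sketched as a small per-group helper. This is only an illustration under stated assumptions: `mark_before` is a hypothetical name, it uses a strict `<` and sets the asterisk row to NA so that the result matches the desired output in the question, and it returns all NA for a group without exactly one asterisk (such as the lone NA row).

```r
# Hypothetical helper: 1 for rows before the asterisk, 0 for rows after it,
# NA on the asterisk row itself.
mark_before <- function(x) {
  star <- which(x == "*")            # which() silently drops NA comparisons
  if (length(star) != 1L) {
    # No asterisk (or more than one): nothing sensible to mark
    return(rep(NA_integer_, length(x)))
  }
  out <- as.integer(seq_along(x) < star)  # compare row position to asterisk position
  out[star] <- NA_integer_
  out
}

print(mark_before(c("do", "you", "really", "*")))
print(mark_before(NA_character_))
```

Applied to the question's table, this would be called once per group, for example `dat[, before := mark_before(non_pol), by = .(id, id2, pol_loc)]`. The explicit length check matters because `which()` returns a zero-length vector for the NA-only group, and a zero-length result cannot be assigned to a one-row group.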