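Subset rows by group and logical expression - data.table
Say I have this data. I want to subset it so that a row is kept only if it is more than 5 seconds later than the previous row of the same Color. I specifically want to use data.table for speed.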
Sample data
timestamp Color var1
1: 2015-04-04 16:56:52 red group1
2: 2015-04-04 16:56:53 red group1
3: 2015-04-04 16:56:54 red group1
4: 2015-04-04 16:57:06 red group1
5: 2015-04-04 16:57:07 red group1
6: 2015-04-04 16:57:09 red group1
7: 2015-04-04 16:57:10 red group1
8: 2015-04-04 16:57:11 red group1
9: 2015-04-04 16:57:12 red group1
10: 2015-04-04 16:57:13 red group1
11: 2015-04-04 16:57:14 red group1
12: 2015-04-04 16:57:15 red group1
13: 2015-04-04 16:57:17 red group1
14: 2015-04-04 16:57:18 red group1
15: 2015-04-04 16:57:19 red group1
16: 2015-04-04 16:57:20 red group1
17: 2015-04-04 16:57:21 red group1
18: 2015-04-04 16:57:22 red group1
19: 2015-04-04 16:57:23 red group1
20: 2015-04-04 16:57:24 red group1
21: 2015-04-04 16:57:25 red group1
22: 2015-04-04 16:57:26 red group1
23: 2015-04-04 16:57:27 red group1
24: 2015-04-04 16:57:39 red group1
25: 2015-04-04 16:57:40 red group1
26: 2015-04-04 16:57:41 red group1
27: 2015-04-04 16:58:02 red group1
28: 2015-04-04 16:58:31 yellow group1
29: 2015-04-04 16:58:31 yellow group1
30: 2015-04-04 16:58:32 yellow group1
31: 2015-04-04 16:58:34 red group1
32: 2015-04-04 16:58:35 red group1
33: 2015-04-04 16:58:36 red group1
34: 2015-04-04 16:58:37 red group1
35: 2015-04-04 16:58:38 red group1
36: 2015-04-04 16:58:39 red group1
37: 2015-04-04 16:58:40 red group1
38: 2015-04-04 16:58:41 red group1
39: 2015-04-04 16:58:42 red group1
40: 2015-04-04 16:58:43 red group1
41: 2015-04-04 16:58:44 red group1
42: 2015-04-04 16:58:45 red group1
43: 2015-04-04 16:58:46 red group1
44: 2015-04-04 16:58:47 red group1
45: 2015-04-04 16:58:48 red group1
46: 2015-04-04 16:58:49 red group1
47: 2015-04-04 16:58:50 red group1
48: 2015-04-04 16:58:51 red group1
49: 2015-04-04 16:58:52 red group1
50: 2015-04-04 16:58:53 red group1
51: 2015-04-04 16:58:54 red group1
52: 2015-04-04 16:58:55 red group1
53: 2015-04-04 16:58:56 red group1
54: 2015-04-04 16:58:57 red group1
55: 2015-04-04 16:58:58 red group1
56: 2015-04-04 16:58:59 red group1
57: 2015-04-04 16:59:00 red group1
58: 2015-04-04 16:59:01 red group1
59: 2015-04-04 16:59:02 red group1
60: 2015-04-04 16:59:03 red group1
61: 2015-04-04 16:59:04 red group1
62: 2015-04-04 16:59:05 red group1
63: 2015-04-04 16:59:06 red group1
64: 2015-04-04 16:59:07 red group1
65: 2015-04-04 16:59:08 red group1
66: 2015-04-04 16:59:09 red group1
67: 2015-04-04 16:59:10 red group1
68: 2015-04-04 16:59:11 red group1
69: 2015-04-04 16:59:12 red group1
70: 2015-04-04 16:59:13 red group1
71: 2015-04-04 16:59:14 red group1
72: 2015-04-04 16:59:15 red group1
73: 2015-04-04 16:59:16 red group1
74: 2015-04-04 16:59:17 red group1
75: 2015-04-04 16:59:18 red group1
76: 2015-04-04 16:59:19 red group1
77: 2015-04-04 16:59:20 red group1
78: 2015-04-04 16:59:21 red group1
79: 2015-04-04 16:59:22 red group1
80: 2015-04-04 16:59:23 red group1
81: 2015-04-04 16:59:24 red group1
82: 2015-04-04 16:59:25 red group1
83: 2015-04-04 16:59:26 red group1
84: 2015-04-04 16:59:27 red group1
85: 2015-04-04 16:59:28 red group1
86: 2015-04-04 16:59:29 red group1
87: 2015-04-04 16:59:33 yellow group1
88: 2015-04-04 16:59:59 yellow group1
89: 2015-04-04 17:00:00 yellow group1
90: 2015-04-04 17:00:01 yellow group1
91: 2015-04-04 17:00:02 yellow group1
92: 2015-04-04 17:00:03 yellow group1
93: 2015-04-04 17:00:32 yellow group1
94: 2015-04-04 17:00:33 yellow group1
95: 2015-04-04 17:00:45 red group1
96: 2015-04-04 17:00:46 red group1
97: 2015-04-04 17:00:47 yellow group1
98: 2015-04-04 17:00:47 red group1
99: 2015-04-04 17:00:48 yellow group1
100: 2015-04-04 17:00:49 yellow group1
timestamp Color var1
This is what I have so far:
DT[DT[, .I[timestamp - lag(timestamp)>5], by = Color]$V1]
Which gives me this:
timestamp Color var1
1: <NA> NA NA
2: 2015-04-04 16:57:06 red group1
3: 2015-04-04 16:57:39 red group1
4: 2015-04-04 16:58:02 red group1
5: 2015-04-04 16:58:34 red group1
6: 2015-04-04 17:00:45 red group1
7: <NA> NA NA
8: 2015-04-04 16:59:33 yellow group1
9: 2015-04-04 16:59:59 yellow group1
10: 2015-04-04 17:00:32 yellow group1
11: 2015-04-04 17:00:47 yellow group1
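This seems to work OK. However, I also want to keep the first row of each group (Color). Those rows come back as NA because the logical expression evaluates to NA for them (the first row of a group has no previous row to compare against). Is there a way to do this and keep the first rows in a single expression, without having to re-insert those rows afterwards?
Data for reproducible example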
DT <- structure(list(timestamp = structure(c(1428181012, 1428181013,
1428181014, 1428181026, 1428181027, 1428181029, 1428181030, 1428181031,
1428181032, 1428181033, 1428181034, 1428181035, 1428181037, 1428181038,
1428181039, 1428181040, 1428181041, 1428181042, 1428181043, 1428181044,
1428181045, 1428181046, 1428181047, 1428181059, 1428181060, 1428181061,
1428181082, 1428181111, 1428181111, 1428181112, 1428181114, 1428181115,
1428181116, 1428181117, 1428181118, 1428181119, 1428181120, 1428181121,
1428181122, 1428181123, 1428181124, 1428181125, 1428181126, 1428181127,
1428181128, 1428181129, 1428181130, 1428181131, 1428181132, 1428181133,
1428181134, 1428181135, 1428181136, 1428181137, 1428181138, 1428181139,
1428181140, 1428181141, 1428181142, 1428181143, 1428181144, 1428181145,
1428181146, 1428181147, 1428181148, 1428181149, 1428181150, 1428181151,
1428181152, 1428181153, 1428181154, 1428181155, 1428181156, 1428181157,
1428181158, 1428181159, 1428181160, 1428181161, 1428181162, 1428181163,
1428181164, 1428181165, 1428181166, 1428181167, 1428181168, 1428181169,
1428181173, 1428181199, 1428181200, 1428181201, 1428181202, 1428181203,
1428181232, 1428181233, 1428181245, 1428181246, 1428181247, 1428181247,
1428181248, 1428181249), class = c("POSIXct", "POSIXt"), tzone = ""),
Color = c("red", "red", "red", "red", "red", "red", "red",
"red", "red", "red", "red", "red", "red", "red", "red", "red",
"red", "red", "red", "red", "red", "red", "red", "red", "red",
"red", "red", "yellow", "yellow", "yellow", "red", "red",
"red", "red", "red", "red", "red", "red", "red", "red", "red",
"red", "red", "red", "red", "red", "red", "red", "red", "red",
"red", "red", "red", "red", "red", "red", "red", "red", "red",
"red", "red", "red", "red", "red", "red", "red", "red", "red",
"red", "red", "red", "red", "red", "red", "red", "red", "red",
"red", "red", "red", "red", "red", "red", "red", "red", "red",
"yellow", "yellow", "yellow", "yellow", "yellow", "yellow",
"yellow", "yellow", "red", "red", "yellow", "red", "yellow",
"yellow"), var1 = c("group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1",
"group1", "group1", "group1", "group1", "group1", "group1"
)), .Names = c("timestamp", "Color", "var1"), row.names = c(NA,
-100L), class = c("data.table", "data.frame"))
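A minimal sketch of one way to keep each group's first row in a single expression, under stated assumptions rather than as the accepted answer. It uses a small hypothetical table (not the data above) with the same column names, `data.table::shift()` in place of `lag()`, and `difftime(..., units = "secs")` so the comparison with 5 does not depend on the automatically chosen time unit. The first row of each Color has an NA gap, and `is.na(gap)` keeps it instead of letting `.I[NA]` produce an NA row.

```r
library(data.table)

# Small hypothetical table (not the question's data), same column names
DT <- data.table(
  timestamp = as.POSIXct("2015-04-04 16:56:52", tz = "UTC") +
    c(0, 1, 14, 15, 40, 41, 60),
  Color     = c("red", "red", "red", "yellow", "yellow", "red", "yellow"),
  var1      = "group1"
)

# Row numbers to keep, computed per Color
idx <- DT[, {
  # seconds since the previous row of the same Color; NA for the group's first row
  gap <- as.numeric(difftime(timestamp, shift(timestamp), units = "secs"))
  # keep the first row (NA gap) or any row more than 5 seconds after the previous one
  .I[is.na(gap) | gap > 5]
}, by = Color]$V1

# sort() restores the original row order; without it rows come back grouped by Color
res <- DT[sort(idx)]
```

Dropping `sort()` gives output grouped by Color, as in the result shown in the question.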
I think it is better to do this in two steps (the NA rows can be removed with `fill`): `DT1 <- DT[DT[, .I[(timestamp - shift(timestamp, fill = timestamp[1L])) > 5], by = Color]$V1]; DT2 <- DT[, .SD[1L], Color]; rbindlist(list(DT1, setcolorder(DT2, names(DT1))))[order(timestamp, Color)]` – akrun
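Interesting. If we had more grouping variables than just `Color`, would it be OK to simply put `list(Color, Var2, Var3)` wherever `Color` appears in each line? – jalapic
I posted a compact solution below. I guess that is what you wanted. With more variables, yes, the `rbindlist` solution needs more typing, because we have to put them in a `list` or use `.(Color, Var, ..)` – akrun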
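To illustrate the point about several grouping variables, here is a hedged sketch that groups with `by = .(Color, Var2)`. The `Var2` column and the sample values are hypothetical, and the logical `keep` column is just one possible approach, not the compact solution akrun refers to.

```r
library(data.table)

# Hypothetical table with a second grouping column Var2 (not in the question's data)
DT <- data.table(
  timestamp = as.POSIXct("2015-04-04 16:56:52", tz = "UTC") +
    c(0, 2, 3, 11, 30, 31),
  Color     = c("red", "red", "red", "red", "yellow", "yellow"),
  Var2      = c("a", "b", "a", "b", "a", "a")
)

# Flag rows within each (Color, Var2) combination:
# TRUE for the first row of the combination or a gap of more than 5 seconds
DT[, keep := {
  gap <- as.numeric(difftime(timestamp, shift(timestamp), units = "secs"))
  is.na(gap) | gap > 5
}, by = .(Color, Var2)]

# Subset on the flag; original row order is preserved
res <- DT[(keep), .(timestamp, Color, Var2)]
```

Because the flag is added by reference, `DT` keeps the `keep` column afterwards; `DT[, keep := NULL]` removes it if it is not wanted.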