2013-11-15

R: losing data during subsetting (data.table)

I am losing data when I try to subset my data.table.

Here is the file I am reading from:

Timestamp,Date,Time,SN,A.Ms.Amp,A.Ms.Vol,A.Ms.Watt,Pac 
2013-10-01 12:00:00,2013-10-01,12:00:00,2110000001,23.04,465.43,10723,13544.5 
2013-10-01 12:00:00,2013-10-01,12:00:00,2110000002,7.81,474.16,3704,6860 
2013-10-01 12:00:00,2013-10-01,12:00:00,2110000003,6.97,484.19,3374,6661 
2013-10-01 12:05:00,2013-10-01,12:05:00,2110000001,23.19,467.05,10830,13576 
2013-10-01 12:05:00,2013-10-01,12:05:00,2110000002,8.4,462.52,3883.5,7366.5 
2013-10-01 12:05:00,2013-10-01,12:05:00,2110000003,7.72,470.6,3631,7169 
2013-10-01 12:10:00,2013-10-01,12:10:00,2110000001,23.98,470.29,11278.5,14127.5 
2013-10-01 12:10:00,2013-10-01,12:10:00,2110000002,8.62,458.47,3952,7475.5 
2013-10-01 12:10:00,2013-10-01,12:10:00,2110000003,7.9,462.62,3654,7182.33 
2013-10-01 12:15:00,2013-10-01,12:15:00,2110000001,24.27,467.37,11342,14193 
2013-10-01 12:15:00,2013-10-01,12:15:00,2110000002,8.61,458.96,3949,7502 
2013-10-01 12:15:00,2013-10-01,12:15:00,2110000003,8.13,458.31,3725,7338 
2013-10-01 12:20:00,2013-10-01,12:20:00,2110000001,22.3,461.71,10279.5,12735.5 
2013-10-01 12:20:00,2013-10-01,12:20:00,2110000002,8.51,461.87,3929,7553.5 
2013-10-01 12:20:00,2013-10-01,12:20:00,2110000003,7.83,462.19,3618.5,7331.5 

With that written to a .csv, this is the code I ran:

library(data.table)
library(lubridate) # provides ymd_hms() and ymd() used below
a<-fread("complete1.csv") 
a[,`:=`(Timestamp=ymd_hms(Timestamp), 
Date=ymd(Date), 
SN=as.factor(SN))] 
a[SN==c("2110000001","2110000002"),c("Timestamp","Date","Time","SN","A.Ms.Watt","Pac"),with=FALSE] 

I get this output:

> a[SN==c("2110000001","2110000002"),c("Timestamp","Date","Time","SN","A.Ms.Watt","Pac"),with=FALSE] 
      Timestamp  Date  Time   SN A.Ms.Watt  Pac 
1: 2013-10-01 12:00:00 2013-10-01 12:00:00 2110000001 10723.0 13544.5 
2: 2013-10-01 12:00:00 2013-10-01 12:00:00 2110000002 3704.0 6860.0 
3: 2013-10-01 12:10:00 2013-10-01 12:10:00 2110000001 11278.5 14127.5 
4: 2013-10-01 12:10:00 2013-10-01 12:10:00 2110000002 3952.0 7475.5 
5: 2013-10-01 12:20:00 2013-10-01 12:20:00 2110000001 10279.5 12735.5 
6: 2013-10-01 12:20:00 2013-10-01 12:20:00 2110000002 3929.0 7553.5 
Warning messages: 
1: In is.na(e1) | is.na(e2) : 
    longer object length is not a multiple of shorter object length 
2: In `==.default`(SN, c("2110000001", "2110000002")) : 
    longer object length is not a multiple of shorter object length 

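Unfortunately, I don't quite understand the warnings. However, I am losing the data for every 12:x5:00 timestamp (for example 12:05:00 and 12:15:00). What might I be doing wrong?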

Answer


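This is not a data.table problem; it is a case of using the wrong operator. The == operator is vectorized: it compares its two operands element by element, and the shorter operand is recycled to the length of the longer one. Here the two-element vector is recycled along the 15 rows, so odd-numbered rows are compared only with "2110000001" and even-numbered rows only with "2110000002". See what happens when you look at the comparison itself: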

a[,list(Timestamp,SN, SN == c("2110000001","2110000002"))] 

       Timestamp   SN V3 
1: 2013-10-01 12:00:00 2110000001 TRUE 
2: 2013-10-01 12:00:00 2110000002 TRUE 
3: 2013-10-01 12:00:00 2110000003 FALSE 
4: 2013-10-01 12:05:00 2110000001 FALSE 
5: 2013-10-01 12:05:00 2110000002 FALSE 
6: 2013-10-01 12:05:00 2110000003 FALSE 
7: 2013-10-01 12:10:00 2110000001 TRUE 
8: 2013-10-01 12:10:00 2110000002 TRUE 
9: 2013-10-01 12:10:00 2110000003 FALSE 
10: 2013-10-01 12:15:00 2110000001 FALSE 
11: 2013-10-01 12:15:00 2110000002 FALSE 
12: 2013-10-01 12:15:00 2110000003 FALSE 
13: 2013-10-01 12:20:00 2110000001 TRUE 
14: 2013-10-01 12:20:00 2110000002 TRUE 
15: 2013-10-01 12:20:00 2110000003 FALSE 
Warning message: 
In SN == c("2110000001", "2110000002") : 
    longer object length is not a multiple of shorter object length 

This is documented in the R Language Definition manual, in the Operators section:

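"R deals with entire vectors of data at a time, and most of the elementary operators and basic mathematical functions like log are vectorized (as indicated in the table above). This means that e.g. adding two vectors of the same length will create a vector containing the element-wise sums, implicitly looping over the vector index. This applies also to other operators like -, * and / as well as to higher dimensional structures."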
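As a rough illustration of how that element-wise behaviour produces the pattern in the output above, here is a minimal base R sketch. The vectors `sn` and `wanted` are made-up stand-ins for the SN column and the comparison vector, not objects from the question.

```r
# Toy vector mirroring the SN column: three serial numbers per timestamp,
# five timestamps (hypothetical stand-in for a$SN)
sn <- rep(c("2110000001", "2110000002", "2110000003"), times = 5)
wanted <- c("2110000001", "2110000002")

# `==` recycles `wanted` along `sn`: element i of `sn` is compared with
# wanted[(i - 1) %% 2 + 1], not with every value in `wanted`.
# 15 is not a multiple of 2, hence the warning seen in the question.
recycled <- suppressWarnings(sn == wanted)

# The same comparison with the recycling written out by hand
manual <- sn == wanted[(seq_along(sn) - 1) %% 2 + 1]

print(identical(recycled, manual))
print(which(recycled))
```

The positions printed by `which(recycled)` should line up with the TRUE rows in the V3 column shown earlier: a row matches only when its serial number happens to sit opposite the same value in the recycled vector.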

If you want TRUE whenever SN is either of the values in c("2110000001","2110000002"), use %in%, like this:

SN %in% c("2110000001","2110000002")
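Applied to the subsetting call from the question, the fix might look like the sketch below. The small table built here is an assumption for illustration only: it reuses a few rows and columns of the posted data rather than reading complete1.csv.

```r
library(data.table)

# Small stand-in for the question's table (a subset of its rows and columns)
a <- data.table(
  Timestamp = rep(c("2013-10-01 12:00:00", "2013-10-01 12:05:00"), each = 3),
  SN        = factor(rep(c("2110000001", "2110000002", "2110000003"), times = 2)),
  A.Ms.Watt = c(10723, 3704, 3374, 10830, 3883.5, 3631),
  Pac       = c(13544.5, 6860, 6661, 13576, 7366.5, 7169)
)

# %in% tests each SN against the whole set of wanted values,
# so no rows are dropped by recycling and no warning is raised
res <- a[SN %in% c("2110000001", "2110000002"),
         c("Timestamp", "SN", "A.Ms.Watt", "Pac"), with = FALSE]
print(res)
```

With `%in%`, the 12:05:00 rows for both serial numbers are kept, which is what the original `==` comparison silently dropped.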