2015-11-06 61 views
2

Replace NA with neighboring values in a time series or within the same column - data.table approach

Sample data

df <- data.frame(id=c("A","A","A","A","B","B","B","B"),year=c(2014,2014,2015,2015),month=c(1,2),marketcap=c(4,6,2,6,23,2,5,34),return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6)) 

df 
    id year month marketcap return 
1: A 2014  1   4  NA 
2: A 2014  2   6 0.23 
3: A 2015  1   2 0.20 
4: A 2015  2   6 0.10 
5: B 2014  1  23 0.40 
6: B 2014  2   2 0.90 
7: B 2015  1   5  NA 
8: B 2015  2  34 0.60 

Desired data

desired_df <- data.frame(id=c("A","A","A","A","B","B","B","B"),year=c(2014,2014,2015,2015),month=c(1,2),marketcap=c(4,6,2,6,23,2,5,34),return=c(0.23,0.23,0.2,0.1,0.4,0.9,0.75,0.6)) 

desired_df 
    id year month marketcap return 
1 A 2014  1   4 0.23 
2 A 2014  2   6 0.23 
3 A 2015  1   2 0.20 
4 A 2015  2   6 0.10 
5 B 2014  1  23 0.40 
6 B 2014  2   2 0.90 
7 B 2015  1   5 0.75 
8 B 2015  2  34 0.60 

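I want to interpolate return by id, replacing each NA with a value derived from its neighbors in the time series. Assume there are only two months per year: 1 and 2. The second NA (B, 2015, 1) is replaced with 0.75 = (0.9 + 0.6) / 2. The first NA (A, 2014, 1) is replaced with 0.23, because there is no earlier data.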
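
To make the arithmetic above concrete, here is a minimal base R sketch of the rule for a single id, using approx() with rule = 2. The helper name fill_linear is made up for illustration, and it assumes the group has at least two non-NA values.

```r
# Hypothetical helper (not from the original post): linear interpolation over
# row position; rule = 2 carries the nearest observed value out to the edges.
fill_linear <- function(y) {
  ok <- !is.na(y)
  approx(x = which(ok), y = y[ok], xout = seq_along(y), rule = 2)$y
}

ret_A <- c(NA, 0.23, 0.2, 0.1)  # returns for id A
ret_B <- c(0.4, 0.9, NA, 0.6)   # returns for id B

fill_A <- fill_linear(ret_A)    # leading NA takes the next observed value
fill_B <- fill_linear(ret_B)    # interior NA becomes (0.9 + 0.6) / 2
print(fill_A)
print(fill_B)
```

The interior NA is the midpoint of its two neighbors only because the rows are evenly spaced; with uneven gaps the result would be weighted by position.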

A data.table solution is preferred, if possible.

UPDATE: When I use the code below (which works for the sample data):

df[,returnInterpolate:=na.approx(return,rule=2), by=id] 

I get this error: Error in approx(x[!na], y[!na], xout, ...) : need at least two non-NA values to interpolate

I guess some ids may have no non-NA values to interpolate from. Any suggestions?
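
One way to check that guess is to count the non-NA returns per id and list the ids with fewer than two. This is only a sketch on made-up toy data: ids "C" and "D" and the names dt, non_na_counts and problem_ids are assumptions for illustration, not part of the real data set.

```r
library(data.table)

# Toy data (made up): id "C" has a single non-NA return and id "D" has none,
# so neither has the two points that linear interpolation needs.
dt <- data.table(
  id     = rep(c("A", "B", "C", "D"), each = 4),
  return = c(NA, 0.23, 0.2, 0.1,
             0.4, 0.9, NA, 0.6,
             NA, NA, 0.3, NA,
             NA, NA, NA, NA)
)

# Count the non-NA returns per id, then keep the ids with fewer than two.
non_na_counts <- dt[, .(n_non_na = sum(!is.na(return))), by = id]
problem_ids   <- non_na_counts[n_non_na < 2]
print(problem_ids)
```

If problem_ids comes back non-empty on the real data, those groups need separate handling before interpolating.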


'library(zoo); help("na.approx")' – Roland


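Dear Roland, how do I use na.approx? I want to interpolate by id. By the way, I just edited the question; I am also looking for a data.table solution so I can learn more of the syntax. –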

Answer

4
library(data.table) 
df <- data.frame(id=c("A","A","A","A","B","B","B","B"), 
       year=c(2014,2014,2015,2015), 
       month=c(1,2), 
       marketcap=c(4,6,2,6,23,2,5,34), 
       return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6)) 
setDT(df) 
library(zoo) 
df[, returnInterpol := na.approx(return, rule = 2), by = id] 
# id year month marketcap return returnInterpol 
#1: A 2014  1   4  NA   0.23 
#2: A 2014  2   6 0.23   0.23 
#3: A 2015  1   2 0.20   0.20 
#4: A 2015  2   6 0.10   0.10 
#5: B 2014  1  23 0.40   0.40 
#6: B 2014  2   2 0.90   0.90 
#7: B 2015  1   5  NA   0.75 
#8: B 2015  2  34 0.60   0.60 

Edit:

If you have groups with only NA values, or with only one non-NA value, you can do this:

df <- data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C"), 
       year=c(2014,2014,2015,2015), 
       month=c(1,2), 
       marketcap=c(4,6,2,6,23,2,5,34, 1:4), 
       return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6,NA,NA,0.3,NA)) 
setDT(df) 
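# Pick the fill strategy from the number of non-NA returns in each id group
df[, returnInterpol := switch(as.character(sum(!is.na(return))), 
           "0" = return,              # all NA: nothing to interpolate from
           "1" = {na.omit(return)},   # one value: recycled across the group
           na.approx(return, rule = 2)), by = id] 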

#  id year month marketcap return returnInterpol 
# 1: A 2014  1   4  NA   0.23 
# 2: A 2014  2   6 0.23   0.23 
# 3: A 2015  1   2 0.20   0.20 
# 4: A 2015  2   6 0.10   0.10 
# 5: B 2014  1  23 0.40   0.40 
# 6: B 2014  2   2 0.90   0.90 
# 7: B 2015  1   5  NA   0.75 
# 8: B 2015  2  34 0.60   0.60 
# 9: C 2014  1   1  NA   0.30 
# 10: C 2014  2   2  NA   0.30 
# 11: C 2015  1   3 0.30   0.30 
# 12: C 2015  2   4  NA   0.30 
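
For comparison, the same three cases as the switch() above can be written as a small helper built on base R's approx(), which avoids the zoo dependency. This is a sketch, not the answer's method: interpolate_safe and the all-NA id "D" are assumptions added for illustration.

```r
library(data.table)

# Hypothetical helper (not part of the answer): handles 0, 1, or 2+ non-NA values.
interpolate_safe <- function(y) {
  ok   <- !is.na(y)
  n_ok <- sum(ok)
  if (n_ok == 0L) return(y)                       # all NA: leave unchanged
  if (n_ok == 1L) return(rep(y[ok], length(y)))   # one value: spread it
  approx(x = which(ok), y = y[ok], xout = seq_along(y), rule = 2)$y
}

# Toy data: ids A to C as in the answer, plus a made-up all-NA id "D".
dt <- data.table(
  id     = rep(c("A", "B", "C", "D"), each = 4),
  return = c(NA, 0.23, 0.2, 0.1,
             0.4, 0.9, NA, 0.6,
             NA, NA, 0.3, NA,
             NA, NA, NA, NA)
)
dt[, returnInterpol := interpolate_safe(return), by = id]
print(dt)
```

Whether spreading a single observed value across the whole group is appropriate depends on the data; leaving such groups as NA is an equally defensible choice.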

@PhamCongMinh Please mark the answer as accepted if it helped. – 2015-11-06 08:24:05


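Did you see that I marked it, Pascal? Dear Roland, I have updated the question with another problem I ran into when using the suggestion above; please take a look. –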


@PhamCongMinh I don't see any green check mark. – 2015-11-06 09:39:47