2017-02-17

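Efficiently convert longitudinal table to wide format in data.table

I am working in R with long tables stored as data.table, containing values recorded at value changes for variables of numeric and character type. When I want to run functions such as correlations or regressions, I have to convert the table to wide format and homogenise the timestamp frequency.

I found a way to convert the long table to wide, but I don't think it is really efficient, and I would like to know whether there is a better, more data.table-native approach.

In the reproducible example below, I include the two options I found for the long-to-wide transformation, and I indicate in the comments which parts I believe are not optimal.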

library(zoo) 
library(data.table) 
dt<-data.table(time=1:6,variable=factor(letters[1:6]),numeric=c(1:3,rep(NA,3)), 
       character=c(rep(NA,3),letters[1:3]),key="time") 
print(dt) 
print(dt[,lapply(.SD,typeof)]) 

#option 1 

casted<-dcast(dt,time~variable,value.var=c("numeric","character")) 
# types are correct, but I got NA filled columns, 
# is there an option like drop 
# available for columns instead of rows? 
print(casted) 
print(casted[,lapply(.SD,typeof)]) 


# This drop looks ugly but I did not figure out a better way to perform it 
casted[,names(casted)[unlist(casted[,lapply(lapply(.SD,is.na),all)])]:=NULL] 

# I perform a LOCF, I do not know if I could benefit of 
# data.table's roll option somehow and avoid 
# the temporal memory copy of my dataset (this would be the second 
# and minor issue) 
casted<-na.locf(casted) 

#option2 

# taken from http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r 
coalesce2 <- function(...) { 
    Reduce(function(x, y) { 
    i <- which(is.na(x)) 
    x[i] <- y[i] 
    x}, 
    list(...)) 
} 


casted2<-dcast(dt[,coalesce2(numeric,character),by=c("time","variable")], 
     time~variable,value.var="V1") 
# There are not NA columns but types are incorrect 
# it takes more space in a real table (more observations, less variables) 
print(casted2) 
print(casted2[,lapply(.SD,typeof)]) 

# Again, I am pretty sure there is a prettier way to do this 
numericvars<-names(casted2)[!unlist(casted2[,lapply(
    lapply(lapply(.SD,as.numeric),is.na),all)])] 
casted2[,eval(numericvars):=lapply(.SD,as.numeric),.SDcols=numericvars] 

# same as option 1, is there a data.table native way to do it? 
casted2<-na.locf(casted2) 

Any suggestion or improvement to this process is welcome.
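On the column-drop step in option 1: a somewhat more readable way to remove the all-NA columns is to compute their names first and then delete them by reference. This is only a sketch on the toy data above; `na_cols` is a name introduced here for illustration, and the `numeric_a`-style column names are what `dcast` produces when it is given several `value.var` columns.

```r
library(data.table)

# Same toy data as in the question
dt <- data.table(time = 1:6, variable = factor(letters[1:6]),
                 numeric = c(1:3, rep(NA, 3)),
                 character = c(rep(NA, 3), letters[1:3]), key = "time")

casted <- dcast(dt, time ~ variable, value.var = c("numeric", "character"))

# Names of the columns that contain nothing but NA
na_cols <- names(casted)[sapply(casted, function(x) all(is.na(x)))]

# Delete them by reference (no copy of the table); skip if there are none
if (length(na_cols) > 0) casted[, (na_cols) := NULL]

print(names(casted))
```

This does not avoid creating the empty columns in the first place; it only makes the clean-up step easier to read.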

Answer

I would probably cast the char and num tables separately and then rbind them:

k  = "time" 
typecols = c("numeric", "character") 

res = rbindlist(fill = TRUE, 
    lapply(typecols, function(tc){ 
    cols = c(k, tc, "variable") 
    dt[!is.na(get(tc)), ..cols][, dcast(.SD, ... ~ variable, value.var=tc)] 
    }) 
) 

setorderv(res, k) 
res[, setdiff(names(res), k) := lapply(.SD, zoo::na.locf, na.rm = FALSE), .SDcols=!k] 

which gives

   time a  b  c  d  e  f
1:    1 1 NA NA NA NA NA
2:    2 1  2 NA NA NA NA
3:    3 1  2  3 NA NA NA
4:    4 1  2  3  a NA NA
5:    5 1  2  3  a  b NA
6:    6 1  2  3  a  b  c

Note that the OP's final result, casted2, differs in that it has all columns as character.
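The type difference can be checked directly, and the LOCF step can also be done without zoo. The following is a minimal sketch, assuming data.table >= 1.12.4, where `nafill` and `setnafill` are available; to my knowledge they accept only numeric columns, so the character columns are filled through an index trick. The names `num_cols`, `chr_cols` and `locf_chr` are introduced here for illustration only.

```r
library(data.table)

# Same toy data as in the question
dt <- data.table(time = 1:6, variable = factor(letters[1:6]),
                 numeric = c(1:3, rep(NA, 3)),
                 character = c(rep(NA, 3), letters[1:3]), key = "time")

k <- "time"
typecols <- c("numeric", "character")

# Cast each value column separately, then stack the results
res <- rbindlist(fill = TRUE,
  lapply(typecols, function(tc) {
    cols <- c(k, tc, "variable")
    dcast(dt[!is.na(get(tc)), ..cols], ... ~ variable, value.var = tc)
  })
)
setorderv(res, k)

# Split the value columns by type
num_cols <- setdiff(names(res)[sapply(res, is.numeric)], k)
chr_cols <- setdiff(names(res), c(k, num_cols))

# Numeric columns: LOCF by reference, no copy of the table
setnafill(res, type = "locf", cols = num_cols)

# Character columns: carry forward the index of the last non-NA element
locf_chr <- function(x) {
  idx <- nafill(replace(seq_along(x), is.na(x), NA_integer_), type = "locf")
  x[idx]
}
if (length(chr_cols) > 0) {
  res[, (chr_cols) := lapply(.SD, locf_chr), .SDcols = chr_cols]
}

# Column types are preserved, unlike in casted2
print(sapply(res, typeof))
```

Leading NAs stay NA in both cases, which matches the behaviour of `zoo::na.locf(..., na.rm = FALSE)` used above.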


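You are right about casted2; there is some strange behaviour there when I run the line with casted2[, eval(numericvars) := ..., yet the types are converted correctly. I don't know why this happens. Should I open a question or file a bug? Apart from that, your solution is much more elegant than mine. I think there could be some duplicates in the real dataset when a character and a numeric value occur at the same time, but that will be easy to handle from this point. Thanks a lot.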
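The duplicate timestamps mentioned in the comment could be collapsed before the LOCF step. A minimal sketch, assuming that each column holds at most one non-NA value per timestamp; the table `stacked` and the helper `first_non_na` are hypothetical and stand in for the result of the `rbindlist` call in the answer.

```r
library(data.table)

# Hypothetical stacked result in which time 2 appears twice: once from the
# numeric cast and once from the character cast
stacked <- data.table(time = c(1L, 2L, 2L, 3L),
                      a = c(1L, 2L, NA, NA),
                      d = c(NA, NA, "x", "y"))

# First non-NA value of a vector, or NA of the same type if there is none
first_non_na <- function(x) x[!is.na(x)][1L]

# One row per timestamp, keeping each column's original type
collapsed <- stacked[, lapply(.SD, first_non_na), keyby = "time"]

print(collapsed)
```

If a column can carry several distinct values at the same timestamp, a different rule (for example, keeping the last value) would have to be chosen instead.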
