2
我想計算data.table
中所有變量之間的關聯度量。 (這是不了統計的問題,但作爲一旁:所述變量的所有因素,並度量是Cramér's V。)所有變量對上的快速交叉表和統計量
實施例的數據集:
p = 50; n = 1e5; # actual dataset has p > 1e3, n > 1e5, much wider but barely longer
set.seed(1234)
obs <- as.data.table(
data.frame(
cbind(matrix(sample(c(LETTERS[1:4],NA), n*(p/2), replace=TRUE),
nrow=n, ncol=p/2),
matrix(sample(c(letters[1:6],NA), n*(p/2), replace=TRUE),
nrow=n, ncol=p/2)),
stringsAsFactors=TRUE))
我目前使用的
分裂申請-combine方法,它涉及通過所有索引對循環(通過plyr::adply
)併爲每一對返回一行。 (我試圖並行adply
但失敗了。)
# Calculate Cramér's V between all variables -- my kludgey approach
pairs <- t(combn(ncol(obs), 2)) # nx2 matrix contains indices of upper triangle of df
# library('doParallel') # I tried to parallelize -- bonus points for help here (Win 7)
# cl <- makeCluster(8)
# registerDoParallel(cl)
library('plyr')
out <- adply(pairs, 1, function(ix) {
complete_cases <- obs[,which(complete.cases(.SD)), .SDcols=ix]
chsq <- chisq.test(x= dcast(data = obs[complete_cases, .SD, .SDcols=ix],
formula = paste(names(obs)[ix], collapse='~'),
value.var = names(obs)[ix][1], # arbitrary
fun.aggregate=length)[,-1, with=FALSE])
return(data.table(index_1 = ix[1],
var_1 = names(obs)[ix][1],
index_2 = ix[2],
var_2 = names(obs)[ix][2],
cramers_v = sqrt(chsq$statistic/
(sum(chsq$observed) *
(pmin(nrow(chsq$observed),
ncol(chsq$observed)) -1 ))
))
)
})[,-1] #}, .parallel = TRUE)[,-1] # using .parallel returns Error in do.ply(i) :
# task 1 failed - "object 'obs' not found"
out <- data.table(out) # adply won't return a data.table
# stopCluster(cl)
什麼是我的加快這種計算方式?我的挑戰是將pairs
的行方式操作轉換爲obs
中的列方式計算。我想知道是否可以將列對直接生成J
,但Force對這個data.table padawan的強度不夠。
相關問題:[如何在R中進行交叉連接](http://stackoverflow.com/a/14165493/2573061) – C8H10N4O2