2015-11-02 32 views
2

我想計算data.table中所有變量之間的關聯度量。 (這是了統計的問題,但作爲一旁:所述變量的所有因素,並度量是Cramér's V。)所有變量對上的快速交叉表和統計量

實施例的數據集:

p = 50; n = 1e5; # actual dataset has p > 1e3, n > 1e5, much wider but barely longer 
set.seed(1234) 
obs <- as.data.table( 
     data.frame(
      cbind(matrix(sample(c(LETTERS[1:4],NA), n*(p/2), replace=TRUE), 
         nrow=n, ncol=p/2), 
        matrix(sample(c(letters[1:6],NA), n*(p/2), replace=TRUE), 
         nrow=n, ncol=p/2)), 
     stringsAsFactors=TRUE)) 
我目前使用的

分裂申請-combine方法,它涉及通過所有索引對循環(通過plyr::adply)併爲每一對返回一行。 (我試圖並行adply但失敗了。)

# Calculate Cramér's V between all variables -- my kludgey approach 

pairs <- t(combn(ncol(obs), 2)) # nx2 matrix contains indices of upper triangle of df 

# library('doParallel') # I tried to parallelize -- bonus points for help here (Win 7) 
# cl <- makeCluster(8) 
# registerDoParallel(cl) 
library('plyr') 
out <- adply(pairs, 1, function(ix) { 
     complete_cases <- obs[,which(complete.cases(.SD)), .SDcols=ix] 
     chsq <- chisq.test(x= dcast(data = obs[complete_cases, .SD, .SDcols=ix], 
            formula = paste(names(obs)[ix], collapse='~'), 
            value.var = names(obs)[ix][1], # arbitrary 
            fun.aggregate=length)[,-1, with=FALSE]) 
     return(data.table(index_1 = ix[1], 
          var_1 = names(obs)[ix][1], 
          index_2 = ix[2], 
          var_2 = names(obs)[ix][2], 
          cramers_v = sqrt(chsq$statistic/
              (sum(chsq$observed) * 
               (pmin(nrow(chsq$observed), 
                 ncol(chsq$observed)) -1 )) 
         )) 
     ) 
     })[,-1] #}, .parallel = TRUE)[,-1] # using .parallel returns Error in do.ply(i) : 
             # task 1 failed - "object 'obs' not found" 
out <- data.table(out) # adply won't return a data.table 
# stopCluster(cl) 

什麼是我的加快這種計算方式?我的挑戰是將pairs的行方式操作轉換爲obs中的列方式計算。我想知道是否可以將列對直接生成J,但Force對這個data.table padawan的強度不夠。

+0

相關問題:[如何在R中進行交叉連接](http://stackoverflow.com/a/14165493/2573061) – C8H10N4O2

回答

1

首先,我會「長」數據格式去如下:

obs[, id := 1:n] 
mobs <- melt(obs, id.vars = 'id') 

下一組數據表setkeyv(mobs, 'id')關鍵。

最後,遍歷變量,做計算的對:

out <- list() 
for(i in 1:p) { 
    vari <- paste0('X', i) 
    tmp <- mobs[mobs[variable == vari]] 
    nn <- tmp[!(is.na(value) | is.na(i.value)), list(i.variable = i.variable[1], nij = length(id)), keyby = list(variable, value, i.value)] 
    cj <- nn[, CJ(value = value, i.value = i.value, sorted = FALSE, unique = TRUE), by = variable] 
    setkeyv(cj, c('variable', 'value', 'i.value')) 
    nn <- nn[cj] 
    nn[is.na(nij), nij := 0] 
    nn[, ni := sum(nij), by = list(variable, i.value)] 
    nn[, nj := sum(nij), by = list(variable, value)] 
    nn[, c('n', 'r', 'k') := list(sum(nij), length(unique(i.value)), length(unique(value))), by = variable] 
    out[[i]] <- nn[, list(i.variable = vari, cramers_v = (sqrt(sum((nij - ni * nj/n)^2/(ni * nj/n))/n[1])/min(k[1] - 1, r[1] - 1))), by = variable] 
} 
out <- rbindlist(out) 

所以你需要通過變量的一次迭代。正如你所看到的,我也不會使用chisq.test,並自己寫計算。