2016-01-23 55 views

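Concordance of three variables in R and plotting the data

I have a data frame named mydf. It contains three groups of columns, labelled app, ora and pin. I want to compare all column values pairwise (app vs ora, ora vs pin, and pin vs app) and get concordance/match statistics. I also want the overall concordance among the three variables, and I want to represent the data in a plot. What is the best way to do this in R?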

mydf<-structure(c("0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0", 
         "0/0", "0/1", "0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0", 
         "0/0", "0/1"), .Dim = c(3L, 6L), .Dimnames = list(c("1", "2", 
                      "4"), c("app:x", "ora:x", "pin:x", "app:y", "ora:y", "pin:y"))) 

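We're going to need more information. Let's start with how you want to compare these values. They are character data, so the only possibility is to determine whether they are identical. Is that your intent? Also, there are 6 columns. Can you clarify which columns need to be compared? –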


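@BryanHanson Yes, the comparison is basically string matching. I may have 3 * n columns, and I want to compare each set of `app`, `ora` and `pin` columns, so it is a cumulative comparison. I just want to see their overall concordance. – MAPK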
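
To make the string-matching comparison described in this comment concrete, here is a minimal base R sketch. It assumes column names always follow the `source:set` pattern from the example (`app:x`, `ora:x`, ...) and pools all sets together. The helper `pair_concordance` is a hypothetical name introduced for illustration, not something from the original thread.

```r
# Example data from the question (a character matrix).
mydf <- structure(c("0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0",
                    "0/0", "0/1", "0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0",
                    "0/0", "0/1"), .Dim = c(3L, 6L),
                  .Dimnames = list(c("1", "2", "4"),
                                   c("app:x", "ora:x", "pin:x", "app:y", "ora:y", "pin:y")))

# Hypothetical helper: fraction of cells in which sources `a` and `b`
# hold exactly the same string, pooled over all sets (x, y, ...).
pair_concordance <- function(m, a, b) {
  src <- sub(":.*$", "", colnames(m))  # part before the colon: "app", "ora", "pin"
  grp <- sub("^.*:", "", colnames(m))  # part after the colon: "x", "y"
  ca <- m[, src == a, drop = FALSE]
  cb <- m[, src == b, drop = FALSE]
  # Align the columns of `b` to the set order of `a` before comparing.
  cb <- cb[, match(grp[src == a], grp[src == b]), drop = FALSE]
  mean(ca == cb)
}

# All unordered pairs of sources: app/ora, app/pin, ora/pin.
sources <- unique(sub(":.*$", "", colnames(mydf)))
pairs <- combn(sources, 2)
concordance <- setNames(
  apply(pairs, 2, function(p) pair_concordance(mydf, p[1], p[2])),
  apply(pairs, 2, paste, collapse = "_vs_"))
concordance
```

Each value is the proportion of matching cells for one pair of sources. Because the sketch pools every set, it gives a single number per pair rather than one per set.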


How should app:x relate to app:y? Do they need to be compared? Pooled? Does Luke's answer work? –

Answers


Well, here is one approach as a starting point (it can probably be optimized a lot; I'm not that familiar with the data.table package):

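library(data.table)       # melt(), setkey() and := come from data.table
library(splitstackshape)  # cSplit()
# Split each "a/b" value into two columns, reshape to long form, then split "app:x_1" into source and position.
# as.data.frame() is a precaution in case the installed cSplit() does not accept a matrix.
dt <- cSplit(melt(cSplit(as.data.frame(mydf), 1:6, "/")[, rowname:=rownames(mydf)], id.vars = c("rowname")), 2, ":")[] 
setkey(dt, rowname, variable_2) 
# Self-join so every source is paired with every other source for the same row and position.
dt <- dt[dt, allow.cartesian=TRUE][variable_1!=i.variable_1] 
# Keep each unordered pair of sources only once.
idx <- which(!duplicated(cbind(dt$rowname,dt$variable_2, t(apply(dt[, .(variable_1, i.variable_1)], 1, function(x) sort(x)))))) 
dt <- dt[idx, .(rowname, variable_2, variable_1, i.variable_1, isEqual=value==i.value)] 
dt 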
#  rowname variable_2 variable_1 i.variable_1 isEqual 
# 1:  1  x_1  ora   app TRUE 
# 2:  1  x_1  pin   app TRUE 
# 3:  1  x_1  pin   ora TRUE 
# 4:  1  x_2  ora   app TRUE 
# 5:  1  x_2  pin   app TRUE 
# 6:  1  x_2  pin   ora TRUE 
# 7:  1  y_1  ora   app TRUE 
# 8:  1  y_1  pin   app TRUE 
# 9:  1  y_1  pin   ora TRUE 
# 10:  1  y_2  ora   app TRUE 
# 11:  1  y_2  pin   app TRUE 
# 12:  1  y_2  pin   ora TRUE 
# 13:  2  x_1  ora   app TRUE 
# 14:  2  x_1  pin   app TRUE 
# 15:  2  x_1  pin   ora TRUE 
# 16:  2  x_2  ora   app FALSE 
# 17:  2  x_2  pin   app FALSE 
# ... 

library(ggplot2) 
ggplot(dt, aes(variable_1, i.variable_1, fill=isEqual)) + 
    geom_tile() + 
    facet_grid(rowname~variable_2) 

[Figure: tile plot of isEqual for each pair of sources (variable_1 vs i.variable_1), faceted by rowname and variable_2]


Thanks. This pairwise comparison (app vs ora, ora vs pin, and pin vs app) is correct, but I also want an overall comparison (i.e. `app` vs `ora` vs `pin`). – MAPK
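
One possible reading of the overall comparison asked for here is to call a cell concordant only when all three sources hold the same string. The base R sketch below follows that reading. It assumes the columns of each source appear in the same set order (x, then y), as they do in the example data, and it compares whole strings such as "0/1" rather than the split halves used in the answer.

```r
# Example data from the question (a character matrix).
mydf <- structure(c("0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0",
                    "0/0", "0/1", "0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0",
                    "0/0", "0/1"), .Dim = c(3L, 6L),
                  .Dimnames = list(c("1", "2", "4"),
                                   c("app:x", "ora:x", "pin:x", "app:y", "ora:y", "pin:y")))

src <- sub(":.*$", "", colnames(mydf))  # "app", "ora", "pin", ...
app <- mydf[, src == "app", drop = FALSE]
ora <- mydf[, src == "ora", drop = FALSE]
pin <- mydf[, src == "pin", drop = FALSE]

# TRUE only where app, ora and pin all agree (assumes identical set order).
all_agree <- (app == ora) & (ora == pin)
dimnames(all_agree) <- list(rownames(mydf), sub("^.*:", "", colnames(app)))

overall <- mean(all_agree)     # share of cells on which all three sources agree
per_row <- rowMeans(all_agree) # the same share, computed per row
overall
per_row
```

The logical matrix `all_agree` can also be passed to a plotting function if a visual summary of three-way agreement is wanted.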


How can we ignore comparisons that involve NA column values? – MAPK
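
A minimal sketch of one way to skip NA values, assuming missing entries are stored as real `NA` rather than as a placeholder string (a placeholder would need converting to `NA` first). In R, `==` returns `NA` whenever either side is `NA`, so `mean(..., na.rm = TRUE)` drops those comparisons. The `NA` below is inserted artificially for illustration and is not part of the original data.

```r
# Example data from the question (a character matrix).
mydf <- structure(c("0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0",
                    "0/0", "0/1", "0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0",
                    "0/0", "0/1"), .Dim = c(3L, 6L),
                  .Dimnames = list(c("1", "2", "4"),
                                   c("app:x", "ora:x", "pin:x", "app:y", "ora:y", "pin:y")))

# Artificial missing value, added only to demonstrate the NA handling.
mydf_na <- mydf
mydf_na["2", "app:x"] <- NA

src <- sub(":.*$", "", colnames(mydf_na))
app <- mydf_na[, src == "app", drop = FALSE]
ora <- mydf_na[, src == "ora", drop = FALSE]

agree <- app == ora              # NA wherever either value is NA
rate <- mean(agree, na.rm = TRUE) # concordance over non-missing comparisons only
n_used <- sum(!is.na(agree))      # number of comparisons that were actually made
rate
n_used
```

Reporting `n_used` next to the rate makes it clear how many comparisons the figure is based on.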