How to compute overlaps between dplyr::group_by members

I have the following tibble:
library(tidyverse) 
df <- tibble::tribble(
    ~gene, ~celltype, 
    "a", "cel1_1", 
    "b", "cel1_1", 
    "c", "cel1_1", 
    "a", "cell_2", 
    "b", "cell_2", 
    "c", "cell_3", 
    "d", "cell_3" 
) 

df %>% group_by(celltype) 
#> Source: local data frame [7 x 2] 
#> Groups: celltype [3] 
#> 
#> # A tibble: 7 x 2 
#> gene celltype 
#> <chr> <chr> 
#> 1  a cel1_1 
#> 2  b cel1_1 
#> 3  c cel1_1 
#> 4  a cell_2 
#> 5  b cell_2 
#> 6  c cell_3 
#> 7  d cell_3

The genes in each cell type can be summarized as follows:

cell1 a,b,c 
cell2 a,b 
cell3 c,d
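
As a rough sketch, this per-cell-type summary could be derived from the data with base R. The data is rebuilt here as a plain data.frame so the snippet runs without extra packages; `gene_sets` and `gene_summary` are illustrative names, not part of the original post.

```r
# Same rows as the tibble above, built as a base data.frame.
df <- data.frame(
  gene     = c("a", "b", "c", "a", "b", "c", "d"),
  celltype = c("cel1_1", "cel1_1", "cel1_1", "cell_2", "cell_2",
               "cell_3", "cell_3"),
  stringsAsFactors = FALSE
)

# One character vector of genes per cell type.
gene_sets <- split(df$gene, df$celltype)

# Collapse each vector into a comma-separated string for display.
gene_summary <- vapply(gene_sets, paste, character(1), collapse = ",")
print(gene_summary)
```

The result is a named character vector keyed by the cell type labels used in the data (`cel1_1`, `cell_2`, `cell_3`).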

What I want to do is compute the gene overlap between all cell types, resulting in this table:

  cell1 cell2  cell3 
cell1 3   2   1 
cell2 2   2   0 
cell3 1   0   2

How can I do this?

Update

Finally, I want to compute the percentage (each count divided by the larger denominator in the pair):

        cell1        cell2        cell3
cell1   1.00 (3/3)   0.67 (2/3)   0.33 (1/3)
cell2   0.67 (2/3)   1.00         0
cell3   0.33 (1/3)   0            1.00

I tried this, but it does not give what I want:

> tmp <- crossprod(table(df)) 
> tmp/max(tmp) 
     celltype 
celltype cel1_1 cell_2 cell_3 
    cel1_1 1.0000000 0.6666667 0.3333333 
    cell_2 0.6666667 0.6666667 0.0000000 
    cell_3 0.3333333 0.0000000 0.6666667

The diagonal should always have a value of 1.00.

Source

2017-05-29 pdubois

If I understand correctly, 'res <- tmp/max(tmp); diag(res) <- 1' – akrun
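
The one-liner in this comment divides every entry by the single largest value in the matrix and then overwrites the diagonal. For this data that gives the same numbers as the table in the update, but it would differ if two of the smaller groups shared genes. A minimal sketch of the pairwise version, assuming "denominator" means the number of genes in each cell type, might look like this (`sizes`, `denom` and `res` are illustrative names):

```r
# Same rows as the tibble in the question, built as a base data.frame.
df <- data.frame(
  gene     = c("a", "b", "c", "a", "b", "c", "d"),
  celltype = c("cel1_1", "cel1_1", "cel1_1", "cell_2", "cell_2",
               "cell_3", "cell_3"),
  stringsAsFactors = FALSE
)

# Pairwise overlap counts; the diagonal holds the size of each group.
tmp <- crossprod(table(df))
sizes <- diag(tmp)

# For each pair (i, j), the larger of the two group sizes.
denom <- outer(sizes, sizes, pmax)

# Overlap as a fraction of the larger group in each pair.
res <- tmp / denom
print(round(res, 2))
```

Because each diagonal entry is divided by its own group size, the diagonal comes out as 1 without a separate `diag(res) <- 1` step.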

We can use table with crossprod

crossprod(table(df)) 
#  celltype 
#celltype cel1_1 cell_2 cell_3 
# cel1_1  3  2  1 
# cell_2  2  2  0 
# cell_3  1  0  2
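
This works because table(df) is a gene-by-celltype matrix of 0/1 counts, so crossprod, which computes t(M) %*% M, counts for each pair of cell types the genes present in both. This relies on each gene appearing at most once per cell type. As a cross-check, here is a sketch that computes the same counts explicitly with set intersections (`gene_sets` and `overlap` are illustrative names):

```r
# Same rows as the tibble in the question, built as a base data.frame.
df <- data.frame(
  gene     = c("a", "b", "c", "a", "b", "c", "d"),
  celltype = c("cel1_1", "cel1_1", "cel1_1", "cell_2", "cell_2",
               "cell_3", "cell_3"),
  stringsAsFactors = FALSE
)

# Unique genes per cell type.
gene_sets <- lapply(split(df$gene, df$celltype), unique)

# Size of the intersection for every pair of cell types.
overlap <- sapply(gene_sets, function(x) {
  sapply(gene_sets, function(y) length(intersect(x, y)))
})
print(overlap)

# Should agree with the matrix-product approach.
print(all(overlap == crossprod(table(df))))
```

The intersect version is slower on large inputs but does not depend on the one-row-per-gene-per-cell-type assumption, since duplicates are dropped by unique.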

Or another option is tidyverse

library(tidyverse) 
count(df, gene, celltype) %>% 
     spread(celltype, n, fill = 0) %>% 
     select(-gene) %>% 
     as.matrix %>% 
     crossprod 
#  cel1_1 cell_2 cell_3 
#cel1_1  3  2  1 
#cell_2  2  2  0 
#cell_3  1  0  2

Or with data.table

library(data.table) 
crossprod(as.matrix(dcast(setDT(df), gene~celltype, length)[,-1]))

Source

2017-05-29 04:43:25 akrun

Not quite. I'm looking for the overlap, e.g. cell1 (a, b, c) and cell3 (c, d) overlap in (c), so the value for cell1, cell3 is 1. – pdubois

@pdubois This is what I get. – akrun

You're right about the expected output. My OP had a mistake; I fixed it. – pdubois
