的Perl（或R，或SQL）：統計字符串跨越

我有一個文本文件，它看起來像這樣列怎麼經常出現：的Perl（或R，或SQL）：統計字符串跨越

gene1 gene2 gene3 
a  d  c 
b  e  d 
c  f  g 
d  g  
     h 
     i

（每列是一個人的基因，每個包含可變數目的蛋白質（字符串，在這裏顯示爲字母），可以綁定到這些基因）。

我想要做的是計算每個字符串多少列代表了，輸出數量和所有的列標題，就像這樣：

a 1 gene1 
b 1 gene1 
c 2 gene1 gene3 
d 3 gene1 gene2 gene3 
e 1 gene2 
f 1 gene2 
g 2 gene2 gene3 
h 1 gene2 
i 1 gene2

我一直在試圖找出如何做到這一點在Perl和R中，但迄今沒有成功。謝謝你的幫助。

來源

2011-08-04 Stephen Turner

列是製表符分隔還是空格式？這將決定如何對待他們。 –

這個解決方案看起來有點像黑客，但它提供了所需的輸出。它依靠使用plyr和reshape包，但我相信你可以找到基本的R替代品。訣竅是，功能melt可以讓我們將數據拼湊成長格式，從而允許從這一點進行簡單的（ish）操作。

library(reshape) 
library(plyr) 

#Recreate your data 
dat <- data.frame(gene1 = c(letters[1:4], NA, NA), 
        gene2 = letters[4:9], 
        gene3 = c("c", "d", "g", NA, NA, NA) 
       ) 

#Melt the data. You'll need to update this if you have more columns 
dat.m <- melt(dat, measure.vars = 1:3) 

#Tabulate counts 
counts <- as.data.frame(table(dat.m$value)) 

#I'm not sure what to call this column since it's a smooshing of column names 
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " ")) 

#Merge the two together. You could fix the column names above, or just deal with it here 
merge(counts, otherColumn, by.x = "Var1", by.y = "value")

給出：

> merge(counts, otherColumn, by.x = "Var1", by.y = "value") 
    Var1 Freq    V1 
1 a 1    gene1 
2 b 1    gene1 
3 c 2  gene1 gene3 
4 d 3 gene1 gene2 gene3 
....

來源

2011-08-04 03:02:07 Chase

謝謝，總是喜歡R解決方案，尤其是使用** ply函數。 –

您可以簡化爲單個'ddply'調用，使用' ddply（dat.m，。（value），summary，Freq = length（variable），V1 = paste（variable，collapse =「」）） – Ramnath

如果它不是很多列，你可以在sql中做這樣的事情。您基本上將數據平鋪到蛋白質/基因的2列派生表中，然後根據需要對其進行總結。

;with cte as (
    select gene1 as protein, 'gene1' as gene 
    union select gene2 as protein, 'gene2' as gene 
    union select gene3 as protein, 'gene3' as gene 
) 

select protein, count(*) as cnt, group_concat(gene) as gene 
from cte 
group by protein

來源

2011-08-04 02:35:19

錯誤，但困難的部分正在平息數據 – ysth

謝謝。我想過在MySQL中這樣做，但我有很多列。如果需要的話，我會嘗試這個，也許寫一些Perl代碼來寫我的查詢，呃。 –

在Perl中，假設每列中的蛋白質都沒有需要刪除的重複項。（如果他們這樣做，應該使用哈希散列代替。）

use strict; 
use warnings; 

my $header = <>; 
my %column_genes; 
while ($header =~ /(\S+)/g) { 
    $column_genes{$-[1]} = "$1"; 
} 

my %proteins; 
while (my $line = <>) { 
    while ($line =~ /(\S+)/g) { 
     if (exists $column_genes{$-[1]}) { 
      push @{ $proteins{$1} }, $column_genes{$-[1]}; 
     } 
     else { 
      warn "line $. column $-[1] unexpected protein $1 ignored\n"; 
     } 
    } 
} 

for my $protein (sort keys %proteins) { 
    print join("\t", 
     $protein, 
     scalar @{ $proteins{$protein} }, 
     join(' ', sort @{ $proteins{$protein} }) 
    ), "\n"; 
}

從標準輸入讀取，寫入標準輸出。

來源

2011-08-04 02:40:26 ysth

完美。謝謝。 –

我不熟悉$ hash {$ - [1]}語法。這是幹什麼的？ –

'@ -'是一個特殊的數組，用於報告正則表達式捕獲開始的位置（'$ - [1]'，其中'$ 1'開始，'$ _ [2]'爲'$ 2'等） – ysth

甲一個襯裏（或更確切地說3襯墊）

ddply(na.omit(melt(dat, m = 1:3)), .(value), summarize, 
    len = length(variable), 
    var = paste(variable, collapse = " "))

來源

2011-08-04 03:46:13 Ramnath

在MySQL，像這樣：

select protein, count(*), group_concat(gene order by gene separator ' ') from gene_protein group by protein;

假設數據等：

create table gene_protein (gene varchar(255) not null, protein varchar(255) not null); 
insert into gene_protein values ('gene1','a'),('gene1','b'),('gene1','c'),('gene1','d'); 
insert into gene_protein values ('gene2','d'),('gene2','e'),('gene2','f'),('gene2','g'),('gene2','h'),('gene2','i'); 
insert into gene_protein values ('gene3','c'),('gene3','d'),('gene3','g');

來源

2011-08-04 04:31:32 ysth

的Perl（或R，或SQL）：統計字符串跨越

回答

相關問題