2011-08-04 23 views
9

我有一個文本文件,它看起來像這樣列怎麼經常出現:的Perl(或R,或SQL):統計字符串跨越

gene1 gene2 gene3 
a  d  c 
b  e  d 
c  f  g 
d  g  
     h 
     i 

(每列是一個人的基因,每個包含可變數目的蛋白質(字符串,在這裏顯示爲字母),可以綁定到這些基因)。

我想要做的是計算每個字符串多少列代表了,輸出數量和所有的列標題,就像這樣:

a 1 gene1 
b 1 gene1 
c 2 gene1 gene3 
d 3 gene1 gene2 gene3 
e 1 gene2 
f 1 gene2 
g 2 gene2 gene3 
h 1 gene2 
i 1 gene2 

我一直在試圖找出如何做到這一點在Perl和R中,但迄今沒有成功。謝謝你的幫助。

+0

列是製表符分隔還是空格式?這將決定如何對待他們。 –

回答

8

這個解決方案看起來有點像黑客,但它提供了所需的輸出。它依靠使用plyrreshape包,但我相信你可以找到基本的R替代品。訣竅是,功能melt可以讓我們將數據拼湊成長格式,從而允許從這一點進行簡單的(ish)操作。

library(reshape) 
library(plyr) 

#Recreate your data 
dat <- data.frame(gene1 = c(letters[1:4], NA, NA), 
        gene2 = letters[4:9], 
        gene3 = c("c", "d", "g", NA, NA, NA) 
       ) 

#Melt the data. You'll need to update this if you have more columns 
dat.m <- melt(dat, measure.vars = 1:3) 

#Tabulate counts 
counts <- as.data.frame(table(dat.m$value)) 

#I'm not sure what to call this column since it's a smooshing of column names 
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " ")) 

#Merge the two together. You could fix the column names above, or just deal with it here 
merge(counts, otherColumn, by.x = "Var1", by.y = "value") 

給出:

> merge(counts, otherColumn, by.x = "Var1", by.y = "value") 
    Var1 Freq    V1 
1 a 1    gene1 
2 b 1    gene1 
3 c 2  gene1 gene3 
4 d 3 gene1 gene2 gene3 
.... 
+0

謝謝,總是喜歡R解決方案,尤其是使用** ply函數。 –

+2

您可以簡化爲單個'ddply'調用,使用' ddply(dat.m,。(value),summary,Freq = length(variable),V1 = paste(variable,collapse =「」)) – Ramnath

1

如果它不是很多列,你可以在sql中做這樣的事情。您基本上將數據平鋪到蛋白質/基因的2列派生表中,然後根據需要對其進行總結。

;with cte as (
    select gene1 as protein, 'gene1' as gene 
    union select gene2 as protein, 'gene2' as gene 
    union select gene3 as protein, 'gene3' as gene 
) 

select protein, count(*) as cnt, group_concat(gene) as gene 
from cte 
group by protein 
+1

錯誤,但困難的部分正在平息數據 – ysth

+0

謝謝。我想過在MySQL中這樣做,但我有很多列。如果需要的話,我會嘗試這個,也許寫一些Perl代碼來寫我的查詢,呃。 –

6

在Perl中,假設每列中的蛋白質都沒有需要刪除的重複項。 (如果他們這樣做,應該使用哈希散列代替。)

use strict; 
use warnings; 

my $header = <>; 
my %column_genes; 
while ($header =~ /(\S+)/g) { 
    $column_genes{$-[1]} = "$1"; 
} 

my %proteins; 
while (my $line = <>) { 
    while ($line =~ /(\S+)/g) { 
     if (exists $column_genes{$-[1]}) { 
      push @{ $proteins{$1} }, $column_genes{$-[1]}; 
     } 
     else { 
      warn "line $. column $-[1] unexpected protein $1 ignored\n"; 
     } 
    } 
} 

for my $protein (sort keys %proteins) { 
    print join("\t", 
     $protein, 
     scalar @{ $proteins{$protein} }, 
     join(' ', sort @{ $proteins{$protein} }) 
    ), "\n"; 
} 

從標準輸入讀取,寫入標準輸出。

+0

完美。謝謝。 –

+0

我不熟悉$ hash {$ - [1]}語法。這是幹什麼的? –

+2

'@ -'是一個特殊的數組,用於報告正則表達式捕獲開始的位置('$ - [1]',其中'$ 1'開始,'$ _ [2]'爲'$ 2'等) – ysth

5

甲一個襯裏(或更確切地說3襯墊)

ddply(na.omit(melt(dat, m = 1:3)), .(value), summarize, 
    len = length(variable), 
    var = paste(variable, collapse = " ")) 
1

在MySQL,像這樣:

select protein, count(*), group_concat(gene order by gene separator ' ') from gene_protein group by protein; 

假設數據等:

create table gene_protein (gene varchar(255) not null, protein varchar(255) not null); 
insert into gene_protein values ('gene1','a'),('gene1','b'),('gene1','c'),('gene1','d'); 
insert into gene_protein values ('gene2','d'),('gene2','e'),('gene2','f'),('gene2','g'),('gene2','h'),('gene2','i'); 
insert into gene_protein values ('gene3','c'),('gene3','d'),('gene3','g'); 
相關問題