Count duplicate IDs in a column and sum the values in awk or R


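My input looks like the example below. I want to create two new columns: one with the number of times each gene name occurs, and one with the sum of its values. Can anyone help?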

Input:

gene1 5 
gene1 4 
gene2 7 
gene3 6 
gene3 2 
gene3 3

Expected output:

gene1 2 9 
gene2 1 7 
gene3 3 11

Data:

dd <- read.table(header = FALSE, stringsAsFactors = FALSE, text="gene1 5 
gene1 4 
gene2 7 
gene3 6 
gene3 2 
gene3 3")

Source

2015-12-15 Rashedul Islam

Please provide your input in a simple, reproducible form, for example by using `dput` in R. Also, what have you tried? – iled

`aggregate(dd, by = dd['V1'], function(x) if (is.numeric(x)) sum(x) else length(x))` – rawr

awk 'BEGIN {print "Gene\tCount\tSum"} {a[$1]+=$2;b[$1]++} END {for (i in a) {print i"\t"b[i]"\t"a[i]}}' file 

Gene Count Sum 
gene1 2 9 
gene2 1 7 
gene3 3 11
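
One caveat: awk visits the keys of `for (i in a)` in an unspecified order, so the rows are not guaranteed to come out sorted as shown above. A minimal sketch, assuming whitespace-separated input on stdin and using a hypothetical helper name `count_and_sum` (not part of the original answer), that drops the header line and pipes the result through `sort`:

```shell
# Hypothetical helper: count rows and sum column 2 per gene, reading stdin.
count_and_sum() {
    awk '{ cnt[$1]++; sum[$1] += $2 }
         END { for (gene in cnt) print gene "\t" cnt[gene] "\t" sum[gene] }' |
    sort -k1,1   # for-in order is unspecified in awk, so sort by gene name
}

# Sample data from the question
printf 'gene1 5\ngene1 4\ngene2 7\ngene3 6\ngene3 2\ngene3 3\n' | count_and_sum
```

With the sample data this prints the three tab-separated rows `gene1 2 9`, `gene2 1 7` and `gene3 3 11`, in that order. The header is left out here because it would otherwise be sorted along with the data rows.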

Source

2015-12-15 21:11:52 user2138595

Using meaningful variable names makes it clearer: `awk '{cnt[$1]++; sum[$1]+=$2} END {for (gene in cnt) print gene, cnt[gene], sum[gene]}' file` –

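This is exactly the kind of thing dplyr was made for. The pipe operator also makes the syntax easy to read. Replace `col1` and `col2` in the code below with the corresponding column names in your data: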

library('dplyr') 
df %>% group_by(col1) %>% 
    summarise(count=n(), 
    sum=sum(col2))

Source

2015-12-15 20:46:36 mtoto

Please provide actual, reproducible code. For details, see this question.

First, we create the test data:

#libraries 
library(stringr);library(plyr) 

#test data 
df = data.frame(gene = str_c("gene", c(1, 1, 2, rep(3, 3))), 
       count = c(5, 4, 7, 6, 2, 3))

Then we summarize with ddply from the plyr package:

#ddply 
ddply(df, .(gene), summarize, 
     gene_count = length(count), 
     sum = sum(count) 
)

What this does is take the data.frame, split it by the values of `gene`, and then summarize each piece in the two desired ways. See Hadley's introduction to the split, apply and combine route.

Result:

   gene gene_count sum
1 gene1          2   9
2 gene2          1   7
3 gene3          3  11

There are many other ways to do this.
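
As one illustration of the split, apply and combine idea outside R, here is a minimal awk sketch. It assumes whitespace-separated input on stdin; the helper name `grouped_summary` and the variable names are hypothetical and not taken from any answer here. Sorting first puts each gene's rows next to each other (split), a running count and sum are kept per group (apply), and one line is emitted whenever the gene changes (combine):

```shell
# Hypothetical helper: stream over rows grouped by gene, one summary per group.
grouped_summary() {
    sort -k1,1 |
    awk '
        # A new gene closes the previous group: emit its summary and reset
        NR > 1 && $1 != gene { print gene, n, total; n = 0; total = 0 }
        # Update the running count and sum for the current group
        { gene = $1; n++; total += $2 }
        # Emit the last group, unless the input was empty
        END { if (NR > 0) print gene, n, total }
    '
}

# Sample data from the question, deliberately out of order
printf 'gene3 6\ngene1 5\ngene2 7\ngene3 2\ngene1 4\ngene3 3\n' | grouped_summary
```

For the sample data this prints `gene1 2 9`, `gene2 1 7` and `gene3 3 11`. Unlike the array-based approach, only one group is held in memory at a time, at the cost of sorting the input first.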

Source

2015-12-15 20:50:09 Deleet

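You load stringr just to do this? `paste0('gene', c(1, 1, 2, rep(3, 3)))` – rawr

I always use stringr, except in cases where I need the matching values returned, because for some reason that is not supported. In that case I fall back to `grep`. – Deleet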
