2013-07-24 16 views
0

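Merging lines in Linux

If I have an input file like the one below, is there any command or other way in Linux to convert it into the desired file shown after it?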

Input file:

Column_1  Column_2 
scaffold_A SNP_marker1 
scaffold_A SNP_marker2 
scaffold_A SNP_marker3 
scaffold_A SNP_marker4 
scaffold_B SNP_marker5 
scaffold_B SNP_marker6 
scaffold_B SNP_marker7 
scaffold_C SNP_marker8 
scaffold_A SNP_marker9 
scaffold_A SNP_marker10 

Desired output file:

Column_1  Column_2 
scaffold_A SNP_marker1;SNP_marker2;SNP_marker3;SNP_marker4 
scaffold_B SNP_marker5;SNP_marker6;SNP_marker7 
scaffold_C SNP_marker8 
scaffold_A SNP_marker9;SNP_marker10 

I tried using grep, uniq, etc., but still could not figure out how to get this to work.

+0

Is perl an option? – urzeit

+1

Wait, scaffold_A appears twice in your output. What decides whether a given marker should go into the first or the second entry? –

+1

@SF. It seems the OP wants the output grouped by Column_1, but only within the existing groups (consecutive runs of the same value). –

Answers

2

Perl solution:

perl -lane 'sub output {
                print "$last\t", join ";", @buff;
            }
            $last //= $F[0];
            if ($F[0] ne $last) {
                output();
                undef @buff;
                $last = $F[0];
            }
            push @buff, $F[1];
            }{ output();'
0

awk solution, in a bash script:

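#!/bin/bash

# Merge consecutive lines that share the same first column.
awk '
BEGIN {
    str = ""    # first-column value of the group being printed
}
{
    if (str != $1) {
        # New group: end the previous output line (except before the first one)
        if (NR != 1) {
            printf("\n")
        }
        str = $1
        printf("%s\t%s", $1, $2)
    } else {
        # Same group: append the marker to the current output line
        printf(";%s", $2)
    }
}
END {
    printf("\n")
}' your_file.txt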
2

Python solution (assuming the filenames are passed on the command line):

from __future__ import print_function  # not needed with Python 3
import sys

# Expects two arguments: the input filename and the output filename
with open(sys.argv[1]) as infile, open(sys.argv[2], 'w') as outfile:
    outfile.write(infile.readline())  # transfer the header
    col_one, col_two = infile.readline().split()
    col_two = [col_two]  # make it a list
    for line in infile:
        data = line.split()
        if col_one != data[0]:
            # first column changed: write out the finished group
            print("{}\t{}".format(col_one, ';'.join(col_two)), file=outfile)
            col_one = data[0]
            col_two = [data[1]]
        else:
            col_two.append(data[1])
    # write out the last group
    print("{}\t{}".format(col_one, ';'.join(col_two)), file=outfile)
+0

Works great!!!!! But there is one small bug. The output generated by the script is slightly different: Column_1 Column_2 scaffold_A SNP_marker1; Scaffold_A SNP_marker2; SNP_marker3; SNP_marker4 scaffold_B SNP_marker5; SNP_marker6; SNP_marker7 scaffold_C SNP_marker8 scaffold_A SNP_marker9; SNP_marker10 – amine

0

You can also try the following solution in bash:

cat input.txt | while read L; do y=`echo $L | cut -f1 -d' '`; { test "$x" = "$y" && echo -n ";`echo $L | cut -f2 -d' '`"; } || { x="$y";echo -en "\n$L"; }; done 

Or, in a more human-readable form for review:

cat input.txt | while read L; 
do 
    y=`echo $L | cut -f1 -d' '`; 
    { 
    test "$x" = "$y" && echo -n ";`echo $L | cut -f2 -d' '`"; 
    } || 
    { 
    x="$y";echo -en "\n$L"; 
    }; 
done 

Note that the nicely formatted output produced when the script runs depends on bash's built-in echo command (its -n and -e options).

+0

There is a [similar solution to a similar question](http://stackoverflow.com/questions/17897255/how-to-merge-similar-lines-in-linux/18018828#18018828), just to keep similar things close together – rook

0

If you don't mind using Python, it has itertools.groupby, which serves exactly this purpose:

# file: combine.py
import itertools

with open('data.txt') as f:
    # one list of columns per non-blank line
    data = [row.split() for row in f if row.strip()]

for column1, rows_group in itertools.groupby(data, key=lambda row: row[0]):
    # print() with parentheses is Python 3 syntax
    print(column1, ';'.join(column2 for _, column2 in rows_group))

Save this script as combine.py. Assuming your input file is data.txt, run it to get your desired output:

python combine.py 

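Discussion

  • The result of the with open(...) block is data, a list of rows, where each row is itself a list of columns.
  • The itertools.groupby function takes an iterable, in this case a list. You tell it how to group the rows together by giving it a key, which here is column1.
  • rows_group is an iterator over the consecutive rows that share the same column1 value.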
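
The grouping behaviour also explains why scaffold_A shows up twice in the desired output: itertools.groupby only merges adjacent rows with equal keys, and it neither sorts the rows nor collects every row with the same key. A minimal sketch of this, where the helper name merge_runs and the shortened sample rows are made up for illustration and are not part of the answer above:

```python
import itertools


def merge_runs(rows):
    """Join the second column of consecutive rows that share the first column.

    `rows` is an iterable of (column1, column2) pairs. `merge_runs` is a
    hypothetical helper name used only for this illustration.
    """
    merged = []
    for key, group in itertools.groupby(rows, key=lambda row: row[0]):
        # `group` yields only the adjacent rows whose first column equals `key`
        merged.append((key, ';'.join(col2 for _, col2 in group)))
    return merged


# Shortened sample: scaffold_A appears in two separate runs
sample = [
    ('scaffold_A', 'SNP_marker1'),
    ('scaffold_A', 'SNP_marker2'),
    ('scaffold_B', 'SNP_marker5'),
    ('scaffold_A', 'SNP_marker9'),
]

for key, markers in merge_runs(sample):
    print(key, markers)
```

With this sample, the two scaffold_A runs stay separate because a scaffold_B row sits between them. If all rows with the same key should end up on one line instead, the rows would need to be sorted by the first column before being passed to groupby.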