2014-02-12 35 views
0

我想根據第一列的相等性摺疊行。然後將第二列的內容添加到新的摺疊表中,以逗號分隔並添加額外空間。另外,如果第二列的內容相同,則摺疊它們,也就是說,如果輸出文件中出現兩次「非劇毒」,則只顯示一次。基於相同的鍵摺疊行

我在這裏很新,請解釋如何運行它。希望任何人都可以幫助我!

輸入(製表符分隔):

HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung 

希望的輸出(製表符分隔):

HS372_01446 non-virulent, lung 
HS372_00498 non-virulent, lung 
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
+1

爲什麼有些你的輸出行(最後1)在逗號和其他字符(前2)之後是否有空格? –

+0

嗨,埃德,這是一個錯誤。逗號後加空格。 – biotech

回答

2
的Perl

從命令行,

perl -lane' 
    ($n, $p) [email protected]; 
    $s{$n}++ or push @r, $n; 
    $c{$n}{$p}++ or push @{$h{$n}}, $p; 
    END { 
    $" = ",\t"; 
    print "$_\[email protected]{$h{$_}}" for @r; 
    } 
' file 

輸出

HS372_01446  non-virulent, lung 
HS372_00498  non-virulent, lung 
HS372_00954  jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
+2

這是我見過的最多的一行代碼! – Borodin

1
from collections import defaultdict 

a = """HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung""".split("\n") 

stuff = defaultdict(set) 

for line in a: 
    uid, symp = line.split(" ") 
    stuff[uid].add(symp) 

for uid, symps in stuff.iteritems(): 
    print "%s %s" % (uid, ", ".join(list(symps))) 
+0

如何運行? – biotech

+0

@popnard:Python – Matthias

+0

回溯(最近呼叫最後): 文件「腳本。PY 「22行,在 UID,SYMP = line.split(」「) ValueError異常:需要比1點的值更解壓 – biotech

0

在perl的:

use warnings; 
use strict; 

open my $input, '<', 'in.txt'; 

my %hash; 
while (<$input>){ 
    chomp; 
    my @split = split(' '); 
    $hash{$split[0]}{$split[1]} = 1; 
} 

for my $key (keys %hash){ 
    print "$key\t"; 
     for my $info (keys $hash{$key}){ 
      print "$info\t"; 
     } 
    print "\n"; 
} 

哪個打印:

HS372_01446 non-virulent lung  
HS372_00954 non-virulent moderadamentevirulenta(nose) jointlungCNS lung  
HS372_00498 non-virulent lung 
+0

Bernardos-的MacBook-PRO:2014_02_12_membrane_genes_PHOBIUS貝爾納$ ./ script.pl 鍵值的參數1的類型必須是./script.pl第16行中的散列或數組(不是散列元素),在「})」 附近執行./script.pl由於編譯錯誤而中止 – biotech

+0

@ popnard - 複製並粘貼更新 – fugu

2

另一個Perl的溶液:

#!/usr/bin/perl 
use strict; 
use warnings; 
use List::MoreUtils qw/uniq/; 

my %hash; 
while (<DATA>) 
{ 
    chomp; 
    my ($key, $value) = split; 
    push @{$hash{$key}}, $value; 
} 

while (my ($key, $values) = each %hash) 
{ 
    print "$key\t", join ', ', uniq @$values, "\n"; 
} 

__DATA__ 
HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung 
+0

令人討厭的變量名稱:'%hash'與'$ scalar'一樣有用''List :: MoreUtils'不是核心模塊,可能需要安裝,'chomp'沒有任何意義,因爲'split'會忽略任何空格。'「\ t」'很少用,因爲它只對最最小的一組數據。但+1,因爲這接近於未分類輸出的最佳解決方案。 – Borodin

1

爪哇:

javac的Collapse.java

的Java收起input.txt中

import java.io.*; 
import java.util.*; 

public class Collapse { 

    public static void main(String[] args) throws Exception { 
     BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]))); 

     Map<String, Set<String>> output = new HashMap<String, Set<String>>(); 
     String line; 
     while ((line = br.readLine()) != null) { 
      StringTokenizer st = new StringTokenizer(line, "\t"); 
      String key = st.nextToken(); 
      Set<String> set = output.get(key); 
      if (set == null) { 
       output.put(key, set = new LinkedHashSet<String>()); 
      } 
      set.add(st.nextToken()); 
     } 

     for (String key : output.keySet()) { 
      StringBuilder sb = new StringBuilder(); 
      for (String value : output.get(key)) { 
       if (sb.length() != 0) sb.append(", "); 
       sb.append(value); 
      } 
      System.out.println(key + "\t" + sb); 
     } 
    } 
} 
0

如果您的數據來自一個MySQL數據庫(你可以將它導入一個),你可以使用group_concat操作。

看到這個答案 Can I concatenate multiple MySQL rows into one field?

這目前標有431個upvotes,所以你的問題是一個非常普遍的問題,答案顯示出非常優雅的解決方案。

2

這確實你問什麼,除了保持在相同的順序,它們出現在文件中的ID和描述中,如果該事項:

use strict; 
use warnings; 

open my $fh, '<', 'diseases.txt'; 

my %diseases; 
my @ids; 

while (<$fh>) { 
    my ($id, $desc) = split; 
    if (not $diseases{$id}) { 
    $diseases{$id}{list} = [$desc]; 
    $diseases{$id}{seen}{$desc} = 1; 
    push @ids, $id; 
    } 
    elsif (not $diseases{$id}{seen}{$desc}) { 
    push @{ $diseases{$id}{list} }, $desc; 
    $diseases{$id}{seen}{$desc} = 1; 
    } 
} 

for my $id (@ids) { 
    printf "%s %s\n", $id, join ', ', @{ $diseases{$id}{list} }; 
} 

輸出

HS372_01446 non-virulent, lung 
HS372_00498 non-virulent, lung 
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
1

解析文本文件的標準UNIX工具AWK:

$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]}' file 
HS372_00498  non-virulent, lung 
HS372_00954  jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
HS372_01446  non-virulent, lung