Collapse rows based on the same key

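I want to collapse rows that share the same value in the first column. The second-column values of each group should then be joined into the collapsed row, separated by a comma followed by a space. Duplicate second-column values should also be collapsed: if "non-virulent" appears twice for the same key, it should be shown only once in the output.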

I am new to this, so please also explain how to run the solution. I hope somebody can help me!

Input (tab-separated):

HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung

Desired output (tab-separated):

HS372_01446 non-virulent, lung 
HS372_00498 non-virulent, lung 
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

2014-02-12 biotech

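Why do some of your output lines (the last one) have a space after the commas while the others (the first two) do not? –

Hi Ed, that was a mistake. There should be a space after each comma. – biotech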

Perl

From the command line:

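perl -lane '
    ($n, $p) = @F;                        # first two whitespace-separated fields
    $s{$n}++ or push @r, $n;              # remember each key once, in first-seen order
    $c{$n}{$p}++ or push @{$h{$n}}, $p;   # remember each value once per key
    END {
        $" = ", ";                        # separator used when an array is interpolated
        print "$_\t@{$h{$_}}" for @r;
    }
' file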

Output

HS372_01446  non-virulent, lung 
HS372_00498  non-virulent, lung 
HS372_00954  jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

2014-02-12 10:26:17

That is the most lines I have ever seen in a one-liner! – Borodin

from collections import defaultdict 

a = """HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung""".split("\n") 

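# Inner dicts act as ordered sets: dict keys keep insertion order (Python 3.7+)
stuff = defaultdict(dict)

for line in a:
    fields = line.split()  # split on any whitespace, so tabs and trailing spaces are fine
    if len(fields) != 2:
        continue  # skip blank or malformed lines
    uid, symp = fields
    stuff[uid][symp] = True

for uid, symps in stuff.items():
    print("%s\t%s" % (uid, ", ".join(symps)))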

2014-02-12 10:05:13

How do I run it? – biotech

@popnard: Python – Matthias

Traceback (most recent call last): File "script.py", line 22, in <module> uid, symp = line.split(" ") ValueError: need more than 1 value to unpack – biotech
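
The ValueError in the comment above is consistent with the input being tab-separated, as the question states: split(" ") splits only on literal spaces, so a line without any space comes back as a single field and cannot be unpacked into two names. A minimal Python 3 sketch of the difference, using one sample line from the question (the variable names are illustrative only):

```python
# One tab-separated line from the question's input.
line = "HS372_01446\tnon-virulent"

# split(" ") splits on single space characters only, so the tab stays inside the field.
by_space = line.split(" ")

# split() with no argument splits on any run of whitespace (tabs, spaces, newlines).
by_whitespace = line.split()

print(len(by_space))   # one field, hence "need more than 1 value to unpack"
print(by_whitespace)   # two fields: the key and the value
```

An empty line behaves the same way with split(" "), so blank lines at the end of the file are another possible trigger.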

In Perl:

use warnings; 
use strict; 

open my $input, '<', 'in.txt' or die "Cannot open in.txt: $!"; 

my %hash; 
while (<$input>){ 
    chomp; 
    my @split = split(' '); 
    $hash{$split[0]}{$split[1]} = 1; 
} 

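for my $key (keys %hash){ 
    print "$key\t"; 
    # Dereference the inner hash explicitly; "keys $hash{$key}" fails to compile on most Perl versions
    for my $info (keys %{ $hash{$key} }){ 
        print "$info\t"; 
    } 
    print "\n"; 
}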

Which prints:

HS372_01446 non-virulent lung  
HS372_00954 non-virulent moderadamentevirulenta(nose) jointlungCNS lung  
HS372_00498 non-virulent lung
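
The rows and values above come out in a different order from the input because Perl hashes do not preserve insertion order. If a deterministic order is wanted but the original input order does not matter, one option is to sort first and then group. The following is a rough Python 3 sketch of that idea rather than a translation of the Perl script; the sample lines and variable names are assumptions made for illustration:

```python
from itertools import groupby

# A few tab-separated sample lines in the style of the question's input.
lines = [
    "HS372_01446\tnon-virulent",
    "HS372_01446\tnon-virulent",
    "HS372_01446\tlung",
    "HS372_00498\tlung",
    "HS372_00498\tnon-virulent",
]

# A set removes duplicate (key, value) pairs; sorting makes equal keys adjacent.
pairs = sorted({tuple(line.split()) for line in lines})

# groupby only merges consecutive items, which is why the sort above is required.
collapsed = [
    (key, ", ".join(value for _, value in group))
    for key, group in groupby(pairs, key=lambda pair: pair[0])
]

for key, values in collapsed:
    print(key + "\t" + values)
```

Both the keys and the values within each row end up in sorted order, so this does not reproduce the order requested in the question.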

2014-02-12 10:14:01 fugu

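Bernardos-MacBook-Pro:2014_02_12_membrane_genes_PHOBIUS bernardo$ ./script.pl Type of arg 1 to keys must be hash or array (not hash element) at ./script.pl line 16, near "})" Execution of ./script.pl aborted due to compilation errors. – biotech

@popnard - copy and paste the update – fugu

Another Perl solution: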

#!/usr/bin/perl 
use strict; 
use warnings; 
use List::MoreUtils qw/uniq/; 

my %hash; 
while (<DATA>) 
{ 
    chomp; 
    my ($key, $value) = split; 
    push @{$hash{$key}}, $value; 
} 

while (my ($key, $values) = each %hash) 
{ 
    # Parenthesise join so that "\n" is printed after the list, not deduplicated and joined into it
    print "$key\t", join(', ', uniq @$values), "\n"; 
} 

__DATA__ 
HS372_01446 non-virulent 
HS372_01446 non-virulent 
HS372_01446 lung 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 non-virulent 
HS372_00498 lung 
HS372_00498 lung 
HS372_00954 jointlungCNS 
HS372_00954 non-virulent 
HS372_00954 non-virulent 
HS372_00954 moderadamentevirulenta(nose) 
HS372_00954 lung

2014-02-12 10:27:07 Chris

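Nasty variable names: `%hash` is about as useful as `$scalar`. `List::MoreUtils` isn't a core module and may need installing. `chomp` is pointless, because `split` ignores any whitespace anyway. `"\t"` is rarely useful, as it only lines things up for the smallest sets of data. But +1, because this is close to the best solution for unsorted output. – Borodin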

Java:

javac Collapse.java

java Collapse input.txt

import java.io.*; 
import java.util.*; 

public class Collapse { 

    public static void main(String[] args) throws Exception { 
     BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]))); 

     // LinkedHashMap keeps the keys in first-seen order, matching the desired output
     Map<String, Set<String>> output = new LinkedHashMap<String, Set<String>>(); 
     String line; 
     while ((line = br.readLine()) != null) { 
      StringTokenizer st = new StringTokenizer(line, "\t"); 
      String key = st.nextToken(); 
      Set<String> set = output.get(key); 
      if (set == null) { 
       output.put(key, set = new LinkedHashSet<String>()); 
      } 
      set.add(st.nextToken()); 
     } 

     for (String key : output.keySet()) { 
      StringBuilder sb = new StringBuilder(); 
      for (String value : output.get(key)) { 
       if (sb.length() != 0) sb.append(", "); 
       sb.append(value); 
      } 
      System.out.println(key + "\t" + sb); 
     } 
    } 
}

2014-02-12 10:37:37 gabor

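If your data comes from a MySQL database (or you can import it into one), you can use the GROUP_CONCAT operation.

See this answer: Can I concatenate multiple MySQL rows into one field?

It currently has 431 upvotes, so yours is a very common problem, and the answers there show very elegant solutions.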
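
To give a feel for the database approach without a MySQL server, here is a small Python 3 sketch that uses SQLite's group_concat from the standard sqlite3 module as a stand-in. The table name observations and the column names gene and label are made up for this example. SQLite does not accept DISTINCT together with a custom separator, so duplicates are removed in a subquery first:

```python
import sqlite3

# A few (key, value) rows in the style of the question's input.
rows = [
    ("HS372_01446", "non-virulent"),
    ("HS372_01446", "non-virulent"),
    ("HS372_01446", "lung"),
    ("HS372_00498", "non-virulent"),
    ("HS372_00498", "lung"),
    ("HS372_00498", "lung"),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE observations (gene TEXT, label TEXT)")
conn.executemany("INSERT INTO observations (gene, label) VALUES (?, ?)", rows)

# Remove duplicate pairs in the subquery, then join the labels of each gene.
# In MySQL the same idea is written GROUP_CONCAT(DISTINCT label SEPARATOR ', ').
query = """
    SELECT gene, group_concat(label, ', ')
    FROM (SELECT DISTINCT gene, label FROM observations)
    GROUP BY gene
"""
result = dict(conn.execute(query).fetchall())

for gene, labels in result.items():
    print(gene + "\t" + labels)
```

The order of the labels inside each joined string is not guaranteed by SQLite, so this fits best when the order of the values does not matter.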

2014-02-12 10:39:50 knb

This does what you ask, and also keeps the IDs and descriptions in the same order as they appear in the file, in case that matters:

use strict; 
use warnings; 

open my $fh, '<', 'diseases.txt' or die "Cannot open diseases.txt: $!"; 

my %diseases; 
my @ids; 

while (<$fh>) { 
    my ($id, $desc) = split; 
    if (not $diseases{$id}) { 
        $diseases{$id}{list} = [$desc]; 
        $diseases{$id}{seen}{$desc} = 1; 
        push @ids, $id; 
    } 
    elsif (not $diseases{$id}{seen}{$desc}) { 
        push @{ $diseases{$id}{list} }, $desc; 
        $diseases{$id}{seen}{$desc} = 1; 
    } 
} 

for my $id (@ids) { 
    printf "%s %s\n", $id, join ', ', @{ $diseases{$id}{list} }; 
}

Output

HS372_01446 non-virulent, lung 
HS372_00498 non-virulent, lung 
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

2014-02-12 11:29:37 Borodin

awk is the standard UNIX tool for parsing text files:

$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]}' file 
HS372_00498  non-virulent, lung 
HS372_00954  jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung 
HS372_01446  non-virulent, lung
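
For readers who do not know awk, the same "seen" idiom can be sketched in Python 3. This is an illustrative port under some assumptions, not part of the original answer: the function name collapse and the sample list are invented for the example. Unlike awk's for (i in a), whose iteration order is unspecified, a Python 3.7+ dict keeps keys in first-seen order:

```python
def collapse(lines):
    """Collapse 'key<TAB>value' lines into one 'key<TAB>v1, v2' line per key."""
    seen = set()   # (key, value) pairs already handled, like awk's seen[$1,$2]
    merged = {}    # key -> joined values so far, like awk's a[$1]
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        key, value = fields[0], fields[1]
        if (key, value) in seen:
            continue  # duplicate value for this key
        seen.add((key, value))
        if key in merged:
            merged[key] += ", " + value
        else:
            merged[key] = value
    return [key + "\t" + values for key, values in merged.items()]


sample = [
    "HS372_01446\tnon-virulent",
    "HS372_01446\tnon-virulent",
    "HS372_01446\tlung",
    "HS372_00498\tnon-virulent",
    "HS372_00498\tlung",
    "HS372_00498\tlung",
]

for row in collapse(sample):
    print(row)
```

To try it, save the block under any file name, for example collapse.py (a hypothetical name), and run python3 collapse.py. To process a real file, pass an open file object to collapse instead of the sample list.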

2014-02-12 12:22:25
