Combining lines with matching keys

I have a text file with the following structure:

ID,operator,a,b,c,d,true 
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1 
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1 
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1 
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1 
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3 
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3 
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3 
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3 
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1 
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1 
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1 
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1 
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1 
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1 
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1 
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1

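Each ID corresponds to a dataset that is analysed multiple times by operators, i.e. J1 and J2 are operator J's first and second attempts. The measures a, b, c and d use 4 slightly different algorithms to measure a value whose true value is given in the column true.

What I would like to do is create 3 new text files comparing the results for J1 vs J2, S1 vs S2 and J1 vs S1. Example output for J1 vs J2: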

ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true 
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1 
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3

where a1 is the J1 measurement of a, and so on.

Another example, for S1 vs S2:

ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true 
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1 
WCBP12234,81.8,66.6,82.7,67.9,67.0,53,87.5,70.7,75.3

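The IDs will not be in alphanumerical order, nor will the rows for the same ID be clustered together. I'm unsure how best to approach this task: using Linux tools, or a scripting language like perl/python.

My initial attempts using Linux tools quickly hit a brick wall.

First find all unique IDs (sorted):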

awk -F, '/^WCBP/ {print $1}' test.txt | sort -k 1.5n | uniq > unique_ids

Loop through these IDs and sort J1, J2:

foreach i (`more unique_ids`) 
    grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2 
end

This gives me the sorted data:

WCBP12234,J1,63.7,67.7,72.2,71.6,75.3 
WCBP12234,J2,68.6,68.4,41.4,68.9,80.4 
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1 
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1 
WCBP12238,J1,78.6,79.0,56.2,82.1,82.1 
WCBP12238,J2,75.1,75.2,54.3,76.4,82.1 
WCBP12239,J1,70.9,71.3,66.0,73.7,75.3 
WCBP12239,J2,66.6,72.9,79.5,76.6,75.3

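I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional pipe to awk in the foreach loop:

awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'

Any ideas? I'm sure this can be done in a less cumbersome manner using awk, although it may be better to use a proper scripting language.

Source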

2013-06-24 moadeep

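In your second code block, did you intend for '82.1' to appear only once? Sorry, I re-read the question and see that it is the 'true' value. – icedwater

Also, why do you need J1 vs J2 twice? – icedwater

icedwater is referring to the part where you say you want 'J1 vs J2, S1 vs S2 and J1 vs J2' results. – TLP

You can use the Perl CSV module Text::CSV to extract the fields, and then store them in a hash where the ID is the primary key, the second field is the secondary key, and the whole row is stored as the value. It should then be trivial to do whatever comparisons you want. If you want to retain the original order of the lines, you can record the IDs in an array inside the first loop.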

use strict; 
use warnings; 
use Text::CSV; 

my %data; 
my $csv = Text::CSV->new({ 
      binary => 1,  # safety precaution 
      eol => $/,  # important when using $csv->print() 
    }); 
while (my $row = $csv->getline(*ARGV)) { 
    my ($id, $J) = @$row; # first two fields 
    $data{$id}{$J} = $row; # store line 
}
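
For readers who prefer Python, the same two-level lookup (ID first, then operator) can be sketched roughly as below. This is only an illustration under a few assumptions: the names `SAMPLE` and `load_rows` are made up for the example, the input has a header line that should be skipped (the Perl loop above does not skip it), and a plain dict is enough to keep the original ID order because Python dicts preserve insertion order.

```python
import csv
import io

# Hypothetical sample; in practice this would be an open file handle.
SAMPLE = """ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
"""


def load_rows(handle):
    """Return {id: {operator: full_row}}, similar to $data{$id}{$J} = $row."""
    data = {}
    reader = csv.reader(handle)
    next(reader, None)  # assumption: the first line is a header, so skip it
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        row = [field.strip() for field in row]
        data.setdefault(row[0], {})[row[1]] = row
    return data


data = load_rows(io.StringIO(SAMPLE))
print(data["WCBP12236"]["J2"][2])  # field a of operator J2
```

With the rows stored this way, any pair of operators for an ID can be looked up directly, which is what makes the later comparisons simple.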

Source

2013-06-24 10:10:34 TLP

The Python way:

import itertools
info=["WCBP12236,J1,75.7,80.6,65.9,83.2,82.1", 
    "WCBP12236,J2,76.3,79.6,61.7,81.9,82.1", 
    "WCBP12236,S1,77.2,81.5,69.4,84.1,82.1", 
    "WCBP12236,S2,68.0,68.0,53.2,68.5,82.1", 
    "WCBP12234,J1,63.7,67.7,72.2,71.6,75.3", 
    "WCBP12234,J2,68.6,68.4,41.4,68.9,80.4", 
    "WCBP12234,S1,81.8,82.7,67.0,87.5,75.3", 
    "WCBP12234,S2,66.6,67.9,53.0,70.7,72.7", 
    "WCBP12238,J1,78.6,79.0,56.2,82.1,82.1", 
    "WCBP12239,J2,66.6,72.9,79.5,76.6,75.3", 
    "WCBP12239,S1,86.6,87.8,23.0,23.0,82.1", 
    "WCBP12239,S2,86.0,86.9,62.3,89.7,82.1", 
    "WCBP12239,J1,70.9,71.3,66.0,73.7,75.3", 
    "WCBP12238,J2,75.1,75.2,54.3,76.4,82.1", 
    "WCBP12238,S1,65.9,66.0,40.2,66.5,80.4", 
    "WCBP12238,S2,72.7,73.2,52.6,73.9,72.7" ] 

def extract_data(operator_1, operator_2):
    """Interleave the values of the two requested operators for each ID."""
    operator_index = 1
    id_index = 0
    wanted = {operator_1.strip().upper(), operator_2.strip().upper()}
    data = {}
    for line in info:
        conv_list = line.split(",")
        if len(conv_list) <= operator_index:
            continue  # malformed line
        if conv_list[operator_index].strip().upper() not in wanted:
            continue  # row belongs to another operator
        key = conv_list[id_index]
        values = conv_list[operator_index + 1:]
        if key in data:
            # Second matching row for this ID: interleave it with the stored
            # row. The row read later comes first in each pair.
            data[key] = list(itertools.chain.from_iterable(zip(values, data[key])))
        else:
            data[key] = values
    return data

ret = extract_data("j1", "s2")
print(ret)

Output (dictionary key order may vary):

{'WCBP12239': ['70.9', '86.0', '71.3', '86.9', '66.0', '62.3', '73.7', '89.7', '75.3', '82.1'],
 'WCBP12238': ['72.7', '78.6', '73.2', '79.0', '52.6', '56.2', '73.9', '82.1', '72.7', '82.1'],
 'WCBP12234': ['66.6', '63.7', '67.9', '67.7', '53.0', '72.2', '70.7', '71.6', '72.7', '75.3'],
 'WCBP12236': ['68.0', '75.7', '68.0', '80.6', '53.2', '65.9', '68.5', '83.2', '82.1', '82.1']}

Source
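
The dictionary above pairs the values in whatever order the rows happen to be read, and it repeats the true value. A variant that always puts the first operator first and emits a single true column, in the layout the question asks for, might look like the sketch below. The names `ROWS` and `compare` are hypothetical, the rows are a small subset of the question's data, and skipping IDs that lack one of the two operators is an assumption about the desired behaviour.

```python
# Subset of the question's data, deliberately out of order.
ROWS = [
    "WCBP12236,J1,75.7,80.6,65.9,83.2,82.1",
    "WCBP12236,J2,76.3,79.6,61.7,81.9,82.1",
    "WCBP12234,J2,68.6,68.4,41.4,68.9,75.3",
    "WCBP12234,J1,63.7,67.7,72.2,71.6,75.3",
    "WCBP12238,J1,78.6,79.0,56.2,82.1,84.1",
]


def compare(rows, first, second):
    """Return one CSV line per ID: id,a1,a2,b1,b2,c1,c2,d1,d2,true."""
    by_id = {}
    for line in rows:
        rec_id, op, *values = [field.strip() for field in line.split(",")]
        by_id.setdefault(rec_id, {})[op] = values

    out = []
    for rec_id, ops in by_id.items():
        if first not in ops or second not in ops:
            continue  # assumption: skip IDs missing either measurement
        *m1, true_value = ops[first]  # last field is the true value
        m2 = ops[second][:-1]         # drop the duplicate true value
        paired = [v for pair in zip(m1, m2) for v in pair]
        out.append(",".join([rec_id, *paired, true_value]))
    return out


lines = compare(ROWS, "J1", "J2")
for line in lines:
    print(line)
```

Calling `compare` once per operator pair and writing each result to its own file would give the three comparison files; WCBP12238 is left out of this run because the subset has no J2 row for it.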

2013-06-24 12:07:24 RakeshSJoshi

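I did not use Text::CSV as TLP did. You could if you needed to, but for this example there are no embedded commas in the fields, so I did a simple split on ','. Also, the true fields from both operators are listed (instead of just 1), as I thought special-casing the last value would complicate the solution.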

#!/usr/bin/perl 
use strict; 
use warnings; 
use List::MoreUtils qw/ mesh /; 

my %data; 

while (<DATA>) { 
    chomp; 
    my ($id, $op, @vals) = split /,/; 
    $data{$id}{$op} = \@vals; 
} 

my @ops = ([qw/J1 J2/], [qw/S1 S2/], [qw/J1 S1/]); 

for my $id (sort keys %data) { 
    for my $comb (@ops) { 
     open my $fh, ">>", "@$comb.txt" or die $!; 
     my $a1 = $data{$id}{ $comb->[0] }; 
     my $a2 = $data{$id}{ $comb->[1] }; 
     print $fh join(",", $id, mesh(@$a1, @$a2)), "\n"; 
     close $fh or die $!; 
    } 
} 

__DATA__ 
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1 
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1 
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1 
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1 
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3 
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3 
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3 
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3 
WCBP12239,J1,78.6,79.0,56.2,82.1,82.1 
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1 
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1 
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1 
WCBP12238,J1,70.9,71.3,66.0,73.7,84.1 
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1 
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1 
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1

The output files produced are as follows:

J1 J2.txt

WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3,75.3 
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1,82.1 
WCBP12238,70.9,75.1,71.3,75.2,66.0,54.3,73.7,76.4,84.1,84.1 
WCBP12239,78.6,66.6,79.0,72.9,56.2,79.5,82.1,76.6,82.1,82.1

S1 S2.txt

WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3,75.3 
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1,82.1 
WCBP12238,65.9,72.7,66.0,73.2,40.2,52.6,66.5,73.9,84.1,84.1 
WCBP12239,86.6,86.0,87.8,86.9,23.0,62.3,23.0,89.7,82.1,82.1

J1 S1.txt

WCBP12234,63.7,81.8,67.7,82.7,72.2,67.0,71.6,87.5,75.3,75.3 
WCBP12236,75.7,77.2,80.6,81.5,65.9,69.4,83.2,84.1,82.1,82.1 
WCBP12238,70.9,65.9,71.3,66.0,66.0,40.2,73.7,66.5,84.1,84.1 
WCBP12239,78.6,86.6,79.0,87.8,56.2,23.0,82.1,23.0,82.1,82.1

Update: To get only 1 true value, the for loop could be written like this:

for my $id (sort keys %data) { 
    for my $comb (@ops) { 
     local $" = ''; 
     open my $fh, ">>", "@$comb.txt" or die $!; 
     my $a1 = $data{$id}{ $comb->[0] }; 
     my $a2 = $data{$id}{ $comb->[1] }; 
     pop @$a2; 
     my @mesh = grep defined, mesh(@$a1, @$a2); 
     print $fh join(",", $id, @mesh), "\n"; 
     close $fh or die $!; 
    } 
}

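Update: Added the 'defined' test to the grep EXPR, as that is the correct way (instead of just testing '$_', which could be 0 and would then be wrongly excluded from the list by grep).

Source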
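
The same pitfall can be illustrated in Python, as a rough analogy rather than a translation of the Perl. In this sketch `itertools.zip_longest` stands in for `mesh` (it pads the shorter list with `None`, much as `mesh` pads with undef), the helper name `mesh` and the numeric sample values are made up for the example, and a 0 is included on purpose to show what a plain truth test would lose.

```python
from itertools import zip_longest


def mesh(first, second):
    """Interleave two lists, padding the shorter one with None."""
    return [v for pair in zip_longest(first, second) for v in pair]


j1 = [75.7, 0, 65.9]   # hypothetical values; note the legitimate 0
j2 = [76.3, 79.6]      # one element shorter, as after the pop

meshed = mesh(j1, j2)

# Truth test: drops the padding but also drops the valid 0.
by_truth = [v for v in meshed if v]

# Explicit test: drops only the padding.
by_defined = [v for v in meshed if v is not None]

print(by_truth)
print(by_defined)
```

Testing for the padding value explicitly keeps every real measurement, which is the point of using `defined` in the grep above.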

2013-06-24 19:04:22
