2017-03-29 57 views
-3

我有大約20列的csv數據,每列將有多個不同的值。頂部之後的每一行是標題,是一個單獨的數據樣本。我想以編程方式縮小列表的範圍,以便我有最少數量的數據樣本,但列數據的每個排列仍然表示。perl中的多列最小表示發現

實例數據

SERIAL,ACTIVE,COLOR,CLASS,SEASON,SEATS 
.0xb468d47cc9749fb862990426ff79aafb,T,GREEN,BETA,SUMMER,3 
.0x847129b35bad62f5837eec30dc07a8a4,T,VIOLET,DELTA,SUMMER,1 
.0x14b8df88fd6d6547e387f4caa99e52fd,F,ORANGE,ALPHA,SUMMER,4 
.0x0a07fb97224caf79ea73d3fdd5495b8f,T,YELLOW,DELTA,WINTER,1 
.0x7d747e689bb27b60198283d7b86db409,F,READ,DELTA,SPRING,2 
.0x8247524df49bd19c4c316ee070a2dd4a,T,BLUE,GAMA,WINTER,2 
.0x4103ed42af6e8e463708a6c629907fb5,T,YELLOW,ALPHA,SPRING,5 
.0xc38deea7f02fbfbcdde1d3718d6decb4,T,YELLOW,DELTA,FALL,5 
.0xa3d562edcf64e151d7de08ff8f8e0a94,F,VIOLET,DELTA,SUMMER,3 
.0x9da58b3b05603325c24629f700c25c97,T,YELLOW,OMEGA,SPRING,4 
.0xef0c0e75083229d654c9b111e3af8726,T,BLUE,GAMA,FALL,1 
.0xa9022c8713f0aba2a8e1d20475a3104a,T,YELLOW,BETA,SUMMER,2 
.0x5bb5f73e6030730610866cee80cfc2fb,F,ORANGE,BETA,FALL,5 
.0xc202e5b43dd65525754fdc52b89e7375,T,BLUE,OMEGA,SUMMER,3 
.0xfac9145af33a74aedae7cc0442426432,F,READ,BETA,SPRING,1 
.0x457949648053f710b4f2d55cb237a91d,T,GREEN,BETA,SPRING,3 
.0xed94d4df300f10f5c4dc5d3ac76cf9e5,F,VIOLET,ALPHA,WINTER,15 
.0x870130135beed4cbbe06478e368b40b3,F,YELLOW,ALPHA,SPRING,3 
.0x3b6f17841edb9651e732e3ffbacbe14a,T,GREEN,OMEGA,SUMMER,3 
.0xfb30e054466b9e4cf944c8e48ff74c93,F,VIOLET,DELTA,SUMMER,8 
.0xf741ddc71b4a667585acaa35b67dc6c9,F,BLUE,BETA,FALL,4 
.0x60257ad6c299e466086cc6e5bb0a9a33,F,VIOLET,OMEGA,SPRING,1 
.0xa5d208bfee5a27a7619ba07dcbdaeea0,T,GREEN,OMEGA,FALL,1 
.0x53bc78fa8863e53e8c9fb11c5f6d2320,F,GREEN,GAMA,SPRING,2 
.0x5a01253ce5cb0a6aa5213f34f0b35416,T,READ,BETA,WINTER,3 
.0xaed9a979ba9f6fbf39895b610dde80f4,T,ORANGE,DELTA,WINTER,1 
.0xe7769918e36671af77b5d3d59ea15cfe,T,ORANGE,OMEGA,FALL,4 
.0x9e5327a1583332e4c56d29c356dbc5d2,T,INDEGO,ALPHA,WINTER,5 
.0x79c5c70732ff04b4d00e81ac3a07c3b7,T,READ,OMEGA,FALL,5 
.0x55f54d3c9cd2552e286364894aeef62a,F,READ,GAMA,SPRING,15 
+0

你到目前爲止嘗試過什麼?請[編輯]您的問題以包含代碼。 –

回答

0

使用哈希,以確定您是否見過一個特定的列組合,然後用它來確定是否打印一個特定的行。

這裏是一個相當基本的例子來說明的想法:

filter.pl

#!/usr/bin/env perl 

use warnings; 
use strict; 

die "usage: $0 file col1,col2,col3, ... coln\n" unless @ARGV; 
my ($file, $columns) = @ARGV; 
-f $file or die "$file does not exist!"; 
defined $columns or die "need to pass in columns!"; 

my @columns; 
for my $col (split /,/, $columns) { 
    die "Invalid column id $col" unless $col >= 1; # 1-based 
    push @columns, $col - 1; # 0-based 
} 
scalar @columns or die "No columns!"; 

open my $fh, "<", $file or die "Unable to open $file : $!"; 
my %uniq; 
while (<$fh>) { 
    chomp(); 
    next if $. == 1; # Skip Header 
    my (@data) = split /,/, $_; # Use Text::CSV for any non-trivial csv file 
    my $key = join '|', @data[ @columns ]; # key will look like 'foo|bar|baz' 
    if (not defined $uniq{ $key }) { 
     print $_ . "\n"; # Print the whole line with the first unique set of columns 
     $uniq{ $key } = 1; # Now we have seen this combo 
    } 
} 

data.csv

SERIAL,TRUTH,IN,PARALLEL 
123,TRUE,YES,5 
124,TRUE,YES,5 
125,TRUE,YES,3 
126,TRUE,NO,5 
127,FALSE,YES,1 
128,FALSE,YES,3 
129,FALSE,NO,7 

輸出

perl filter.pl data.csv 2,3 
123,TRUE,YES,5 
126,TRUE,NO,5 
127,FALSE,YES,1 
129,FALSE,NO,7 
+0

就是這樣!我知道我必須跟蹤我已經看到的內容,但我不知道如何將其應用於數據。謝謝! – jbird468