Matching lines across multiple CSV files and merging a particular field

2010-07-29 105 views
2

I have about 20 CSV files that all look like this:

"[email]","[fname]","[lname]","[prefix]","[suffix]","[fax]","[phone]","[business]","[address1]","[address2]","[city]","[state]","[zip]","[setdate]","[email_type]","[start_code]" 

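I have been told that I need to produce the same files, but with each file now also containing the start_code of every other file in which the email matches.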

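It doesn't matter if any of the other fields don't match; only the email field is important. The only change to each file is to add any additional start_code values from the other files where the email matches.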

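For example, if the same email appears in wicq.csv, oota.csv, and itos.csv, the three lines below (one from each file) would be replaced by the single merged line that follows them: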

"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX" 
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"OOTA" 
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"ITOS" 

"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX, OOTA, ITOS" 

in all three files (wicq.csv, oota.csv, and itos.csv).

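The tools I have available are the OS X command line (awk, sed, etc.) and Perl, although I'm not very familiar with them, and there may well be a better way to do this.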

So the modified value 'WIQC, PDX, OOTA, ITOS' would be pushed into each of these three CSV files? – Anders 2010-07-29 17:54:17

@Anders, yes. (Although WICQ PDX is a single value, not two as written in your comment.) – alex 2010-07-29 19:12:27

Answers

1
use strict; 
use warnings; 
use Text::CSV_XS; 

# Supply csv files as command line arguments. 
my @csv_files = @ARGV; 
my $parser = Text::CSV_XS->new; 

# In my test data, the email is the first field. The field 
# to be merged is the second. Adjust accordingly. 
my $EMAIL_i = 0; 
my $MERGE_i = 1; 

# Process all files, creating a set of key-value pairs: 
# $sc{EMAIL} = [ LIST OF VALUES OBSERVED IN THE MERGE FIELD ] 
my %sc; 
for my $cf (@csv_files){
    open(my $fh_in, '<', $cf) or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;
        push @{ $sc{$fields[$EMAIL_i]} }, $fields[$MERGE_i];
    }
}

# Process the files again, writing new output. 
for my $cf (@csv_files){
    open(my $fh_in, '<', $cf) or die $!;
    open(my $fh_out, '>', "${cf}_new.csv") or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;

        $fields[$MERGE_i] = join ', ', @{ $sc{$fields[$EMAIL_i]} };

        $parser->print($fh_out, \@fields);
        print $fh_out "\n";
    }
}

This worked great! I had to throw in 'binmode $fh_in, ":utf8";' and manually clean up some blank lines in each file (:g/^$/d), but it worked. Thanks. – alex 2010-08-03 16:38:35

0

I would approach this by doing something along these lines:

cut -d ',' -f1,16 *.csv | 
    sort | 
    awk -F, '{d=""; if (array[$1]) d=","; array[$1] = array[$1] d $2} END { for (i in array) print i "," array[i]}' | 
    while IFS="," read -r email start; do sed -i "/^$email,/ s/,[^,]*\$/,$start/" *.csv; done 

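This builds a list of all the emails and start_codes (cut/sort) and merges the codes for each email (awk). It then replaces (sed) the start code of every matching email in every file (while).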

But I feel there must be a more efficient way.

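I renamed all the files to start with a lowercase letter, because any file starting with an uppercase letter gave this error: 'sed: 1: "R2R.csv": invalid command code R'. Now I'm getting this error instead: 'sed: 1: "bwtl.csv": undefined label 'wtl.csv''. I think this comes from the same underlying problem, namely that sed is treating the file name as the command. – alex 2010-07-29 19:22:51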

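@alex: Double-check that you aren't missing the space before the asterisk and that you don't have any misplaced quotes. Are you on a GNU-based (e.g. Linux) system? Do your files have slashes in the data? You could try changing the delimiter in the sed command to a pipe ('s|old|new|') or some other character that doesn't occur in the data. – 2010-07-29 19:41:55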
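One likely explanation for the errors quoted above, assuming the command was run with the stock BSD sed on OS X (the thread never confirms the cause): BSD sed requires an argument after -i, the backup suffix. In 'sed -i "script" *.csv' the script is therefore consumed as the suffix and the first file name is parsed as the sed command, which matches the "invalid command code R" message for R2R.csv. A minimal sketch of a form accepted by both GNU and BSD sed, with the suffix attached directly to -i; the file demo.csv and its contents are made up for illustration:

```shell
#!/bin/sh
# Hypothetical demo file in a scratch directory; not data from the thread.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '"a@example.com","anon","OOTA"' > demo.csv

# -i.bak (no space) edits in place and keeps demo.csv.bak as a backup.
# '|' is the s command delimiter, as suggested in the comment above.
sed -i.bak 's|,"[^"]*"$|,"OOTA, ITOS"|' demo.csv

cat demo.csv
```

On BSD sed alone, -i '' (an empty suffix as a separate argument) skips the backup, but that spelling is not accepted by GNU sed.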

0

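Here is a simple Perl program that does what you need. It makes a single pass over the input by relying on the fact that the input is pre-sorted.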

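As long as the email stays the same, it reads lines and appends the code. When the email changes, it prints the record (and fixes the extra double quotes in the code field).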

#!/usr/bin/perl -l 

use strict; 
use warnings; 

my $last_email = undef; 
my @current_record =(); 
my @fields =(); 

sub print_record { 
    # Remove repeated double quotes introduced when we appended the code 
    $current_record[15] =~ s/""/, /g; 
    print join ",", @current_record; 
    @current_record =(); 
} 

while (my $input_line = <>) { 
    chomp $input_line; 
    @fields = split ",", $input_line; 

    # Print a record when the email we read changes. Avoid printing on the first 
    # loop by checking we have read at least one email ($last_email is defined). 
    defined $last_email && ($fields[0] ne $last_email) && print_record; 

    if (!@current_record) {
        # We are starting to process a new email. Grab all fields.
        @current_record = @fields;
    }
    else {
        # We have consecutive records with the same email. Append the code.
        $current_record[15] .= $fields[15];
    }

    # Remember the last processed email. When it changes we will print @current_record. 
    $last_email = $fields[0]; 
} 

# Print the last record 
print_record 

The -l switch makes print append a newline character automatically (whatever the OS is).

Call it like this:

sort *.csv | ./script.pl 