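Edit: solution added. Perl: merge 2 CSV files on a primary key
Hi, I currently have something working, although the code is slow.
It merges 2 CSV files line by line using a primary key. For example, if file 1 has the line: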
"one,two,,four,42"
and file 2 has this line:
"one,,three,,42"
where the primary key, at zero-based index $position = 4, is 42;
then the sub merge_file($file1, $file2, $outputfile, $position);
will output a file containing the line:
"one,two,three,four,42";
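Each primary key is unique within its file, and a key may exist in one file but not in the other (and vice versa).
There are about 1 million lines in each file.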
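Going through each line of the first file, I use a hash that stores the primary key as the key and the line number as the value. The line number is the index into an array, file1array[line number], which holds every line of the first file.
Then I go through each line of the second file and check whether its primary key is in the hash. If it is, I fetch the line from file1array, copy the columns I need from the first array into the second array, and concatenate the result onto the end of the output. Then I delete the hash entry, and at the very end I dump the whole thing to the file. (I am using an SSD, so I want to minimise file writes.)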
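On a tiny in-memory example, the index described above might look like the sketch below. `build_index` is a hypothetical name and the two sample lines are made up; it uses 0-based array indices where the post uses `$.` line numbers, and it assumes plain CSV without quoted fields.

```perl
use strict;
use warnings;

# Hypothetical helper: map each primary key to the index of its line in @$lines.
sub build_index {
    my ($lines, $position) = @_;
    my %line_for;
    for my $n (0 .. $#$lines) {
        my $key = (split /,/, $lines->[$n], -1)[$position];
        $line_for{$key} = $n;    # key -> line index
    }
    return \%line_for;
}

my @file1_lines = ('one,two,,four,42', 'a,b,,d,7');    # illustrative data only
my $position    = 4;                                    # zero-based key column
my $line_for    = build_index(\@file1_lines, $position);

# Look up the file-1 line that matches one file-2 line, if there is one.
my $line2 = 'one,,three,,42';
my $key   = (split /,/, $line2, -1)[$position];
if (exists $line_for->{$key}) {
    print "key $key matches file-1 line: ", $file1_lines[ $line_for->{$key} ], "\n";
    delete $line_for->{$key};    # whatever remains exists only in file 1
}
print "keys only in file 1: ", join(' ', sort { $a <=> $b } keys %$line_for), "\n";
```

Each lookup is a single hash access, so the matching step by itself is roughly constant time per line of the second file.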
The code probably explains it best:
sub merge_file2{
    my ($file1,$file2,$out,$position) = @_;
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = '';
    my %line_for;
    my @file1array;
    open FILE1, "<", $file1 or die "cannot open $file1: $!";
    print "$file1 opened\n";
    while (<FILE1>){
        chomp;
        $line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
        $file1array[$.] = $_; #store line in file1array.
    }
    close FILE1;
    open FILE2, "<", $file2 or die "cannot open $file2: $!";
    print "$file2 opened - merging..\n";
    my @from1to2 = qw(2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
    while (<FILE2>){
        print "$.\n" if ($.%1000) == 0;
        chomp;
        my @array2 = split /,/, $_; #split 2nd csv line by commas
        my $key = $array2[$position];
        if (defined $key and exists $line_for{$key}){ #only merge when the key is also in file 1
            #look up the key's line number in the hash, then fetch that line of file 1
            my @array1 = split /,/, $file1array[$line_for{$key}];
            #my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)
            foreach(@from1to2){
                $array2[$_] = $array1[$_];
            }
            delete $line_for{$key};
        }
        my $outstring = join ",", @array2;
        $OUTSTRING.=$outstring."\n";
    }
    close FILE2;
    print "adding rest of lines\n";
    foreach my $key (sort { $a <=> $b } keys %line_for){
        $OUTSTRING.= $file1array[$line_for{$key}]."\n";
    }
    print "writing file $out\n\n\n";
    write_line($out,$OUTSTRING);
}
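The first while loop is fine and takes less than 1 minute, but the second while loop takes about 1 hour to run, and I am wondering whether I have taken the right approach. I think a lot of speed-up should be possible? :) Thanks in advance.
Solution: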
sub merge_file3{
    my ($file1,$file2,$out,$position,$hsize) = @_;
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $header;
    my (@file1,@file2);
    open FILE1, "<", $file1 or die "cannot open $file1: $!";
    while (<FILE1>){
        if ($.==1){
            $header = $_;
            next;
        }
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file1, [split ',', $_];
    }
    close FILE1;
    open FILE2, "<", $file2 or die "cannot open $file2: $!";
    while (<FILE2>){
        next if $.==1;
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file2, [split ',', $_];
    }
    close FILE2;
    print "sorting files\n";
    my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
    my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;
    print "sorted\n";
    @file1 = (); #empty the unsorted arrays (assigning undef would leave one undef element)
    @file2 = ();
    #foreach my $line (@file1){print "\t [ @$line ],\n"; }
    my ($i,$j) = (0,0);
    my $last2 = $#sortedf2; #fix the bound now: rows pushed below must not be scanned again
    while ($i <= $#sortedf1 and $j <= $last2){
        my $key1 = $sortedf1[$i][$position];
        my $key2 = $sortedf2[$j][$position];
        if ($key1 == $key2){
            foreach(0..$hsize){ #header size.
                my $field = $sortedf1[$i][$_];
                $sortedf2[$j][$_] = $field if defined $field and length $field;
            }
            $i++;
            $j++;
        }
        elsif ($key1 < $key2){
            push(@sortedf2,[@{$sortedf1[$i]}]); #key only in file 1
            $i++;
        }
        else{
            $j++; #key only in file 2
        }
    }
    push(@sortedf2,[@{$sortedf1[$i++]}]) while $i <= $#sortedf1; #rows left over from file 1
    #foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }
    print "outputting to file\n";
    open OUT, ">", $out or die "cannot open $out: $!";
    print OUT $header;
    foreach(@sortedf2){
        print OUT join(",", @{$_}), "\n";
    }
    close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
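The two-index walk at the heart of merge_file3 can be tried on small in-memory data. The sketch below is a simplified variant rather than the posted code: `merge_sorted` is a hypothetical name, the sample rows are made up, and it collects rows into a separate output list in key order instead of appending file-1-only rows to the end of the file-2 array.

```perl
use strict;
use warnings;

# Hypothetical helper: merge two lists of rows (array refs), each already
# sorted numerically on the key column and with unique keys.
sub merge_sorted {
    my ($rows1, $rows2, $position) = @_;
    my @out;
    my ($i, $j) = (0, 0);
    while ($i < @$rows1 and $j < @$rows2) {
        my $cmp = $rows1->[$i][$position] <=> $rows2->[$j][$position];
        if ($cmp == 0) {
            # Same key: start from the file-2 row, fill in non-empty file-1 fields.
            my @row = @{ $rows2->[$j] };
            for my $c (0 .. $#{ $rows1->[$i] }) {
                my $v = $rows1->[$i][$c];
                $row[$c] = $v if defined $v and length $v;
            }
            push @out, \@row;
            $i++; $j++;
        }
        elsif ($cmp < 0) { push @out, [ @{ $rows1->[$i++] } ]; }    # only in file 1
        else             { push @out, [ @{ $rows2->[$j++] } ]; }    # only in file 2
    }
    # One input is exhausted; copy whatever is left of the other.
    push @out, map { [@$_] } @{$rows1}[ $i .. $#$rows1 ];
    push @out, map { [@$_] } @{$rows2}[ $j .. $#$rows2 ];
    return \@out;
}

my @f1 = map { [ split /,/, $_, -1 ] } ('a,b,,d,7', 'one,two,,four,42');
my @f2 = map { [ split /,/, $_, -1 ] } ('one,,three,,42', 'x,,y,,99');
print join(',', @$_), "\n" for @{ merge_sorted(\@f1, \@f2, 4) };
# prints:
# a,b,,d,7
# one,two,three,four,42
# x,,y,,99
```

After the two sorts, the walk visits each row once, so the merge step is linear in the total number of rows.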
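(For reference, on arrays of arrays, which replaced the file-opening part and are saner: http://sunsite.ualberta.ca/Documentation/Misc/perl-5.6.1/pod/perllol.html) – Dave 2010-06-27 14:06:56
I think there is still plenty of room for optimisation, but if it is fast enough, go with it. – 2010-06-27 18:47:54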