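How can I remove duplicates from a huge XML file, but keep the entry with the most recent 'changedate' attribute, using XML::Twig in Perl?
I have the following huge XML file: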
<?xml version="1.0" encoding="utf-16"?>
<!DOCTYPE tmx SYSTEM "56.dtd">
<body>
<tu changedate="20130625T175037Z">
<tuv xml:lang="pt-pt">
<prop type="x-context-pre"><seg>Some text.</seg></prop>
<prop type="x-context-post"><seg>Other text.</seg></prop>
<seg>The text I'm interested.</seg>
</tuv>
<tuv xml:lang="it">
<seg>And it's translation in italian.</seg>
</tuv>
</tu>
.... followed by other <tu>'s
</body>
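I am using a hash that maps the content of each 'seg' to its changedate. In the handler I check whether the 'seg' has already been seen and, if so, whether the version just found is older; if it is, I delete it. The problem with this approach is that when the version just found is newer, I cannot delete the older one, because of the way the XML file is parsed. Here is the code I have so far: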
use 5.010;
use strict;
use warnings;
use XML::Twig;
use Digest::MD5 qw(md5);
use Encode qw(encode_utf8);

my $filename     = 'pt_PT-it_IT.tmx';
my $out_filename = 'out.xml';

open my $out, '>', $out_filename or die "Cannot open $out_filename: $!";
binmode $out;

my $original_twig = XML::Twig->new(
    pretty_print  => 'indented',
    twig_handlers => { tu => \&original_tu },
);
$original_twig->parsefile($filename);
$original_twig->flush($out);
close $out;

{
    # Maps the MD5 digest of a 'seg' text to the date of the first <tu> seen with that text.
    my %md5;

    sub original_tu {
        my ($twig, $original_tu) = @_;

        my $original_seg        = $original_tu->first_child('tuv')->first_child('seg')->text;
        my $original_changedate = $original_tu->att('changedate');
        $original_changedate = substr $original_changedate, 0, 8;    # keep YYYYMMDD only

        # md5() works on bytes, so encode the text first.
        my $hash = md5(encode_utf8($original_seg));

        if (exists $md5{$hash}) {
            if ($md5{$hash} gt $original_changedate) {
                print "================================\n";
                print "DELETED\n";
                print $original_seg;
                print "\n BECAUSE ORIGINAL DATE: ";
                print $original_changedate;
                print " IS OLDER THAN THE FOUND ONE: ";
                print $md5{$hash};
                print "\n=================================\n";
                $original_tu->delete();
            }
            # Otherwise this <tu> is the newer one, and the older <tu> seen earlier is not deleted here.
        }
        else {
            $md5{$hash} = $original_changedate;
        }
    }
}
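To restate: how can I remove duplicates from a huge (700 MB) XML file while keeping the entry with the most recent 'changedate' value? Any help is appreciated.
Thanks!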
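One possible way around the "cannot delete the older one" problem is to parse the file twice. This is only a sketch under assumptions, not something tried on the real 700 MB file. The first pass records the newest changedate for each 'seg' text and purges every <tu> to keep memory low. The second pass writes out only the <tu> whose changedate matches that newest value. The names seg_key and dedupe_tmx are made up for illustration. Unlike the code above, the sketch compares the full changedate string instead of the first 8 characters, keeps the first <tu> when two duplicates share the same date, and writes the output as UTF-8 (the XML declaration may still say utf-16, so the encoding probably needs separate handling).

```perl
use 5.010;
use strict;
use warnings;
use XML::Twig;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: digest of the <seg> that is a direct child of the first <tuv>.
sub seg_key {
    my ($tu) = @_;
    my $tuv = $tu->first_child('tuv') or return;
    my $seg = $tuv->first_child('seg') or return;
    return md5_hex(encode_utf8($seg->text));    # md5 needs bytes, not characters
}

# Hypothetical helper: two-pass de-duplication from file $in to file $out.
sub dedupe_tmx {
    my ($in, $out) = @_;

    # Pass 1: remember the newest changedate per segment, nothing is written.
    my %newest;
    XML::Twig->new(
        twig_handlers => {
            tu => sub {
                my ($t, $tu) = @_;
                my $key  = seg_key($tu);
                my $date = $tu->att('changedate') // '';
                if (defined $key
                    && (!exists $newest{$key} || $date gt $newest{$key})) {
                    $newest{$key} = $date;
                }
                $t->purge;    # free everything parsed so far
                return 1;
            },
        },
    )->parsefile($in);

    # Pass 2: drop every <tu> that is not the newest one for its segment.
    open my $fh, '>:encoding(UTF-8)', $out or die "Cannot open $out: $!";
    my %written;
    my $twig = XML::Twig->new(
        pretty_print  => 'indented',
        twig_handlers => {
            tu => sub {
                my ($t, $tu) = @_;
                my $key  = seg_key($tu);
                my $date = $tu->att('changedate') // '';
                if (defined $key
                    && ($date ne $newest{$key} || $written{$key}++)) {
                    $tu->delete;        # older duplicate, or a tie already written
                }
                else {
                    $t->flush($fh);     # write what is kept and free the memory
                }
                return 1;
            },
        },
    );
    $twig->parsefile($in);
    $twig->flush($fh);                  # write the remaining closing tags
    close $fh or die "Cannot close $out: $!";
    return;
}

# Run only when called with two arguments: input file and output file.
dedupe_tmx(@ARGV) if @ARGV == 2;
```

Saved under a hypothetical name such as dedupe.pl, it could be run as: perl dedupe.pl pt_PT-it_IT.tmx out.xml. The cost is reading the file twice, but the only thing held in memory between the passes is one digest and one date per distinct segment.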
Good idea, I'll try it, thanks. – dasen