Removing unique lines from a text file with Perl

I am doing some filtering on a multi-column text file.

The file has the following format:

C1 C2 C3 C4 
1 .. .. .. 
2 .. .. .. 
3 .. .. .. 
3 .. .. .. 
3 .. .. ..

I want to delete all rows that have a unique value in column 1, so the output should look like this:

C1 C2 C3 C4 
3 .. .. .. 
3 .. .. .. 
3 .. .. ..

I am applying several filtering steps to this file. This is the script I am working with:

my $DATA;
my $filename = $ARGV[0];
unless ($filename) {
    print "Enter filename:\n";
    $filename = <STDIN>;
    chomp $filename;
}
open($DATA, '<', $filename) or die "Could not open file $filename: $!";
open($OUT, '+>', "processed.txt") or die "Can't write new file: $!";

while (<$DATA>) {
    next if /^\s*#/;    # skip comment lines
    print $OUT $_;
}

close $OUT;

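As you can see, the script works in a while loop, where I already use a next statement to remove comment lines from the file. Now I want to add a command to this loop that deletes all rows with a unique value in column 1.

Can anyone help me?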

2013-03-23 user1987607

Do you care about the order in which the lines come out? – Mat 2013-03-23 21:03:00

Mostly stolen from ikegami and Mattan:

print "header: ", scalar(<>); 
print "multis: \n"; 

my %seen; 
while (<>) { 
    next if /^\s*#/; 
    my ($id) = /^(\S+)/; 
    ++$seen{$id}{count}; 
    if (1 == $seen{$id}{count}) { 
        # store first occurrence 
        $seen{$id}{line} = $_; 
    } elsif (2 == $seen{$id}{count}) { 
        # print first & second occurrence 
        print $seen{$id}{line}; 
        print $_; 
    } else { 
        # print third and later occurrences 
        print $_; 
    } 
}

but it preserves the order and uses only one loop.
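The one-pass idea can also be wrapped in a subroutine so it is easy to try on a few lines. This is only a sketch: the name `keep_repeated` is hypothetical, and it takes a list of lines instead of reading `<>`. Note that the first occurrence is held back until the second one arrives, so the output matches the input order only when rows with the same id sit next to each other (as in the sorted sample file).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines whose first field occurs more than once.
sub keep_repeated {
    my @lines = @_;
    my (%seen, @out);
    for my $line (@lines) {
        next if $line =~ /^\s*#/;           # skip comment lines
        my ($id) = $line =~ /^(\S+)/;       # first column
        next unless defined $id;            # skip blank lines
        my $count = ++$seen{$id}{count};
        if ($count == 1) {
            $seen{$id}{line} = $line;       # hold back the first occurrence
        } elsif ($count == 2) {
            push @out, $seen{$id}{line}, $line;   # release first and second
        } else {
            push @out, $line;               # third and later occurrences
        }
    }
    return @out;
}

print keep_repeated("1 a\n", "2 b\n", "3 c\n", "3 d\n", "3 e\n");
```

With the sample rows this prints only the three rows whose id is 3.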

Later:

On second thought, given

"Yes, they [the lines] should stay as they are now, which is in numerical order [of the ids]"

I can give back the stolen goods:

print "header: ", scalar(<>); 
print "multis: \n"; 

my $ol = scalar(<>);      # first/old line 
my $oi = 0 + (split(" ", $ol, 2))[0];  # first/old id 
my $bf = -1;        # assume old line must be printed 
do { 
    my $cl = scalar(<>);                  # current line 
    my $ci = 0 + (split(" ", $cl, 2))[0]; # current id 
    if ($oi != $ci) {                     # old and current id differ 
        $oi = $ci;                        # current id becomes old 
        $ol = $cl;                        # remember current/first line of current id 
        $bf = -1;                         # assume first/old line must be printed 
    } else {                              # old and current id are equal 
        if ($bf) {                        # first/old line of current id must be printed 
            print $ol;                    #   do it 
            $bf = 0;                      #   but not again 
        } 
        print $cl;                        # print current line for same id 
    } 
} while (! eof());
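The same adjacent-comparison idea, which needs no hash because equal ids are assumed to be next to each other, can be sketched as a subroutine. The name `keep_adjacent_runs` is hypothetical. Unlike the loop above, it compares ids as strings with `eq` (so non-numeric ids work too) and never reads past the end of the input when there is only one data line.

```perl
use strict;
use warnings;

# Hypothetical helper: assumes lines with equal ids are adjacent (e.g. sorted input).
sub keep_adjacent_runs {
    my @lines = @_;
    my (@out, $prev_line, $prev_id);
    my $pending = 0;                      # true while $prev_line is still unprinted
    for my $line (@lines) {
        my ($id) = split ' ', $line, 2;   # first whitespace-separated field
        next unless defined $id;          # skip blank lines
        if (defined $prev_id && $id eq $prev_id) {
            push @out, $prev_line if $pending;   # emit the held-back first line once
            $pending = 0;
            push @out, $line;
        } else {
            # new id: remember its first line, but do not emit it yet
            ($prev_id, $prev_line, $pending) = ($id, $line, 1);
        }
    }
    return @out;
}

print keep_adjacent_runs("1 a\n", "2 b\n", "3 c\n", "3 d\n", "3 e\n");
```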

2013-03-23 22:15:36

Thanks, this works perfectly. – user1987607 2013-03-24 10:02:06

First, let's get rid of the irrelevant stuff in your program.

while (<>) { 
    next if /^\s*#/; 
    print; 
}

OK, it looks like you aren't even extracting the value of the first column.

my ($id) = /^(\S+)/;

We don't know whether there will be duplicates until we have read further, so we need to store the lines for later use.

push @{ $by_id{$id} }, $_;

Once we have finished reading the file, we print the lines of every id that has more than one line.

for my $id (keys(%by_id)) { 
    print @{ $by_id{$id} } if @{ $by_id{$id} } > 1; 
}

Finally, you weren't handling the header, which can be done using

print scalar(<>);

All together, we get

print scalar(<>); 

my %by_id; 
while (<>) { 
    next if /^\s*#/; 
    my ($id) = /^(\S+)/; 
    push @{ $by_id{$id} }, $_; 
} 

for my $id (sort { $a <=> $b } keys(%by_id)) { 
    print @{ $by_id{$id} } if @{ $by_id{$id} } > 1; 
}

Usage:

script.pl file.in >processed.txt
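The script above sorts the ids numerically, which fits the sample file. If the ids might not be numeric, one alternative is to remember the order in which each id first appears. The following is a sketch of that variant only; the subroutine name `group_repeated` and the `@order` array are assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: groups lines by first column and returns the groups
# that have more than one line, in order of each id's first appearance.
sub group_repeated {
    my @lines = @_;
    my (%by_id, @order);
    for my $line (@lines) {
        next if $line =~ /^\s*#/;                     # skip comment lines
        my ($id) = $line =~ /^(\S+)/;
        next unless defined $id;                      # skip blank lines
        push @order, $id unless exists $by_id{$id};   # record first appearance
        push @{ $by_id{$id} }, $line;                 # autovivifies the array
    }
    return map { @{ $by_id{$_} } } grep { @{ $by_id{$_} } > 1 } @order;
}

print group_repeated("b 1\n", "a 1\n", "b 2\n", "c 1\n", "a 2\n");
```

Here the two `b` rows come out first, then the two `a` rows, and the single `c` row is dropped.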

2013-03-23 21:17:41 ikegami

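I tried your solution. All duplicate rows (with the same value in column 1) get filtered out. Rows with a unique value in column 1 remain in my file. So I get exactly the opposite of what I want. – user1987607 2013-03-23 21:57:17

Oops, you want to keep all the duplicates, not just one of them. – ikegami 2013-03-23 22:04:06

Yes, I want to keep all the duplicates and remove all the unique ones. – user1987607 2013-03-23 22:05:20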

my %id_count; 
while (my $line = <$DATA>) { 
    next if $line =~ /^\s*#/; 
    my ($id) = split(/\s+/, $line, 2);   # first column; a limit of 1 would return the whole line 
    $id_count{$id}{lines} .= $line; 
    $id_count{$id}{counter}++; 
} 

# keep only the ids seen more than once 
print $OUT join("", map { $id_count{$_}{lines} } grep { $id_count{$_}{counter} > 1 } keys %id_count);

Edit: If you want to preserve the sort order of the lines, just add a sort before the grep in the last line.
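One caveat when adding that sort: a bare `sort` compares strings, so an id of 10 would come before 2. For numeric ids a comparison block such as `{ $a <=> $b }` is probably what is wanted. A small sketch, using a hand-filled `%id_count` and a hypothetical subroutine name:

```perl
use strict;
use warnings;

# Sample data in the same shape as the hash built by the loop above.
my %id_count = (
    10 => { lines => "10 a\n10 b\n", counter => 2 },
    2  => { lines => "2 c\n2 d\n",   counter => 2 },
    7  => { lines => "7 e\n",        counter => 1 },
);

# Hypothetical helper: joins the lines of all repeated ids in numeric id order.
sub repeated_in_numeric_order {
    my ($counts) = @_;
    return join "",
        map  { $counts->{$_}{lines} }
        grep { $counts->{$_}{counter} > 1 }
        sort { $a <=> $b } keys %$counts;   # numeric, not string, comparison
}

print repeated_in_numeric_order(\%id_count);
```

This prints the rows for id 2 before those for id 10 and drops id 7.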

2013-03-23 21:22:08 Mattan

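This is done neatly with Tie::File, which lets you map an array onto a text file so that removing elements from the array also deletes the corresponding lines from the file.

The program makes two passes over the file: the first counts how many times each first-field value occurs, and the second deletes the lines whose first field is unique in the file.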

use strict; 
use warnings; 

use Tie::File; 

tie my @file, 'Tie::File', 'textfile.txt' or die $!; 

my %index; 

for (@file) { 
    $index{$1}++ if /^(\d+)/; 
} 

for (my $i = 1; $i < @file; ++$i) { 
    if ($file[$i] =~ /^(\d+)/ and $index{$1} == 1) { 
        splice @file, $i, 1; 
        --$i; 
    } 
}
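To try the two-pass approach without touching a real file, the same logic can be run against a temporary file. This is a sketch under a few assumptions: the subroutine name `drop_unique_rows` is hypothetical, the temporary file comes from the core File::Temp module, and the second pass walks the array backwards so that `splice` never shifts a row that has not been visited yet (an alternative to decrementing `$i`).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Tie::File;

# Hypothetical helper: edits the file at $path in place, keeping the header (index 0).
sub drop_unique_rows {
    my ($path) = @_;
    tie my @file, 'Tie::File', $path or die "Cannot tie $path: $!";

    # Pass 1: count how often each leading number occurs.
    my %count;
    for (@file) { $count{$1}++ if /^(\d+)/ }

    # Pass 2: walk backwards and delete rows whose leading number is unique.
    for (my $i = $#file; $i >= 1; --$i) {
        splice @file, $i, 1 if $file[$i] =~ /^(\d+)/ and $count{$1} == 1;
    }
    untie @file;
}

# Build a small sample file and filter it.
my ($fh, $path) = tempfile(UNLINK => 1);
print $fh "C1 C2\n1 a\n2 b\n3 c\n3 d\n";
close $fh;

drop_unique_rows($path);

open my $in, '<', $path or die "Cannot read $path: $!";
print <$in>;
close $in;
```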

2013-03-23 22:35:11 Borodin

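Tie::File can be notoriously slow (http://perlmonks.org/index.pl?node_id=1000412) with "large" files. – Kenosis 2013-03-24 04:26:06

@Kenosis: Please don't perpetuate these horror stories. People tend to avoid any technique that is considered less than the fastest, even when the difference is between a run time of one millisecond and ten milliseconds. Tie::File is perfectly adequate for the vast majority of practical applications. – Borodin 2013-03-24 12:50:58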
