How can I replace a particular word only when it occurs once within a subset of the data?

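Consider the data set below. Each block starts with a 'case' number. In the real data set I have hundreds of thousands of cases. If a case contains the word "Exclusion" only once (e.g. 10001), I want to replace the word "Exclusion" with "0".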

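If I loop over the lines, I can count how many times "Exclusion" appears in each case. But when only one line in a case uses the word "Exclusion", I don't know how to go back to that line and replace the word.

How can I do that?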

10001 
M1|F1|SP1;12;12;12;11;13;10;Exclusion;D16S539 
M1|F1|SP1;12;10;12;9;11;9;3.60;D16S 
M1|F1|SP1;12;10;10;7;11;7;20.00;D7S 
M1|F1|SP1;13;12;12;12;12;12;3.91;D13S 
M1|F1|SP1;11;11;13;11;13;11;3.27;D5S 
M1|F1|SP1;14;12;14;10;12;10;1.99;CSF 
10002 
M1|F1|SP1;8;13;13;8;8;12;2.91;D16S 
M1|F1|SP1;13;11;13;10;10;10;4.13;D7S 
M1|F1|SP1;12;9;12;10;11;16;Exclusion;D13S 
M1|F1|SP1;12;10;12;10;14;15;Exclusion;D5S 
M1|F1|SP1;13;10;10;10;17;18;Exclusion;CSF

Source

2013-10-30 vitor

sub process_block { 
    my ($block) = @_; 
    $block =~ s/\bExclusion\b/0/ 
     if $block !~ /\bExclusion\b.*\bExclusion\b/s; 
    print($block); 
} 

my $buf; 
while (<>) { 
    if (/^\d/) { 
     process_block($buf) if $buf; 
     $buf = ''; 
    } 

    $buf .= $_; 
} 

process_block($buf) if $buf;
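
For experimenting without an input file, the same block-wise idea can be sketched on an in-memory string. This is only an illustration: the helper name `replace_single_exclusion` and the shortened sample data are assumptions, not part of the answer above, and the check is done by counting matches rather than with the negated two-match regex.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole input as one string and returns it
# with "Exclusion" replaced by 0 in every case that contains it exactly once.
sub replace_single_exclusion {
    my ($text) = @_;

    # Split in front of every line that starts with a digit (a case header),
    # so each element holds one complete case block.
    my @blocks = split /^(?=\d)/m, $text;

    for my $block (@blocks) {
        # Count the occurrences by forcing list context on the match.
        my $count = () = $block =~ /\bExclusion\b/g;
        $block =~ s/\bExclusion\b/0/ if $count == 1;
    }
    return join '', @blocks;
}

# Shortened sample, modelled on the data in the question.
my $sample = <<'END';
10001
M1|F1|SP1;12;12;12;11;13;10;Exclusion;D16S539
M1|F1|SP1;12;10;12;9;11;9;3.60;D16S
10002
M1|F1|SP1;12;9;12;10;11;16;Exclusion;D13S
M1|F1|SP1;12;10;12;10;14;15;Exclusion;D5S
END

print replace_single_exclusion($sample);
```

Holding the whole input in one string only suits small files; for hundreds of thousands of cases the line-by-line loop above is the safer choice.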

Source

2013-10-30 03:27:41 ikegami

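Elegant! Performance-wise, I think this is very similar to @ChuckCottrill's solution, maybe even slightly better?

Thank you all for your solutions. ikegami, would it be easy to modify this so that it replaces the word "Exclusion" when it occurs not just once, but at most twice? I tried 'if $block !~ /\bExclusion\b.*\bExclusion\b.*\bExclusion\b/s;'. It works, but only the first occurrence is replaced. – vitor

Use 's///g'. '$block =~ s/\bExclusion\b/0/g if $block !~ /\bExclusion\b(?:.*\bExclusion\b){2}/s;'. – ikegami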
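
If the limit is likely to change again, counting the matches may be easier to adjust than the nested regex. A minimal sketch, assuming a hypothetical helper `replace_up_to` that receives the limit as `$max` (neither name appears in the thread):

```perl
use strict;
use warnings;

# Hypothetical helper: replace every "Exclusion" in a block with 0,
# but only when the block contains at most $max occurrences.
sub replace_up_to {
    my ($block, $max) = @_;

    # Count the occurrences by forcing list context on the match.
    my $count = () = $block =~ /\bExclusion\b/g;

    # /g replaces all occurrences, not just the first one.
    $block =~ s/\bExclusion\b/0/g if $count <= $max;
    return $block;
}

print replace_up_to("10002\na;Exclusion;x\nb;Exclusion;y\n", 2);
```

With `$max` set to 2 this makes the same decision as the regex in the comment above: a block with three or more occurrences is left untouched.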

As you read the file, buffer all the lines of a case and count the exclusions,

my ($case,$buf,$count) = (undef,"",0); 
while(my $ln = <>) {

use a regex to detect a case,

if($ln =~ /^\d+\s*$/) {   # \s* tolerates trailing whitespace after the case number
     #new case, process/print old case 
     $buf =~ s/;Exclusion;/;0;/ if($count==1); 
     print $buf; 
     ($case,$buf,$count) = ($ln,"",0); 
    }

then use a regex to detect 'Exclusion',

elsif($ln =~ /;Exclusion;/) { $count++; } 
    $buf .= $ln; 
}

When you are done, you may have one case left to process,

if(length($buf)>0) { 
    $buf =~ s/;Exclusion;/;0;/ if($count==1); 
    print $buf; 
}

Source

2013-10-30 02:32:35 ChuckCottrill

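Very nice. Very similar to @ikegami's answer, but with far less regex magic. I think this is easier for a beginner.

@Mikko Lipasti - Thanks! My plan was to give a solution that a beginner can approach and understand. – ChuckCottrill

"Far less regex magic" here means "no '\b'" (the regexes are otherwise trivial), and that is a bad thing. He uses ';' instead in one place (good), but not in the other. He also has poor coupling ('process_case' relies heavily on code outside of it), which makes it harder to understand, harder to maintain and more error-prone. – ikegami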
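
The point about delimiters can be made concrete. `\b` matches at any word boundary, while `;Exclusion;` insists on a semicolon on both sides, so the two patterns disagree on some inputs. The sketch below compares them with a third, field-anchored pattern; the helper name `match_kinds` and the sample lines are invented for illustration and do not come from the question's data.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of three patterns match a line,
# as a comma-separated list of 1/0 flags.
sub match_kinds {
    my ($line) = @_;
    my $word_boundary = $line =~ /\bExclusion\b/            ? 1 : 0;
    my $semicolons    = $line =~ /;Exclusion;/              ? 1 : 0;
    my $whole_field   = $line =~ /(?:^|;)Exclusion(?:;|$)/  ? 1 : 0;
    return join ',', $word_boundary, $semicolons, $whole_field;
}

# Invented sample lines.
my @samples = (
    'M1;12;Exclusion;D16S',      # the word as a complete middle field
    'M1;12;Non-Exclusion;D16S',  # the word inside a longer field
    'M1;12;D16S;Exclusion',      # the word as the last field
);

print "$_ => ", match_kinds($_), "\n" for @samples;
```

Which pattern is appropriate depends on what the real data can contain; in the sample from the question all three behave the same.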

This is the best I can come up with. It assumes you have read the file into @lines.

# separate into blocks, keyed by case number
my (%block, $key);
foreach my $line (@lines) {
    chomp($line);
    if ($line =~ m/^(\d+)/) {
        $key = $1;
    }
    else {
        push @{ $block{$key} }, $line;
    }
}

# go through each block; hash keys are unordered, so sort them numerically
# (this assumes the cases may be printed in numeric order)
foreach my $key (sort { $a <=> $b } keys %block) {
    print "$key\n";
    my @matched = grep { /exclusion/i } @{ $block{$key} };
    foreach my $line (@{ $block{$key} }) {
        # replace only when exactly one line in the block matched
        $line =~ s/Exclusion/0/i if @matched == 1;
        print "$line\n";
    }
}

Source

2013-10-30 03:09:37 sam

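There are already several correct answers here, all of which use a buffer to store the contents of a "case".

Here is another solution that uses tell and seek to rewind the file, so no buffer is needed. This can be useful when your "cases" are very large and you are sensitive to performance or memory usage.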

use strict; 
use warnings; 

open FILE, '<', "text.txt" or die "Cannot open text.txt: $!"; 
open REPLACE, '>', "replace.txt" or die "Cannot open replace.txt: $!"; 

my $count = 0;  # count of 'Exclusion' in the current case 
my $position = 0; 
my $prev_position = 0; 
my $first_occur_position = 0; # first occurrence of 'Exclusion' in the current case 
my $visited = 0; # whether the current line is visited before 

while (<FILE>) { 
    # keep track of the position before reading 
    # the current line 
    $prev_position = $position; 
    $position = tell FILE; 

    if ($visited == 0) { 
     if (/^\d+/) { 
      # new case 
      if ($count == 1) { 
       # rewind to the first occurrence 
       # of 'Exclusion' in the previous case 
       seek FILE, $first_occur_position, 0; 
       $visited = 1; 
      } 
      else { 
       print REPLACE $_; 
      } 
     } 
     elsif (/Exclusion/) { 
      $count++; 
      if ($count > 1) { 
       seek FILE, $first_occur_position, 0; 
       $visited = 1; 
      } 
      elsif ($count == 1) { 
       $first_occur_position = $prev_position; 
      } 
     } 
     else { 
      print REPLACE $_ if ($count == 0); 
     } 

     if (eof FILE && $count == 1) { 
      seek FILE, $first_occur_position, 0; 
      $visited = 1; 
     } 
    } 
    else { 
     if ($count == 1) { 
      s/Exclusion/0/; 
     } 
     if (/^\d+/) { 
      $position = tell FILE; 
      $visited = 0; 
      $count = 0; 
     } 
     print REPLACE $_; 
    } 
} 

close REPLACE; 
close FILE;
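
The tell/seek mechanics that this answer relies on can be seen in isolation in a few lines. The sketch below uses an in-memory filehandle and made-up data so that it runs without any file on disk; the variable names are illustrative only.

```perl
use strict;
use warnings;

# In-memory "file" so the sketch does not touch the disk (made-up data).
my $data = "10001\nfirst;Exclusion;x\nsecond;1.99;y\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $header = <$fh>;       # read the case line
my $mark   = tell $fh;    # byte offset where the next line starts
my $first  = <$fh>;       # read ahead ...
my $second = <$fh>;

# ... then rewind to the remembered offset (whence 0 = from start of file)
seek $fh, $mark, 0 or die "seek failed: $!";
my $again = <$fh>;        # the line after the header is returned again

print "re-read line matches the first read\n" if $again eq $first;
close $fh;
```

The larger script does the same thing with `$first_occur_position` as the remembered offset, which is why it has to record the position before each line is read.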

Source

2013-10-30 04:53:14
