2014-02-21 34 views

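How to remove duplicate lines while ignoring specific characters?

I need to remove all duplicate lines from a file, while ignoring every occurrence of these characters: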

(),、「」。!?# 

For example, these two lines would be considered duplicates, so one of them would be removed:

「This is a line。「 
This is a line 

Similarly, these three lines would be considered duplicates, and only one would remain:

This is another line、 with more words。 
「This is another line with more words。」 
This is another line! with more words! 
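  • It does not matter which of the duplicate lines is kept in the document.
  • The order of the lines must not change after the duplicates are removed.
  • Almost every line has significant punctuation, but the punctuation may differ between duplicates. Whichever line is kept may still contain punctuation, so punctuation must not be stripped from the final output.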

How can I remove all duplicate lines from a file while ignoring certain characters?

Answers

1

Judging from your example, you can delete the symbols first and then remove the duplicates.

For example:

$ cat foo 
«This is a line¡» 
This is another line! with more words¡ 

Similarly, these three lines would be considered duplicates, and only one would remain: 
This is a line 

This is another line, with more words! 
This is another line with more words 

$ tr --delete '¡!«»,' < foo | awk '!a[$0]++' 
This is a line 
This is another line with more words 

Similarly these three lines would be considered duplicates and only one would remain: 

$ 

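That seems to do the job.

Edit:

From your question, it sounded as if the symbols and punctuation marks were not important. You should be precise.

I don't have time to write it, but I think the simplest approach is to parse the file and keep an array of the lines that have already been printed: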

for each line: 
    cleanedLine = stripFromSymbol(line) 
    if cleanedLine not in AlreadyPrinted: 
        AlreadyPrinted.push(cleanedLine) 
        print line 
1

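Here is one way. You collect the lines into arrays keyed on a normalized version. Normalizing here means deleting all the unwanted characters and squeezing whitespace. The shortest version in each group is then chosen to print or save. The heuristic for which line to keep was not really specified, so season to taste. The code is a bit terse for production, so you may want to flesh it out for clarity.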

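use utf8; 
use strictures;               # CPAN module: strict plus fatal warnings 
use open qw/ :std :utf8 /; 

my %tree;                     # normalized line => list of original lines 
while (my $original = <DATA>) { 
    chomp $original; 
    # Delete the ignored characters (/d) and squeeze runs of spaces (/s). 
    (my $normalized = $original) =~ tr/ (),、「」。!?#/ /sd; 
    push @{$tree{$normalized}}, $original; 
    #print "O:",$original, $/; 
    #print "N:",$normalized, $/; 
} 

# Sort each group so that its shortest original line comes first. 
@{$_} = sort { length $a <=> length $b } @{$_} for values %tree; 

# Print one line per group. Hash iteration order is not the input order. 
print $_->[0], $/ for values %tree; 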

__DATA__ 
「This is a line。「 
This is a line 
This is a line 
This is another line、 with more words。 
This is another line with more words 
This is another line! with more words! 

Yields:

This is another line with more words 
This is a line