How to remove duplicate lines while ignoring specific characters?

I need to remove all duplicate lines from a file, while ignoring every occurrence of these characters:

（），、「」。！？#

For example, these two lines would be considered duplicates, so one of them would be removed:

「This is a line。「 
This is a line

Similarly, these three lines would be considered duplicates, and only one would remain:

This is another line、 with more words。 
「This is another line with more words。」 
This is another line！ with more words！

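It does not matter which of the duplicate lines is kept in the document.
The order of the lines should not change after the duplicates are removed.
Almost all lines have important punctuation, but the punctuation may vary. Whichever line is kept may still contain punctuation, so punctuation should not be stripped from the final output.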

How can I remove all duplicate lines from a file, while ignoring certain characters?

Source

2014-02-21 Village

Going by your example, you can delete your symbols first and then remove the duplicates.

For example:

$ cat foo 
«This is a line¡» 
This is another line! with more words¡ 

Similarly, these three lines would be considered duplicates, and only one would remain: 
This is a line 

This is another line, with more words! 
This is another line with more words 

$ tr --delete '¡!«»,' < foo | awk '!a[$0]++' 
This is a line 
This is another line with more words 

Similarly these three lines would be considered duplicates and only one would remain: 

$

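This seems to do the job.

Edit:

From your question, it seemed that the symbols/punctuation marks were not important. You should state that explicitly.

I don't have time to write it, but I think the easiest way would be to parse the file and keep an array of the lines already printed: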

for each line: 
    cleanedLine = stripFromSymbol(line) 
    if cleanedLine not in AlreadyPrinted: 
        AlreadyPrinted.push(cleanedLine) 
        print line
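
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `IGNORED`, `STRIP_TABLE` and `dedupe_ignoring` are made up here, the ignored characters are copied from the question, and a set stands in for the array of already-printed lines. Whitespace is compared as-is, so lines that differ only in spacing would not be merged.

```python
# Characters to ignore when comparing lines (copied from the question).
IGNORED = "（），、「」。！？#"
# Translation table that deletes every ignored character.
STRIP_TABLE = str.maketrans("", "", IGNORED)


def dedupe_ignoring(lines, table=STRIP_TABLE):
    """Yield each line whose stripped form has not been seen before."""
    seen = set()
    for line in lines:
        key = line.translate(table)  # comparison key only
        if key not in seen:
            seen.add(key)
            yield line  # the original line, punctuation intact


sample = [
    "「This is a line。「",
    "This is a line",
    "This is another line、 with more words。",
    "「This is another line with more words。」",
    "This is another line！ with more words！",
]
for kept in dedupe_ignoring(sample):
    print(kept)
```

With this sample, the first line of each group is the one that is kept, and the input order is preserved.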

Source

2014-02-21 11:50:25 fredtantini

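Here is one way to do it. You collect the lines into arrays keyed on their normalized versions. Normalizing here means removing all the unwanted characters and squeezing whitespace. Then pick the shortest version to print or keep. That heuristic (which version to keep) wasn't really specified, so season to taste. The code is a bit terse for production, so you may want to flesh it out for clarity.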

use utf8; 
use strictures; 
use open qw/ :std :utf8 /; 

my %tree; 
while (my $original = <DATA>) { 
    chomp $original; 
    (my $normalized = $original) =~ tr/ （），、「」。！？#/ /sd; 
    push @{$tree{$normalized}}, $original; 
    #print "O:",$original, $/;                              
    #print "N:",$normalized, $/;                             
} 

@{$_} = sort { length $a <=> length $b } @{$_} for values %tree; 

print $_->[0], $/ for values %tree; 

__DATA__ 
「This is a line。「 
This is a line 
This is a line 
This is another line、 with more words。 
This is another line with more words 
This is another line！ with more words！

Yields-

This is another line with more words 
This is a line
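
Note that `values %tree` returns the groups in no guaranteed order in Perl, which is why the output above does not follow the input order. As a rough sketch of the same grouping idea that keeps each group at the position of its first occurrence, something like the following Python might work. The names `normalize` and `shortest_per_group` are hypothetical, and trimming leading and trailing whitespace is an extra assumption that the Perl code does not make.

```python
import re

# Characters to ignore when comparing lines (copied from the question).
IGNORED = "（），、「」。！？#"
STRIP_TABLE = str.maketrans("", "", IGNORED)


def normalize(line):
    """Drop ignored characters, squeeze whitespace runs, trim the ends."""
    return re.sub(r"\s+", " ", line.translate(STRIP_TABLE)).strip()


def shortest_per_group(lines):
    """Return the shortest variant of each group, in first-seen order."""
    groups = {}  # dicts keep insertion order in Python 3.7+
    for line in lines:
        groups.setdefault(normalize(line), []).append(line)
    # min() returns the first of several equally short variants.
    return [min(variants, key=len) for variants in groups.values()]


data = [
    "「This is a line。「",
    "This is a line",
    "This is a line",
    "This is another line、 with more words。",
    "This is another line with more words",
    "This is another line！ with more words！",
]
print("\n".join(shortest_per_group(data)))
```

Swapping `min(variants, key=len)` for `variants[0]` would keep the first variant seen instead of the shortest one.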

Source

2014-02-21 18:26:23 Ashley
