Finding duplicate words

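As an exercise, I have to write a Perl program that checks whether the same word occurs more than once in a text file and then prints the words to a new file (without the duplicates).

Can someone help me? I understand that I can use the m// operator to look for words, but how do I look for words that I may not know in advance? For example, if the text file contains:

Hello, Hello, how are you? I would like to copy this to a new file without one of the 'Hello's. Of course, I don't know in advance whether the file contains any duplicate words... that is the point of the program: to search for duplicate words.

I have a basic script that sorts the words alphabetically, but step 2, finding the duplicate words... I can't figure out. Here is the script (hopefully correct so far):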

#!/usr/bin/perl 
use strict; 
use warnings; 

my $source = shift(@ARGV); 
my $cible = shift(@ARGV); 

open (SOURCE, '<', $source) or die ("Can't open $source\n"); 
open (CIBLE, '>', $cible) or die ("Can't open $cible\n"); 

my @lignes = <SOURCE>; 
my @lignes_sorted = sort (@lignes); 

print CIBLE @lignes_sorted; 

chomp @lignes; 
chomp @lignes_sorted; 

print "Original text : @lignes\n"; 

sleep (1); 

print "Sorted text : @lignes_sorted\n"; 

close(SOURCE); 
close (CIBLE);

2013-03-16 joesh

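Thanks Kamituel, I just edited it again so that the script is correct. I read the instructions too late (after posting). – joesh 2013-03-16 15:17:29

When you die, include the error message, '$!': 'die("Can't open $source: $!\n");' – 2013-03-17 03:56:25

Hi Andy, can you explain why I have to replace 'or die' with '$!: die'? Is that what you mean? – joesh 2013-03-17 08:51:52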
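To illustrate the suggestion in the comment above: the 'or die' stays where it is, and only the message changes. `$!` holds the operating system's reason for the last failed call, so including it in the message shows why open failed rather than only that it failed. A minimal sketch; the helper name open_or_explain and the file path are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): try to open a file for reading
# and return either the filehandle or an error message that includes $!.
sub open_or_explain {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or return (undef, "Can't open $path: $!");
    return ($fh, undef);
}

# This path is assumed not to exist, so open fails and $! is set.
my ($fh, $error) = open_or_explain('/no/such/dir/missing.txt');
print "$error\n" if defined $error;
```

On most systems the message ends with something like "No such file or directory"; the exact wording depends on the platform and locale.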

In Perl:

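#!/usr/bin/perl -w
use strict;

my $source = shift(@ARGV);
my $cible = shift(@ARGV);

# Include $! so the message says why the open failed
open (SOURCE, '<', $source) or die ("Can't open $source: $!\n");
open (CIBLE, '>', $cible) or die ("Can't open $cible: $!\n");

my @input = sort <SOURCE>;   # read all lines and sort them
my %words = ();              # words that have already been printed
foreach (@input) {
    # split ' ' splits on runs of whitespace, so repeated spaces
    # do not produce empty "words"
    foreach my $word (split(' ')) {
        print CIBLE $word." " unless (exists $words{$word});
        $words{$word} = 1;
    }
}

close(SOURCE);
close (CIBLE);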

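The basic idea is to split the whole text into individual words (using the split function) and then build a hash that uses each word as a key. When reading the next word, just check whether that word is already in the hash. If it is, it is a duplicate.

For the string Hello, Hello, how are you? it prints: Hello, how are you?.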

2013-03-16 15:25:33 kamituel

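Great, thanks. My original file has one word per line; it is a special kind of text. Your code puts the words on a single line. I should work out how to combine my code with yours. Any suggestions? – joesh 2013-03-16 17:04:50

Figured it out. I needed to add a "\n" on the line: print CIBLE $word."\n" unless (exists $words{$word}). Thanks a lot. I need to look up the '.' because I don't remember what it does in the code. – joesh 2013-03-16 17:09:06

Another small problem is that the script does not sort the words alphabetically the way my original script did. I will have to see whether I can work out where to put the sort command. – joesh 2013-03-17 08:40:33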
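One way the two pieces might be combined is to remove the duplicates first and sort the surviving words afterwards, printing one word per line. This is only a sketch under that assumption; the helper name unique_sorted_words and the sample lines are invented for illustration, and in the real script the lines would come from the SOURCE filehandle and the output would go to CIBLE.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the distinct words
# found in a list of lines, sorted alphabetically.
sub unique_sorted_words {
    my @lines = @_;
    my %seen;
    # Split every line into words, keep only the first occurrence of each.
    my @unique = grep { !$seen{$_}++ } map { split ' ', $_ } @lines;
    my @sorted = sort @unique;   # plain string comparison
    return @sorted;
}

my @lines = ("pear\n", "apple\n", "pear\n", "Hello, Hello, how are you?\n");
print "$_\n" for unique_sorted_words(@lines);
```

Note that the default sort compares strings by character code, so capitalised words such as Hello, come before lowercase ones. The '.' asked about above is Perl's string concatenation operator.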

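I don't know how to do this in Perl, but it can easily be done with sed and a couple of Unix tools. The algorithm would be:

Separate the individual words by replacing the spaces with newlines
Sort the words
Pipe the sorted word list through uniq with the -c option (which counts the words)
Delete the lines for words that occur only once (a count of 1 in the first column)

The following command does all of this

(replace \t with a literal TAB and \n with a literal ENTER)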

sed 's/[ \t,.][ \t,.]*/\n/g' filename | sort | uniq -c | sed '/^ *\<1\>/d'

Hope that helps.
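Since the question asks for Perl, the same counting approach could be sketched with a hash in place of sort and uniq -c. This is an illustration rather than a translation of the pipeline: the helper name repeated_words is made up, and the separator set simply mirrors the characters used in the sed expression.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count how often each word
# occurs and return only the words that occur more than once, formatted
# as "count word" and sorted by word.
sub repeated_words {
    my ($text) = @_;
    my %count;
    # Split on runs of spaces, tabs, newlines, commas and periods.
    $count{$_}++ for grep { length } split /[ \t\n,.]+/, $text;
    my @repeated = grep { $count{$_} > 1 } keys %count;
    return map { sprintf '%d %s', $count{$_}, $_ } sort @repeated;
}

print "$_\n" for repeated_words("Hello, Hello, how are you? how nice.");
```

Like the sed pipeline, this lists the words that are repeated together with their counts; it does not write a deduplicated copy of the text.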

2013-03-16 15:23:08 unxnut

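Removing duplicate words from a sentence is more complicated than it sounds. For example, if you split the sentence on whitespace, you get "words" such as Hello, that contain non-word characters, and that word is not treated as a duplicate of the real word Hello. There are many factors to consider, but assuming the simplest case, in which every character except whitespace is part of a legitimate word, you can do this: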

$ perl -anlwe '@F=grep !$seen{$_}++, @F; print "@F";' hello.txt 
Hello, how are you? 
yada Yada this is test material dupe Dupe 

$ cat hello.txt 
Hello, Hello, how are you? 
yada Yada this is test material dupe dupe Dupe

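As you can see, it does not consider yada and Yada to be duplicates. Nor does it consider Hello a duplicate of Hello,. You can adjust this by adding a call to lc or uc to remove the case dependency, and by allowing delimiters other than just whitespace.

What we do here is use a hash, %seen, to keep track of the words that have appeared before. The basic program is: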
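One possible form of that adjustment is sketched below: each word is lowercased and stripped of non-word characters before it is used as the hash key, while the original spelling of the first occurrence is what gets printed. The helper name dedupe_line and the decision to keep punctuation-only tokens are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove repeated words from one
# line, treating "Hello," and "hello" as the same word. The hash reference
# is shared between calls so duplicates are tracked across lines.
sub dedupe_line {
    my ($line, $seen) = @_;
    my @kept;
    for my $word (split ' ', $line) {
        (my $key = lc $word) =~ s/\W+//g;   # lowercase, drop punctuation
        # Tokens with no word characters at all are always kept.
        next if $key ne '' && $seen->{$key}++;
        push @kept, $word;
    }
    return "@kept";
}

my %seen;
print dedupe_line($_, \%seen), "\n"
    for "Hello, Hello, how are you?",
        "yada Yada this is test material dupe dupe Dupe";
```

With the sample input from hello.txt this prints Hello, how are you? followed by yada this is test material dupe.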

my %seen;                           # words seen so far, across all lines
while (<>) {                        # read the input file(s) or stdin
    my @F = split;                  # split $_ on whitespace (the default)
    @F = grep !$seen{$_}++, @F;     # remove duplicates
    print "@F\n";                   # print the elements space-separated
}

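The effect of !$seen{$_}++ is that the expression returns true the first time a new key is entered and false every time after that. How does it work? It happens in several distinct steps: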

$seen{$_}      # value for key $_ is fetched
$seen{$_}++    # value for key $_ is incremented, undef -> 1
               # $foo++ returns the value *before* it is incremented,
               # so it returns undef
!$seen{$_}++   # this is now "! undef", meaning "not false", as in true.

Values of 1 and above are all true, so the not operator negates them all to false.
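The steps above can be watched in a small trace. This is only a demonstration: the word list is made up, and the // (defined-or) operator used to display undef assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my %seen;
my @results;
for my $word (qw(yada dupe yada yada)) {
    # Value stored for this word before the increment ('undef' if none yet).
    my $before = $seen{$word} // 'undef';
    # The idiom itself: true only the first time the word is met.
    my $first = !$seen{$word}++ ? 'true' : 'false';
    push @results, "$word before=$before first=$first";
}
print "$_\n" for @results;
# prints:
# yada before=undef first=true
# dupe before=undef first=true
# yada before=1 first=false
# yada before=2 first=false
```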

2013-03-16 15:58:43 TLP

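Thank you very much. This is a bit more complicated than I'm used to, but I will study it carefully. – joesh 2013-03-16 17:03:41

I have been looking at the 'while' solution you suggested, but I'm not sure I know where to put it in the script. – joesh 2013-03-17 08:48:38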

If you are not worried about finding duplicate words that differ in case, you can do this with a single substitution.

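use strict;
use warnings;

my ($source, $cible) = @ARGV;

my $data;
{
    open (my $source_fh, '<', $source) or die ("Can't open $source: $!\n");
    local $/;                 # no record separator: read the whole file at once
    $data = <$source_fh>;
}

# Delete a word, and the non-word characters after it, when the same
# word follows immediately
$data =~ s/\b(\w+)\W+(?=\1\b)//g;

open (my $cible_fh, '>', $cible) or die ("Can't open $cible: $!\n");
print $cible_fh $data;
close ($cible_fh) or die ("Can't close $cible: $!\n");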
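To show what the substitution does and does not catch, here it is applied to two short strings. The wrapper name remove_adjacent_repeats and the sample strings are made up for the example. Because the lookahead requires the same word to follow directly, only adjacent repeats are removed; if the file is sorted with one word per line, as in the question, all duplicates are adjacent.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not from the thread) around the substitution above.
sub remove_adjacent_repeats {
    my ($text) = @_;
    $text =~ s/\b(\w+)\W+(?=\1\b)//g;
    return $text;
}

# Adjacent repeat: the first "Hello, " is removed.
print remove_adjacent_repeats("Hello, Hello, how are you?\n");   # prints "Hello, how are you?"

# Non-adjacent repeat: nothing is removed.
print remove_adjacent_repeats("dupe this dupe\n");               # prints "dupe this dupe"
```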

2013-03-17 03:43:32 Borodin
