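Perl - file encoding and word comparison

I have a file with one phrase/term per line, which I read in Perl from STDIN. I have a list of stopwords (such as "á", "são", "é"), and I want to compare each of them with each term and remove the term if they are equal. The problem is that I'm not sure which encoding the file uses.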

I get this from the file command:

words.txt: Non-ISO extended-ASCII English text

My Linux terminal is UTF-8, and it shows some of the words correctly and others not. Here is some of the output:

condi<E3> 
conte<FA>dos 
ajuda, mas não resolve 
mo<E7>ambique 
pedagógico são fenómenos

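As you can see, the 3rd and 5th lines show the words with accents and special characters correctly, while the others don't. The correct output for the other lines should be: condiã, conteúdos and moçambique.

If I use binmode(STDOUT, utf8), the "incorrect" lines are now output correctly, while the other ones aren't. For example, the 3rd line:

ajuda, mas nÃ£o resolve

What should I do, guys?

Source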

2011-05-05 Barata

It works like this:

C:\Dev\Perl :: chcp 
Aktive Codepage: 1252. 

C:\Dev\Perl :: type mixed-encoding.txt 
eins zwei drei KÃ¤se vier fÃ¼nf Wurst 
eins zwei drei Käse vier fünf Wurst 

C:\Dev\Perl :: perl mixed-encoding.pl < mixed-encoding.txt 
eins zwei drei vier fünf 
eins zwei drei vier fünf

where mixed-encoding.pl looks like this:

use strict; 
use warnings; 
use utf8; # source in UTF-8 
use Encode 'decode_utf8'; 
use List::MoreUtils 'any'; 

my @stopwords = qw(Käse Wurst); 

while (<>) { # read octets 
    chomp; 
    my @tokens; 
    for (split /\s+/) {
        # Try UTF-8 first. If that fails, assume legacy Latin-1.
        my $token = eval { decode_utf8 $_, Encode::FB_CROAK };
        $token = $_ if $@;
        push @tokens, $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n"; 
}

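Note that the script does not have to be encoded in UTF-8. It's just that if the script contains funky character data, you have to make sure the encoding matches: use utf8 if the source is encoded in UTF-8, and don't if it isn't.

Update based on tchrist's sound advice: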

use strict; 
use warnings; 
# source in Latin1 
use Encode 'decode'; 
use List::MoreUtils 'any'; 

my @stopwords = qw(Käse Wurst); 

while (<>) { # read octets 
     chomp; 
     my @tokens; 
     for (split /\s+/) { 
       # Try UTF-8 first. If that fails, assume 8-bit encoding. 
       my $token = eval { decode utf8 => $_, Encode::FB_CROAK }; 
       $token = decode Windows1252 => $_, Encode::FB_CROAK if $@; 
       push @tokens, uc $token unless any { $token eq $_ } @stopwords; 
     } 
     print "@tokens\n"; 
}

Source

2011-05-05 18:21:49 Lumi

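@Michael thanks, it outputs correctly now ;) I realized one more thing: most of the file is ISO-8859-1, with some parts in UTF-8 (that's why some of them were output correctly). I had to use the 'lc' function because my stopwords are all lowercase, and I ran into a problem when the phrase is not UTF-8. In that case, if it has an uppercase accented letter, the letter doesn't get lowercased. – Barata 2011-05-05 18:58:56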

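@Barata: You still need to decode the non-UTF-8 strings if you want 'uc' etc. to work. The Perl 5.12 (and up) 'unicode_strings' feature may also help, as it will assume that byte strings are in ISO 8859-1. Compare: perl -e 'print uc("\xB5\xE9\xDF")' => "µéß", which is wrong, with perl -M5.012 -e 'print uc("\xB5\xE9\xDF")' => "ΜÉSS", which is right. That last string is really "\x{39C}\x{C9}SS", or "\N{GREEK CAPITAL LETTER MU}\N{LATIN CAPITAL LETTER E WITH ACUTE}SS". The original string was "\N{MICRO SIGN}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER SHARP S}". – tchrist 2011-05-05 19:09:11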

@tchrist Using Michael's code, is it enough to check 'if $@' and decode the string as iso-8859-1? – Barata 2011-05-05 19:18:32
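
The point made in the comments above about 'lc' and 'uc' can be illustrated with a minimal sketch. It is not code from the thread: the helper name normalize_token is hypothetical, and it assumes the input contains only UTF-8 and ISO-8859-1 octets. The idea is to decode first and lowercase afterwards, so that both spellings of an accented capital compare equal to a lowercase stopword.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn raw octets into a lowercased character string.
# Assumption: the only encodings in play are UTF-8 and ISO-8859-1.
sub normalize_token {
    my ($octets) = @_;
    # Strict UTF-8 attempt; LEAVE_SRC keeps $octets intact for the fallback.
    my $text = eval { decode('UTF-8', $octets, FB_CROAK() | LEAVE_SRC()) };
    $text = decode('ISO-8859-1', $octets) unless defined $text;
    return lc $text;
}

# "É" as a single Latin-1 octet and as a two-octet UTF-8 sequence.
my $latin1 = "\xC9";
my $utf8   = "\xC3\x89";

# After decoding, both lowercase to the same character, U+00E9.
print normalize_token($latin1) eq "\x{E9}" ? "ok\n" : "not ok\n";
print normalize_token($utf8)   eq "\x{E9}" ? "ok\n" : "not ok\n";

# Without decoding (and without the unicode_strings feature), lc leaves
# the Latin-1 octet untouched, which is the problem described above.
print lc($latin1) eq "\xC9" ? "unchanged\n" : "changed\n";
```

Stopwords would need to be character strings as well (for example written with use utf8 in a UTF-8 source file) for the comparison with eq to be meaningful.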

I strongly suggest that you create a filter that takes a file with lines in mixed encodings and converts them to pure UTF-8. Then, instead of

open(INPUT, "< badstuff.txt") || die "open failed: $!";

you would open either the fixed version or a pipe from the fixer, like:

open(INPUT, "fixit < badstuff.txt |") || die "open failed: $!"

In either case, you would then call

binmode(INPUT, ":encoding(UTF-8)") || die "binmode failed";

The fixit program could then just do this:

use strict; 
use warnings; 
use Encode qw(decode FB_CROAK); 

binmode(STDIN, ":raw") || die "can't binmode STDIN"; 
binmode(STDOUT, ":utf8") || die "can't binmode STDOUT"; 

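while (my $line = <STDIN>) { 
    # Decode into a separate variable: a failed eval returns undef, and the 
    # raw octets in $line are still needed for the fallback. 
    my $text = eval { decode("UTF-8", $line, FB_CROAK() | Encode::LEAVE_SRC()) }; 
    if ($@) { 
        $text = decode("CP1252", $line, FB_CROAK()); # no eval{}! 
    } 
    $text =~ s/\R\z/\n/; # fix raw mode reads 
    print STDOUT $text; 
} 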

close(STDIN) || die "can't close STDIN: $!"; 
close(STDOUT) || die "can't close STDOUT: $!"; 
exit 0;

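See how that works? Of course, you could change it to default to some other encoding, or have multiple fallbacks. It would probably be best to take a list of them in @ARGV.

Source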
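
The multiple-fallback idea could be sketched roughly as follows. This is an illustration under assumptions, not part of the original fixit program: the helper name decode_with_fallbacks and the default list (UTF-8, then CP1252) are made up for the example, and it works on two in-memory sample strings rather than on STDIN.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK LEAVE_SRC);

# Hypothetical helper: try each encoding in order and return the first
# successful decoding, or undef if none of them accepts the octets.
sub decode_with_fallbacks {
    my ($octets, @encodings) = @_;
    for my $name (@encodings) {
        my $text = eval { decode($name, $octets, FB_CROAK() | LEAVE_SRC()) };
        return $text if defined $text;
    }
    return;
}

# Encoding names come from the command line; the default is an assumption.
my @encodings = @ARGV ? @ARGV : qw(UTF-8 CP1252);

# Reject misspelled names up front; inside the eval they would only
# look like one more failed attempt.
for my $name (@encodings) {
    find_encoding($name) or die "unknown encoding: $name\n";
}

binmode(STDOUT, ":encoding(UTF-8)") || die "can't binmode STDOUT";

# Sample octets: the first is Latin-1/CP1252, the second is UTF-8.
for my $octets ("conte\xFAdos", "n\xC3\xA3o") {
    my $text = decode_with_fallbacks($octets, @encodings);
    die "undecodable in any of: @encodings\n" unless defined $text;
    print "$text\n";
}
```

Order matters here: strict UTF-8 should come first, because a single-byte encoding such as ISO-8859-1 accepts any octet sequence and would never let a later entry be tried.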

2011-05-05 19:23:28 tchrist

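Decoding from a specific encoding when decoding from UTF-8 fails is very good. That way you don't end up mixing Unicode and legacy strings, but homogenize everything to Unicode. – Lumi 2011-05-05 21:39:09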
