Perl: utf8::decode vs. Encode::decode

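I am getting some interesting results while trying to discern the differences between Encode::decode("utf8", $var) and utf8::decode($var). I have already discovered that calling the former multiple times on a variable eventually results in an error, "Cannot decode string with wide characters at...", whereas the latter will happily run as many times as you like, simply returning false.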
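A minimal sketch of that difference, assuming a string that already contains a wide character (the variable names are illustrative and not part of the script below):

```perl
use strict;
use warnings;
use Encode ();

# A character string that already holds a code point above 0xFF.
my $wide = "\x{8acb}\n";

# utf8::decode works in place and reports failure through its return value.
my $copy = $wide;
my $ok   = utf8::decode($copy);

# Encode::decode throws an exception for the same input.
my $err = '';
eval { Encode::decode('utf8', $wide); 1 } or $err = $@;

print 'utf8::decode returned ', ($ok ? 'true' : 'false'), "\n";
print "Encode::decode died: $err" if $err;
```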

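What I am having trouble understanding is why the length function returns different results depending on which decoding method is used. The problem arises because I am dealing with "doubly encoded" utf8 text from an external file. To demonstrate the issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00e8, U+00ab, U+0086, U+000a. These Unicode characters are the double encoding of the Unicode character U+8acb, followed by a newline character. The file was encoded to disk in UTF8. I then run the following perl script: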

#!/usr/bin/perl                                   
use strict; 
use warnings; 
require "Encode.pm"; 
require "utf8.pm"; 

open FILE, "test.txt" or die $!; 
my @lines = <FILE>; 
my $test = $lines[0]; 

print "Length: " . (length $test) . "\n"; 
print "utf8 flag: " . utf8::is_utf8($test) . "\n"; 
my @unicode = (unpack('U*', $test)); 
print "Unicode:\n@unicode\n"; 
my @hex = (unpack('H*', $test)); 
print "Hex:\n@hex\n"; 

print "==============\n"; 

$test = Encode::decode("utf8", $test); 
print "Length: " . (length $test) . "\n"; 
print "utf8 flag: " . utf8::is_utf8($test) . "\n"; 
@unicode = (unpack('U*', $test)); 
print "Unicode:\n@unicode\n"; 
@hex = (unpack('H*', $test)); 
print "Hex:\n@hex\n"; 

print "==============\n"; 

$test = Encode::decode("utf8", $test); 
print "Length: " . (length $test) . "\n"; 
print "utf8 flag: " . utf8::is_utf8($test) . "\n"; 
@unicode = (unpack('U*', $test)); 
print "Unicode:\n@unicode\n"; 
@hex = (unpack('H*', $test)); 
print "Hex:\n@hex\n";

This gives the following output:

Length: 7 
utf8 flag: 
Unicode: 
195 168 194 171 194 139 10 
Hex: 
c3a8c2abc28b0a 
============== 
Length: 4 
utf8 flag: 1 
Unicode: 
232 171 139 10 
Hex: 
c3a8c2abc28b0a 
============== 
Length: 2 
utf8 flag: 1 
Unicode: 
35531 10 
Hex: 
e8ab8b0a

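This is what I would expect. The length is originally 7 because perl thinks that $test is just a series of bytes. After decoding once, perl knows that $test is a series of utf8-encoded characters (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect, since Encode::decode took the 4 code points and interpreted them as utf8-encoded bytes, resulting in 2 characters. The strange thing happens when I modify the code to call utf8::decode instead (replacing every $test = Encode::decode("utf8", $test) with utf8::decode($test)).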

This gives almost identical output; only the result of length differs:

 
Length: 7 
utf8 flag: 
Unicode: 
195 168 194 171 194 139 10 
Hex: 
c3a8c2abc28b0a 
============== 
Length: 4 
utf8 flag: 1 
Unicode: 
232 171 139 10 
Hex: 
c3a8c2abc28b0a 
============== 
Length: 4 
utf8 flag: 1 
Unicode: 
35531 10 
Hex: 
e8ab8b0a

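It seems as if perl first counts bytes before decoding (as expected), then counts characters after the first decoding (as expected), but then counts bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?

Thanks,
Matt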

Source

2010-12-02 Matt

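Why do you require the modules instead of use-ing them? – 2010-12-02 21:08:50

I didn't use utf8 because doing so tells perl that your code itself is utf8-encoded, which I don't need (http://perldoc.perl.org/utf8.html). I suppose I could have use-d Encode, but I happened not to. – Matt 2010-12-02 21:41:36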

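You should not use the functions from the utf8 pragma module. Its documentation says so:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

Always use the Encode module, and also see the question "Checklist for going the Unicode way with Perl". unpack is too low-level; it does not even give you error checking.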
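As a sketch of what an Encode-only approach to the question's doubly encoded data could look like, here is one possible way to undo the double encoding with explicit error checking. The octets are copied from the first hex dump in the question; the intermediate ISO-8859-1 step and all variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets taken from the first hex dump in the question (doubly encoded text).
my $double = "\xc3\xa8\xc2\xab\xc2\x8b\x0a";

# LEAVE_SRC keeps Encode from modifying its input when a CHECK value is given.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# First pass: 7 octets become 4 characters, all below U+0100.
my $once = decode('UTF-8', $double, $check);

# Those characters are really octets; map each one back to a single byte.
my $octets = encode('ISO-8859-1', $once, $check);

# Second pass: 4 octets become 2 characters.
my $text = decode('UTF-8', $octets, $check);

printf "%d characters: %s\n", length($text),
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
```

With FB_CROAK, any malformed input raises an exception instead of passing through silently, which is the error checking that unpack lacks.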

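You are going wrong with the assumption that the octets E8 AB 86 0A are the result of UTF-8 double-encoding the characters 諆 and newline. They are the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.

length is inappropriately overloaded: at certain times it determines the length in characters, at others the length in octets. Use better tools such as Devel::Peek.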
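The single versus double encoding can be checked with Encode::encode. This is a small sketch with illustrative variable names; the second encode call treats each octet of the first result as a character below U+0100, which is how double encoding typically comes about.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The two characters in question: U+8AC6 and newline.
my $chars = "\x{8ac6}\n";

# Encoding once yields the four octets E8 AB 86 0A.
my $once = encode('UTF-8', $chars);

# Encoding those octets again is what double encoding looks like.
my $twice = encode('UTF-8', $once);

printf "once:  %s\n", unpack('H*', $once);
printf "twice: %s\n", unpack('H*', $twice);
```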

#!/usr/bin/env perl 
use strict; 
use warnings FATAL => 'all'; 
use Devel::Peek qw(Dump); 
use Encode qw(decode); 

my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}"; 
# or read the octets without implicit decoding from a file, does not matter 

Dump $test; 
# FLAGS = (PADMY,POK,pPOK) 
# PV = 0x8d8520 "\350\253\206\n"\0 

$test = decode('UTF-8', $test, Encode::FB_CROAK); 
Dump $test; 
# FLAGS = (PADMY,POK,pPOK,UTF8) 
# PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
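For comparison with the Dump output, a minimal sketch of what length reports for the same data before and after decoding. The variable names are illustrative, and LEAVE_SRC is added here only so that the octet string stays intact for the comparison.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same four octets as $test above.
my $octets = "\xe8\xab\x86\x0a";

# Decode into a separate variable; LEAVE_SRC stops decode from consuming $octets.
my $chars = decode('UTF-8', $octets,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "octets: length %d, utf8 flag %d\n",
    length($octets), utf8::is_utf8($octets) ? 1 : 0;
printf "chars:  length %d, utf8 flag %d\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0;
```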

Source

2010-12-03 14:04:04 daxim

It turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.
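To see whether a particular perl still shows the symptom from the question, something like the following could be used. It is only a sketch: it assumes the symptom is length disagreeing with the real character count after a second utf8::decode, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Doubly encoded octets from the question's first hex dump.
my $s = "\xc3\xa8\xc2\xab\xc2\x8b\x0a";

utf8::decode($s);
my $first = length $s;    # querying length here mirrors the original script

utf8::decode($s);
my $second = length $s;

# Count the characters independently of length().
my @chars   = split //, $s;
my $counted = scalar @chars;

printf "length after first decode: %d\n", $first;
printf "length after second decode: %d (characters counted: %d)\n",
    $second, $counted;
print $second == $counted
    ? "length is consistent\n"
    : "length disagrees with the character count\n";
```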

Source

2011-10-21 18:45:00 Matt
