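What is the correct way to retrieve an ISO-8859-1 encoded website with LWP::UserAgent?

I am retrieving an ISO-8859-1 encoded website with LWP::UserAgent, using the code below.

The problem is that special characters are not displayed correctly. In particular, the "€" sign comes out wrong.

The content encoding is detected as ISO-8859-1, which is correct.

To view the retrieved text, I save it to a file and open it with Notepad++.

Question: How can I retrieve ISO-8859-1 encoded special characters correctly?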

 

use LWP::UserAgent;
use HTTP::Request;
use Encode qw(decode encode);

#SENDING REQUEST 
my $ua = LWP::UserAgent->new(); 
$ua->agent('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1'); # pretend we are very capable browser 

my $req = HTTP::Request->new(GET => $url); 

#add some header fields 
$req->header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'); 
$req->header('Accept-Language', 'en;q=0.5'); 
$req->header('Connection', 'keep-alive'); 
$req->header('Host', 'www.url.com'); 

#SEND 
my $response = $ua->request($req); 

#decode trial1 
print $response->content_charset(); # gives ISO-8859-1 which is right 
my $content = $response->decoded_content(); #special chars are displayed wrong 

#decode trial2 
my $decContent = decode('ISO-8859-1', $response->content()); 
my $utf8Content = encode('utf-8', $decContent); #special char € is displayed as Â 

#decode trial3 
Encode::from_to($content, 'iso-8859-1', 'utf8'); #special char € is displayed as Â too 


#example on writing data to file 
open(MYOUTFILE, ">>D:\\encodingperl.html"); #open for append (">>" does not overwrite) 
print MYOUTFILE "$utf8Content"; #write text 
close(MYOUTFILE);

2012-11-04 Skip

You forgot to set the output encoding to UTF-8 by binmoding the output handle. – tchrist

The same way as for any other:

my $content = $response->decoded_content();

That said, the iso-8859-1 charset does not include the euro sign. You probably actually have cp1252. You can fix that as follows:

my $content = $response->decoded_content(charset => 'cp1252');
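To illustrate why the charset override matters, here is a minimal sketch that uses only the core Encode module. The byte string is a made-up sample, not data from the site in the question, and it assumes the server really sends the cp1252 byte 0x80 for the euro sign.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Format a byte string as space-separated hex pairs.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord } split //, shift }

# Hypothetical sample body: "5 " followed by byte 0x80 (the euro sign in cp1252).
my $bytes = "5 \x80";

# Decoded as iso-8859-1, byte 0x80 becomes the control character U+0080.
my $as_latin1 = decode('iso-8859-1', $bytes);

# Decoded as cp1252, byte 0x80 becomes U+20AC (EURO SIGN).
my $as_cp1252 = decode('cp1252', $bytes);

printf "iso-8859-1: U+%04X\n", ord(substr($as_latin1, -1));
printf "cp1252:     U+%04X\n", ord(substr($as_cp1252, -1));

# Re-encoding to UTF-8 makes the difference visible at the byte level.
my $utf8_latin1 = encode('UTF-8', $as_latin1);
my $utf8_cp1252 = encode('UTF-8', $as_cp1252);
print "UTF-8 via iso-8859-1: ", hex_bytes($utf8_latin1), "\n";
print "UTF-8 via cp1252:     ", hex_bytes($utf8_cp1252), "\n";
```

In the iso-8859-1 path the UTF-8 output contains the byte C2, which a Latin-1 viewer renders as "Â". That is consistent with the symptom described in the question.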

Your second problem is that you don't encode your output. This is how you would do it:

open(my $MYOUTFILE, '>>:encoding(cp1252)', 'D:\\encodingperl.html') 
    or die $!; 
print $MYOUTFILE $content;

If cp1252 is not what you want, use whichever encoding suits you (e.g. UTF-8). If you want the original file in its original encoding, use

my $content = $response->decoded_content(charset => 'none');

and

open(my $MYOUTFILE, '>>', 'D:\\encodingperl.html') 
    or die $!; 
binmode($MYOUTFILE); 
print $MYOUTFILE $content;
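As a rough sketch of the "encode on output" step with UTF-8 as the target, the snippet below writes a decoded string through an `:encoding(UTF-8)` layer and reads the raw bytes back. The sample text and the temporary file (standing in for D:\encodingperl.html) are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical decoded text containing a euro sign (U+20AC).
my $content = "Price: 5 \x{20AC}";

# A temporary file stands in for the real output path.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

# The :encoding(UTF-8) layer turns characters into bytes on output.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
print {$out} $content;
close($out) or die "Cannot close $path: $!";

# Read the file back as raw bytes to inspect what was actually written.
open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
my $written = do { local $/; <$in> };
close($in);

print join(' ', map { sprintf '%02X', ord } split //, $written), "\n";
```

The euro sign should appear in the file as the three bytes E2 82 AC, which an editor set to UTF-8 displays as "€".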

2012-11-04 03:57:14 ikegami

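I didn't realize 'decoded_content' had a 'charset' parameter; I was going to suggest using 'content' plus a manual 'Encode::decode'. Anyway, this is the correct answer, with the caveat that the encoding *could* be "ISO-8859-15" rather than "cp1252". – hobbs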

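@hobbs, I don't follow. The "€" of iso-8859-15 is 'A4'. If told the page is iso-8859-1, a browser will display 'A4' as "¤". The only way a browser shows "€" when told the page is iso-8859-1 is if it encounters '80', the "€" of cp1252. Keep in mind that cp1252 is a superset of iso-8859-1. – ikegami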
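The byte values mentioned in this comment can be checked with a short sketch using the core Encode module. It decodes the two bytes under each of the three charsets discussed and prints the resulting code points; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my @bytes    = ("\x80", "\xA4");
my @charsets = qw(iso-8859-1 iso-8859-15 cp1252);

# $code_point{charset}{hex byte} holds the decoded code point as "U+XXXX".
my %code_point;
for my $charset (@charsets) {
    for my $byte (@bytes) {
        my $hex  = sprintf '%02X', ord $byte;
        my $char = decode($charset, $byte);
        $code_point{$charset}{$hex} = sprintf 'U+%04X', ord $char;
        printf "%-11s byte %s -> %s\n", $charset, $hex, $code_point{$charset}{$hex};
    }
}
```

U+20AC is the euro sign and U+00A4 is the generic currency sign "¤", so only cp1252 maps byte 80 to "€", and only iso-8859-15 maps byte A4 to "€".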

@hobbs, never use 'content'. It doesn't undo things like gzip. If you want to decode manually, you can use 'my $content = decode(..., $response->decoded_content(charset => 'none'));'. – ikegami
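A small offline sketch of this point, assuming HTTP::Message (installed alongside LWP) is available: it builds a hypothetical gzip-compressed response by hand instead of fetching one, then compares 'content' with 'decoded_content(charset => 'none')'. The body and headers are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTTP::Response;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical cp1252 body, compressed the way a server might send it.
my $raw = "Price: 5 \x80";
my $gzipped;
gzip(\$raw => \$gzipped) or die "gzip failed: $GzipError";

# Hand-built response standing in for what $ua->request($req) would return.
my $response = HTTP::Response->new(200, 'OK', [
    'Content-Type'     => 'text/html; charset=ISO-8859-1',
    'Content-Encoding' => 'gzip',
], $gzipped);

# content() still returns the compressed bytes.
my $still_compressed = $response->content;

# charset => 'none' undoes the Content-Encoding but skips charset decoding.
my $bytes = $response->decoded_content(charset => 'none');

# Manual charset decoding, as suggested in the comment above.
my $text = decode('cp1252', $bytes);

printf "content: %d bytes, decoded_content: %d bytes\n",
    length($still_compressed), length($bytes);
printf "last character: U+%04X\n", ord(substr($text, -1));
```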

ISO-8859-1 does not have a euro sign. If you need the euro sign, you should use ISO-8859-15 or, better, UTF-8.

2012-11-04 00:55:10

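Thanks, but it is not my website, so I cannot change the ISO-8859-1 encoding. Somehow the browser does display it correctly even though it is not in ISO-8859-1. Is there a way to convert the whole string to UTF-8 while keeping the € sign? – Skip

Browsers display a euro sign at position 128 for ISO-8859-1. That is convenient but technically wrong. A possible workaround is to treat the ISO-8859-1 data as ISO-8859-15, replacing every euro sign at character 128 with 164 (the euro sign in ISO-8859-15), and then re-encode it as UTF-8. –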
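The workaround described in this comment could be sketched roughly as follows, using the core Encode module and a made-up sample byte string. Note that ISO-8859-15 differs from ISO-8859-1 at several other positions besides A4 (for example the fraction characters), so for real pages decoding as cp1252 is likely the simpler route.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "5 " followed by byte 0x80 (decimal 128).
my $bytes = "5 \x80";

# Replace byte 128 with byte 164, the euro sign in ISO-8859-15.
(my $fixed = $bytes) =~ tr/\x80/\xA4/;

# Decode as ISO-8859-15, then re-encode as UTF-8.
my $text = decode('iso-8859-15', $fixed);
my $utf8 = encode('UTF-8', $text);

# For comparison: decoding the untouched bytes as cp1252.
my $via_cp1252 = decode('cp1252', $bytes);

print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";
print "same result as cp1252: ", ($text eq $via_cp1252 ? "yes" : "no"), "\n";
```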
