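I'm writing a Perl script that fetches HTML documents from many different websites and tries to extract data from them. I'm having trouble decoding these documents. Which Perl encoding corresponds to which HTML charset?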
I know how to read the charset from a meta tag, if there is one, and how to read it from the HTTP header if the header provides this information.
The result could be:
- UTF-8
- ISO-8859-1
- Shift_JIS
- windows-1252
and many more.
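For context, the charset lookup could be sketched roughly as follows. This is only a minimal illustration, not my actual getcharset: the name `charset_from`, its two arguments (the raw Content-Type header value and the undecoded body), and the 1024-byte search window are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the declared charset label, or undef if
# none is found. $content_type is the raw Content-Type header value and
# $html is the undecoded response body.
sub charset_from {
    my ($content_type, $html) = @_;

    # 1. HTTP header, e.g. "text/html; charset=Shift_JIS"
    if (defined $content_type
        && $content_type =~ /charset\s*=\s*["']?([^\s"';]+)/i) {
        return $1;
    }

    # 2. <meta charset="..."> or
    #    <meta http-equiv="Content-Type" content="...; charset=...">
    #    Only the start of the document is searched (assumed window size).
    my $head = substr($html // '', 0, 1024);
    if ($head =~ /<meta\b[^>]*?\bcharset\s*=\s*["']?([^\s"';\/>]+)/i) {
        return $1;
    }

    return;    # nothing declared
}
```

The header is checked first here, which is a design choice of the sketch; a regex scan like this is not a real HTML parser and will miss unusual markup.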
With this knowledge I want to decode the document in my Perl script:
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use Encode;
use Encode::JP;
# Maybe also use other extensions for Encode

my $url = shift @ARGV;    # $url is the document's URL
my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);
if ($response->is_success) {
    my $charset = getcharset($response);
    # getcharset is a self-written subroutine that reads the charset
    # from a meta tag or from the HTTP header (not shown in this example)

    # Now I know the document's charset and want to find its encoding:
    my $encoding = 'utf-8'; # default
    if ($charset eq 'utf-8') {
        $encoding = 'utf-8'; # Here $encoding and $charset are equal
    }
    elsif ($charset eq 'Shift_JIS') {
        $encoding = 'shiftjis'; # Here $encoding and $charset are not equal
    }
    elsif ($charset eq 'windows-1252') {
        # Here I have no idea what $encoding should be, since there is no
        # encoding in the documentation that contains the string "windows"
    }
    elsif ($charset eq 'any other character set') {
        # $encoding = ???   (unknown: this is what I am asking about)
    }
    my $content = decode($encoding, $response->content);
    # Extract data from $content
}
But I can't find the correct encoding for some of the charsets that exist in the wild.
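One direction that may avoid the hand-written if/elsif chain is to let Encode resolve the charset label itself through its alias table, using `find_encoding`. The sketch below assumes a stock Encode installation; the helper name `encoding_for` and the UTF-8 fallback are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: map a charset label taken from HTTP/HTML to an
# Encode encoding object. find_encoding() consults Encode's aliases and
# returns undef for labels it does not know.
sub encoding_for {
    my ($charset) = @_;
    my $enc = defined $charset ? find_encoding($charset) : undef;
    return $enc // find_encoding('utf-8');    # fallback is an assumption
}

# Print the canonical Encode name for each label from the list above.
for my $charset (qw(UTF-8 ISO-8859-1 Shift_JIS windows-1252)) {
    printf "%-12s => %s\n", $charset, encoding_for($charset)->name;
}
```

With this approach, `windows-1252` should resolve to Encode's `cp1252` and `Shift_JIS` to `shiftjis`, and the returned object can decode directly, as in `encoding_for($charset)->decode($response->content)`. It may also be worth checking whether `$response->decoded_content` from LWP already covers this case, since it applies the declared charset itself.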
You should prefer `use warnings` over `-w` on the shebang line – Borodin