Doctype sniffing with CSS3, specifically with Mojo::DOM

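Can I use Mojo::DOM and its CSS3 selectors to figure out the DOCTYPE of an HTML document? This relates to my other question, "How should I process HTML META tags with Mojo::UserAgent?". I want to set the character set of the document, so I need to know what to look at, and doctype sniffing seems to be the way to find out. HTML and HTML 5 use different META tags to declare the charset inside the document, for the cases where the document's setting overrides the server's setting (or the lack of one).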

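I have no problem getting the task done, because I can grab the raw response and use a regex to look at the DOCTYPE. Since the browser DOMs seem to be able to get the DOCTYPE, I've become infected with the idea that I should be able to get it too. However, the lack of examples makes me think that no one does it the way I think I should.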

I've tried many silly approaches, but my CSS kung fu is weak:

use v5.20; 

use feature qw(signatures); 
no warnings qw(experimental::signatures); 

use Mojo::DOM; 

my $html = do { local $/; <DATA> }; 

my $dom = Mojo::DOM->new($html); 

say "<title> is => ", $dom->find('head title')->map('text')->each; 

say "Doctype with find is => ", $dom->find('!doctype')->map('text')->each; 

say "Doctype with nodes is => ", $dom->[0]; 

__DATA__ 

<!DOCTYPE html> 
<head> 
<title>This is a title</title> 
</head> 
<body> 
<h1>Level 1</h1> 
</body> 
</html>

When I dump the $dom object, I can see the DOCTYPE in the tree:

$VAR1 = bless(do{\(my $o = bless({ 
         'tree' => [ 
            'root', 
            [ 
            'text', 
            '', 
            ${$VAR1}->{'tree'} 
            ], 
            [ 
            'doctype', 
            ' html', 
            ${$VAR1}->{'tree'} 
            ],

Now, how do I get at it?


2015-09-09 brian d foy

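There doesn't seem to be a built-in way to do this, since the doctype is stored at the top level of the tree. The best I can come up with is: `my %ugly_tree = @{ $dom->root->tree }; my ($doctype) = $ugly_tree{doctype};` but this is probably fragile. – xxfelixxx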

The doctype is just another node, and you can find such nodes with methods like [child_nodes](http://mojolicio.us/perldoc/Mojo/DOM#child_nodes). –
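The child_nodes suggestion above could be sketched roughly as follows. This is a minimal sketch and not a confirmed recipe from the thread: it assumes a recent Mojolicious in which the `type` method of Mojo::DOM reports the node type (older releases used a different method name), and the sample HTML is made up for illustration.

```perl
use v5.20;
use Mojo::DOM;

# Sample document (an assumption); stands in for a fetched response body.
my $html = <<'HTML';
<!DOCTYPE html>
<html>
<head><title>This is a title</title></head>
<body><h1>Level 1</h1></body>
</html>
HTML

my $dom = Mojo::DOM->new($html);

# The doctype sits directly under the root node, so scan the top-level
# nodes for the first one whose node type is 'doctype'.
my $doctype = $dom->child_nodes->first(sub { $_->type eq 'doctype' });

if ($doctype) {
    # For non-element nodes, content returns the raw node content.
    say "Doctype content is => '", $doctype->content, "'";
}
else {
    say "No doctype found";
}
```

The content is whatever followed `<!DOCTYPE` in the source, so it may carry leading whitespace and should be trimmed before comparing.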

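I still hold out hope that there's a better way to do this, but maybe I'm putting too much responsibility on Mojo::UserAgent. I can build a transaction and add a finish event to the response. In that event handler, I sniff the content with a regex and add an X- header carrying the doctype. I could probably pass the information along some other way, but that's not the point (I'm still taking suggestions, though!).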

use v5.14; 

use Mojo::UserAgent; 

@ARGV = qw(http://blogs.perl.org); 

my $ua = Mojo::UserAgent->new; 

my $tx = $ua->build_tx(GET => $ARGV[0]); 
$tx->res->on(finish => sub {
    my $res = shift;
    my($doctype) = $res->body =~ m/\A \s* (<!DOCTYPE.*?>)/isx;
    if($doctype) {
        say "Found doctype => $doctype";
        $res->headers->header('X-doctype', $doctype);
    }
});
$tx = $ua->start($tx); 

say "-----Headers-----"; 
say $tx->res->headers->to_string =~ s/\R+/\n/rg;

Here's the output:

Found doctype => <!DOCTYPE html> 
-----Headers----- 
Connection: Keep-Alive 
Server: Apache/2.2.12 (Ubuntu) 
Content-Type: text/html 
Content-Length: 20624 
Accept-Ranges: bytes 
X-doctype: <!DOCTYPE html> 
Last-Modified: Wed, 16 Sep 2015 13:08:26 GMT 
ETag: "26d42e8-5090-51fdcfe768680" 
Date: Wed, 16 Sep 2015 13:40:02 GMT 
Keep-Alive: timeout=15, max=100 
Vary: Accept-Encoding

Now I have to work out how to parse the DOCTYPE value and make decisions based on what it contains.
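One way that parsing step might look is sketched below. Everything here is an assumption for illustration: `classify_doctype` is a hypothetical helper, and the labels and patterns are one coarse choice, not a complete treatment of every doctype in the wild.

```perl
use v5.20;

# Hypothetical helper: map a raw doctype string to a coarse label.
sub classify_doctype {
    my ($doctype) = @_;
    return 'none' unless defined $doctype && length $doctype;

    # HTML5: "<!DOCTYPE html>" with no public or system identifier.
    return 'html5' if $doctype =~ m/\A <!DOCTYPE \s+ html \s* > \z/isx;

    # Older doctypes carry a public identifier that names the version.
    return 'xhtml' if $doctype =~ m{-//W3C//DTD \s+ XHTML}isx;
    return 'html4' if $doctype =~ m{-//W3C//DTD \s+ HTML \s+ 4}isx;

    return 'unknown';
}

# Try it on a few common doctypes.
for my $doctype (
    '<!DOCTYPE html>',
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">',
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
) {
    say classify_doctype($doctype), " <= $doctype";
}
```

The label could then decide which META tag style to look for first, for example `meta[charset]` for `html5` and the `http-equiv` form for the older types.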


2015-09-16 13:44:33

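Detecting the encoding of an HTML5 document is very complex. I'm afraid Mojo::DOM is only a fragment parser, so we've decided that fully implementing the encoding sniffing algorithm would be out of scope. Most of the web is UTF-8 encoded, which I guess is why this question doesn't come up very often.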


2015-09-17 21:34:46

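My problem is a bit easier: I just want to determine the charset they claim to use, even if it's wrong. There's one particularly misconfigured web server that prompted this question. :) –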
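Reading the charset a document claims, without validating it, might look something like the sketch below. `declared_charset` is a hypothetical helper and not part of Mojo::DOM. It checks the HTML5 `<meta charset>` form first and then the older `http-equiv` form, and it makes no attempt at the full sniffing algorithm mentioned in the answer.

```perl
use v5.20;
use Mojo::DOM;

# Hypothetical helper: return the charset a document declares in its META
# tags, or undef if it declares none. The value is not validated.
sub declared_charset {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);

    # HTML5 style: <meta charset="utf-8">
    if (my $meta = $dom->at('meta[charset]')) {
        return $meta->attr('charset');
    }

    # Older style: <meta http-equiv="Content-Type" content="text/html; charset=...">
    my $meta = $dom->find('meta[http-equiv]')
        ->first(sub { lc($_->attr('http-equiv') // '') eq 'content-type' });
    return undef unless $meta;

    # Pull the charset parameter out of the content attribute, if present.
    my ($charset) = ($meta->attr('content') // '')
        =~ m/charset \s* = \s* ["']? ([^\s;"']+)/ix;
    return $charset;
}

say declared_charset('<meta charset="utf-8">') // 'none';
say declared_charset('<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">') // 'none';
say declared_charset('<p>No META tags here</p>') // 'none';
```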
