Getting the contents of HTML tags in Perl using MyParser

I have an HTML file like the following:

<!DOCTYPE html 
    PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US"> 
<head> 
<title></title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 
</head> 
<body bgcolor="white"> 

<h1>foo.c</h1> 

<form method="post" action="" 
     enctype="application/x-www-form-urlencoded"> 
    Compare this file to the similar file: 
    <select name="file2"> 

    <option value="...">...</option> 


    </select> 
    <input type="hidden" name="file1" value="foo.c" /><br> 
    Show the results in this format: 
</form> 
<hr> 

<p> 
<pre> 
some code 
</pre>

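I need the value of the input with name="file1" and the contents of the HTML pre tag. I don't know Perl; by googling I wrote this small program (which I believe isn't "elegant"):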

#!/usr/bin/perl 

package MyParser; 
use base qw(HTML::Parser); 

#Store the file name and contents obtaind from HTML Tags 
my($filename, $file_contents); 

#This value is set at start() calls 
#and use in text() routine.. 
my($g_tagname, $g_attr); 


#Process tag itself and its attributes 
sub start { 
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_; 

    $g_tagname = $tagname; 
    $g_attr = $attr; 
} 

#Process HTML tag body 
sub text { 
    my ($self, $text) = @_; 

    #Gets the filename 
    if($g_tagname eq "input" and $g_attr->{'name'} eq "file1") { 
    $filename = $attr->{'value'}; 
    } 

    #Gets the filecontents 
    if($g_tagname eq "pre") { 
    $file_contents = $text; 
    } 
} 

package main; 

#read $filename file contents and returns 
#note: it works only for text/plain files. 
sub read_file { 
    my($filename) = @_; 
    open FILE, $filename or die $!; 
    my ($buf, $data, $n); 
    while((read FILE, $data, 256) != 0) { 
    $buf .= $data; 
    } 
    return ($buf); 
} 


my $curr_filename = $ARGV[0]; 
my $curr_file_contents = read_file($curr_filename); 

my $parser = MyParser->new; 
$parser->parse($curr_file_contents); 

print "filename: ",$filename,"file contents: ",$file_contents;

Then I run ./foo.pl html.html, but I get empty values from the $filename and $file_contents variables.

How can I fix this? Suggestions for improving the code are much appreciated.


2012-11-18 Jack

Always use 'use strict; use warnings;'. – ikegami
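One thing 'use strict' would have flagged in the question's code: text() reads $attr->{'value'}, but $attr only exists inside start(), so $filename is never set. In addition, any text that follows </pre> overwrites $file_contents, because $g_tagname is still "pre" at that point. Below is a minimal sketch of a corrected subclass, assuming HTML::Parser's version 2 style callback methods (start, text, end), which are active when new is called without arguments. The extract_fields helper and the hash keys filename, in_pre and file_contents are names chosen for this illustration; they are not part of HTML::Parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package MyParser;
use parent qw(HTML::Parser);

# Start tags carry the attributes, so the input's value is read here.
sub start {
    my ($self, $tagname, $attr) = @_;
    if ($tagname eq 'input' and ($attr->{name} // '') eq 'file1') {
        $self->{filename} = $attr->{value};
    }
    $self->{in_pre} = 1 if $tagname eq 'pre';
}

# Collect text only while inside <pre>; text may arrive in several chunks.
sub text {
    my ($self, $text) = @_;
    $self->{file_contents} .= $text if $self->{in_pre};
}

# Leaving <pre> stops the collection, so later text cannot overwrite it.
sub end {
    my ($self, $tagname) = @_;
    $self->{in_pre} = 0 if $tagname eq 'pre';
}

package main;

# Hypothetical helper: parse an HTML string, return (filename, contents).
sub extract_fields {
    my ($html) = @_;
    my $parser = MyParser->new;
    $parser->parse($html);
    $parser->eof;    # tell the parser the input is complete
    return ($parser->{filename}, $parser->{file_contents});
}
```

With this helper, the main program in the question could call my ($filename, $file_contents) = extract_fields($curr_file_contents); instead of relying on package-level variables.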

As usual, there's more than one way to do it. Here's how to use Mojolicious's DOM parser for this task:

#!/usr/bin/env perl 

use strict; 
use warnings; 
use Mojo::DOM; 

# slurp all lines at once into the DOM parser 
my $dom = Mojo::DOM->new(do { local $/; <> }); 

print $dom->at('input[name=file1]')->attr('value'); 
print $dom->at('pre')->text;

Output:

foo.c 
some code


2012-11-18 10:12:50 memowe

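Since the OP has an input file given as an argument, using the "magic open" diamond operator ('do { local $/; <> }') or 'Mojo::Util::slurp($ARGV[0])' would make more sense here. Otherwise, nice demonstration!

Thanks, @JoelBerger, corrected. :) – memowe

@memowe Thanks for your answer, it just helped me solve a similar problem, but there is a typo in your answer: it should be ->attr('value') rather than attrs. Regards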

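Unless you're writing your own parsing module or doing something tricky, you usually don't need to use raw HTML::Parser. In this case, HTML::TreeBuilder, which is a subclass of HTML::Parser, is the easiest to use.

Also, note that HTML::Parser has a parse_file method (and HTML::TreeBuilder makes it even easier with a new_from_file method), so you don't have to do all of this read_file business (and besides, there are better ways to do it than the one you picked, including File::Slurp and the old do { local $/; <$handle> } trick).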

use HTML::TreeBuilder; 

# Path of the HTML file to parse (named separately from the extracted $filename below)
my $html_file = $ARGV[0]; 
my $tree = HTML::TreeBuilder->new_from_file($html_file); 

my $filename = $tree->look_down(
    _tag => 'input', 
    type => 'hidden', 
    name => 'file1' 
)->attr('value'); 

my $file_contents = $tree->look_down(_tag => 'pre')->as_trimmed_text; 

print "filename: $filename\nfile contents: $file_contents\n";

For information on look_down, attr, and as_trimmed_text, see the HTML::Element documentation; HTML::TreeBuilder both is an element and works with elements.
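For cases where the raw file contents are still needed, the do { local $/; <$handle> } trick mentioned above can be sketched as follows, using only core Perl. The slurp_file name is a hypothetical helper chosen for this illustration; it stands in for the read_file routine in the question and uses a three-argument open with a lexical filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole file as one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    # With $/ undefined inside this block, <$fh> reads to end of file.
    my $content = do { local $/; <$fh> };
    close $fh;
    return $content;
}
```

Because local restores $/ when the do block ends, the rest of the program keeps reading line by line as usual.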


2012-11-18 06:45:42 hobbs

Using XPath with the HTML::TreeBuilder::XPath Perl module (very few lines):

#!/usr/bin/env perl 
use strict; use warnings; 
use HTML::TreeBuilder::XPath; 

my $tree = HTML::TreeBuilder::XPath->new_from_content(<>); 
print $tree->findvalue('//input[@name="file1"]/@value'); 
print $tree->findvalue('//pre/text()');

USAGE

./script.pl file.html

OUTPUT

foo.c 
some code

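NOTES

Previously I used the HTML::TreeBuilder module for web scraping. Now I can't go back to that complexity. HTML::TreeBuilder::XPath does all the magic with handy XPath expressions.
You can use the new_from_file method to open a file or filehandle instead of new_from_content. Using <> this way is allowed here because HTML::TreeBuilder::new_from_content() specifically allows reading multiple lines like this; see perldoc HTML::TreeBuilder (HTML::TreeBuilder::XPath inherits its methods from HTML::TreeBuilder). Most constructors will not allow this usage. You should provide a single scalar, or use another method.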
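To illustrate why <> cannot be passed to most constructors, here is a small core-Perl sketch. In an argument list, <$fh> is evaluated in list context and yields one argument per line, so a function that reads only its first argument sees only the first line. The takes_one_scalar function is a hypothetical stand-in for such a constructor, and the in-memory filehandle stands in for the file named on the command line.

```perl
use strict;
use warnings;

# Hypothetical constructor-like function that expects ONE scalar argument.
sub takes_one_scalar {
    my ($content) = @_;    # any further arguments are silently ignored
    return $content;
}

my $html = "<p>\nline two\n</p>\n";

# List context: <$fh> expands to one argument per line.
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";
my $first_only = takes_one_scalar(<$fh>);
close $fh;

# Slurp mode: the whole content is passed as a single argument.
open $fh, '<', \$html or die "Cannot open in-memory handle: $!";
my $everything = takes_one_scalar(do { local $/; <$fh> });
close $fh;

print "first_only: $first_only";
print "everything: $everything";
```

Here $first_only holds only the first line, while $everything holds the full document.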


2012-11-18 07:58:21

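Added a _usage_ section and replaced the _DATA trick_ with the _diamond operator_ to open the file given as an argument.

Note for future readers: using '<>' in this way is allowed because 'HTML::TreeBuilder::new_from_content()' specifically allows reading multiple lines this way. Most constructors will not allow this usage and would need 'do { local $/; <> }' to read the whole content into one variable (argument).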
