How to parse multi-line HTML with regular expressions in Perl

I am trying to parse a multi-line string with Perl, but I only get the number of matches. Here is a sample of what I am parsing:

<div id="content-ZAJ9E" class="content"> 
     Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. 
</div>

I am trying to get the content stored in a string using this code:

@a = ($html =~ m/class="content">.*<\/div>/gs); 
print "array A, size: ", @a+0, ", elements: "; 
print join (" ", @a); 
print "\n";

but it returns the whole match rather than just the text inside the div. Can someone point out the error in my regular expression?

Marisa

Source

2012-06-21 Marisa Giancarla

If one of the answers solved your problem, please [accept](http://stackoverflow.com/faq#howtoask) it so that others can see it was helpful. – simbabque

You are only matching the string; you are not capturing any part of it. If you want the text inside the div, you should say:

if ($html =~ m/class="content">(.*)<\/div>/s) {
    my $text = $1; # $1 is only meaningful after a successful match
    print $text;
}

Your match will be stored in the $1 variable. If there are multiple instances of such a div[class=content], you need a loop like this:

use strict; use warnings; 
use Data::Dumper; 

my $html = qq~<div id="content-ZAJ9E" class="content"> 
     Wow, I love the new top bar. 
</div> 
<div id="content-ZAJ9E" class="content"> 
     I still love it. 
</div> 
<div id="content-ZAJ9E" class="content"> 
     I cant get enough! 
</div> 
~; 

my @matches; 
# *? makes it non-greedy so it will only match to the first </div> 
while ($html =~ m/class="content">(.*?)<\/div>/gs){ 
    my $group = $1;  
    $group =~ s/^\s+//; # strip whitespace at the beginning 
    $group =~ s/\s+$//; # and the end 

    push @matches, $group; 
} 
print Dumper \@matches;
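
The comment about non-greedy matching in the loop above is easier to see side by side. This is only a minimal sketch, with a shortened two-div input made up for illustration; the variable names $greedy and $lazy are not part of the original code.

```perl
use strict;
use warnings;

# Shortened sample input with two content divs (illustrative only).
my $html = <<'HTML';
<div class="content">first</div>
<div class="content">second</div>
HTML

# Greedy: .* runs on to the LAST </div>, so it swallows both divs.
my ($greedy) = $html =~ m/class="content">(.*)<\/div>/s;

# Non-greedy: .*? stops at the FIRST </div> it can reach.
my ($lazy) = $html =~ m/class="content">(.*?)<\/div>/s;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture contains everything from "first" through "second", including the tags in between, while the non-greedy one contains just "first".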

I suggest you have a look at perlre and perlretut.
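
One behaviour covered in those documents is worth a sketch here: in list context, a match with /g and a capture group returns all the captured substrings at once, so for simple cases the array assignment from the question can work without an explicit loop. The sample input and the name @texts below are made up for illustration, and the \s* around the group is just one possible way to trim whitespace.

```perl
use strict;
use warnings;

# Illustrative input; the ids are made up.
my $html = <<'HTML';
<div id="a" class="content">
     Wow, I love the new top bar.
</div>
<div id="b" class="content">
     I still love it.
</div>
HTML

# In list context, m//g with one capture group returns every captured
# substring. The \s* on both sides of the group keeps leading and
# trailing whitespace out of the capture.
my @texts = $html =~ m/class="content">\s*(.*?)\s*<\/div>/gs;

print scalar(@texts), " matches\n";
print "$_\n" for @texts;
```

Without the parentheses in the pattern, the same assignment would return each whole match, including the class attribute and the closing tag.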

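Some notes:

Always use strict and use warnings!
Try Data::Dumper; it is great for debugging variables.
Using regular expressions to parse HTML is not the best idea. If you are doing a lot of parsing, consider one of the modules available on CPAN, such as HTML::Parser, HTML::TreeBuilder::XPath, HTML::TokeParser::Simple, or Mojo::DOM, or search for it on SO.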

Source

2012-06-21 13:32:00 simbabque

Thanks for the fix, Sinan. I was in a bit of a hurry. :) – simbabque

That did it, thanks! –

Use a robust HTML parser:

use strictures; 
use Web::Query qw(); 
my $w = Web::Query->new_from_html(<<'HTML'); 
<div id="content-ZAJ9E" class="content"> 
     Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. 
</div> 
HTML
$w->find('div.content')->text

The expression returns Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.

Source

2012-06-21 14:06:52 daxim

Use something designed to parse HTML, such as HTML::TreeBuilder::XPath:

#!/usr/bin/env perl 

use strict; use warnings; 
use 5.014; 
use HTML::TreeBuilder::XPath; 
use YAML; 

my $doc =<<EO_HTML; 
<div id="content-ZAJ9E" class="content"> 
<!-- begin <div> --> 
     Wow, I love the new top bar, so much easier to navigate now :) 
     Anywho, got a few other fixes I am working on as well. :) 
     I hope you all like the new look. 
<!-- end </div> --> 
<span class="extra">Here I am</span> 
</div> 
EO_HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->store_comments(1);
$tree->parse($doc);
$tree->eof; # signal end of input so the tree is finalized

print Dump [ $tree->findvalues('//div[@class="content"]') ]; 
print Dump [ $tree->findvalues('//*[@class="extra"]') ]; 
print Dump [ $tree->findvalues('//comment()') ];

Note the capabilities you gain by not relying on a home-made regular expression pattern to handle all the variations in the input.

Output:

--- 
- ' Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. Here I am ' 
--- 
- Here I am 
--- 
- ' begin <div> ' 
- ' end </div> '
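
To make the point about input variations concrete, here is a rough sketch that runs the pattern from the regex-based answer against a few equally valid ways of writing the same div. The variant names and sample strings are hypothetical and made up for this illustration; only core Perl is used.

```perl
use strict;
use warnings;

# The pattern from the regex-based answer.
my $re = qr/class="content">(.*?)<\/div>/s;

# Hypothetical input variations (made up for illustration).
my %variants = (
    original      => qq{<div id="x" class="content">text</div>},
    single_quotes => qq{<div id="x" class='content'>text</div>},
    attr_order    => qq{<div class="content" id="x">text</div>},
    nested_div    => qq{<div id="x" class="content">a<div>b</div>c</div>},
);

my %captured;
for my $name (sort keys %variants) {
    # A failed match returns an empty list, leaving the slot undefined.
    ($captured{$name}) = $variants{$name} =~ $re;
    printf "%-14s %s\n", $name,
        defined $captured{$name} ? "captured [$captured{$name}]" : "no match";
}
```

Only the original form yields the expected text. The single-quoted and reordered attributes do not match at all, and the nested div is cut off at the first closing tag, which is the kind of variation a real parser handles for you.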

Source

2012-06-21 14:13:46
