2012-06-21 47 views
1

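How to parse multi-line HTML with a regular expression in Perl

I want to parse out a multi-line string using Perl, but I only get the number of matches. Here is a sample of what I am parsing: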

<div id="content-ZAJ9E" class="content"> 
     Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. 
</div> 

I am trying to get the content stored in a string using this code:

@a = ($html =~ m/class="content">.*<\/div>/gs); 
print "array A, size: ", @a+0, ", elements: "; 
print join (" ", @a); 
print "\n"; 

but it returns the entire match rather than just the text inside the div. Can someone point out the error in my regex?

Marisa

+1

If one of the answers solved your problem, please [accept](http://stackoverflow.com/faq#howtoask) it so that others can see it was helpful. – simbabque

Answers

4

You are only matching the string, not capturing anything from it. If you want the text inside the div, add a capture group:

# Single match in scalar context; only use $1 if the match succeeded
if ($html =~ m/class="content">(.*)<\/div>/s) {
    my $text = $1;
    print $text;
}
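To illustrate why the code in the question returns the whole match, here is a minimal sketch of how a list-context match with /g behaves with and without a capture group. The shortened sample text and the variable names @whole and @inner are only for illustration.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<div id="content-ZAJ9E" class="content">
     Wow, I love the new top bar.
</div>
HTML

# No capture group: list-context /g returns each complete match.
my @whole = $html =~ m/class="content">.*?<\/div>/gs;

# One capture group: list-context /g returns only what the group captured.
my @inner = $html =~ m/class="content">(.*?)<\/div>/gs;

print "whole: [$whole[0]]\n";
print "inner: [$inner[0]]\n";
```

The first array holds the attribute and the closing tag along with the text, while the second holds only the text between them (including its surrounding whitespace).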

Your match will be stored in the $1 variable. If there are multiple instances of such a div[class=content], you need a loop like this:

use strict; use warnings; 
use Data::Dumper; 

my $html = qq~<div id="content-ZAJ9E" class="content"> 
     Wow, I love the new top bar. 
</div> 
<div id="content-ZAJ9E" class="content"> 
     I still love it. 
</div> 
<div id="content-ZAJ9E" class="content"> 
     I cant get enough! 
</div> 
~; 

my @matches; 
# *? makes it non-greedy so it will only match to the first </div> 
while ($html =~ m/class="content">(.*?)<\/div>/gs){ 
    my $group = $1;  
    $group =~ s/^\s+//; # strip whitespace at the beginning 
    $group =~ s/\s+$//; # and the end 

    push @matches, $group; 
} 
print Dumper \@matches; 
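As a possible alternative to the while loop, a list-context match with /g returns all captures at once, and the whitespace trimming can be folded into the pattern. This is only a sketch under the same assumptions as the loop above (no nested divs, exactly this attribute spelling); the ids content-1 and content-2 are made up for the example.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<div id="content-1" class="content">
     Wow, I love the new top bar.
</div>
<div id="content-2" class="content">
     I still love it.
</div>
HTML

# \s* on both sides of the lazy group keeps surrounding whitespace out of the capture.
my @matches = $html =~ m/class="content">\s*(.*?)\s*<\/div>/gs;

print scalar(@matches), " matches\n";
print "- $_\n" for @matches;
```

The loop form remains the better fit when each match needs more processing than a pattern can express.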

I suggest you take a look at perlre and perlretut.


Some notes:

+0

Thanks for the fix, Sinan, I was in a bit of a hurry. :) – simbabque

+0

That's it, thanks! –

7

Use a robust HTML parser:

use strictures; 
use Web::Query qw(); 
my $w = Web::Query->new_from_html(<<'HTML'); 
<div id="content-ZAJ9E" class="content"> 
     Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. 
</div> 
HTML 
$w->find('div.content')->text 

The expression returns: Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.

5

Use something designed to parse HTML, such as HTML::TreeBuilder::XPath:

#!/usr/bin/env perl 

use strict; use warnings; 
use 5.014; 
use HTML::TreeBuilder::XPath; 
use YAML; 

my $doc =<<EO_HTML; 
<div id="content-ZAJ9E" class="content"> 
<!-- begin <div> --> 
     Wow, I love the new top bar, so much easier to navigate now :) 
     Anywho, got a few other fixes I am working on as well. :) 
     I hope you all like the new look. 
<!-- end </div> --> 
<span class="extra">Here I am</span> 
</div> 
EO_HTML 

my $tree = HTML::TreeBuilder::XPath->new;
$tree->store_comments(1); # keep comments so they can be queried below
$tree->parse($doc);
$tree->eof;               # signal end of input so the tree is finalized

print Dump [ $tree->findvalues('//div[@class="content"]') ]; 
print Dump [ $tree->findvalues('//*[@class="extra"]') ]; 
print Dump [ $tree->findvalues('//comment()') ]; 

Note the capabilities you gain by not relying on a homemade regex pattern to handle all the variations in the input.

Output:

--- 
- ' Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. Here I am ' 
--- 
- Here I am 
--- 
- ' begin <div> ' 
- ' end </div> '
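To make the point about input variations concrete, the following sketch runs the regex from the earlier answer against a few hand-written variants of the same markup. The variant strings and the %variants and %captured names are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical variations of the same markup.
my %variants = (
    original      => qq{<div class="content">text</div>},
    single_quotes => qq{<div class='content'>text</div>},
    extra_class   => qq{<div class="content wide">text</div>},
    attr_after    => qq{<div class="content" id="x">text</div>},
    nested_div    => qq{<div class="content">text <div>inner</div> tail</div>},
);

my %captured;
for my $name (sort keys %variants) {
    # A failed match returns an empty list, leaving the entry undefined.
    ($captured{$name}) = $variants{$name} =~ m/class="content">(.*?)<\/div>/s;
    printf "%-14s %s\n", $name,
        defined $captured{$name} ? "captured: [$captured{$name}]" : "no match";
}
```

Only the original spelling yields the full text. The quoting and attribute variants do not match at all, and the nested div is cut off at the first closing tag, which is the kind of case a real parser handles for you.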