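Perl: get multiple lines from a text file between patterns

I have an HTML file containing data that I have to push to a MySQL database. I parse the HTML file to get the scalar values I need, but I ran into a problem when the data I need is not on a single line and instead spans several lines between certain patterns. This is what I have so far, and it sort of works: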

#!/usr/bin/perl
use strict;
use warnings;

binmode STDOUT, ':encoding(cp1250)';

my $file = 'index.html';
# Three-argument open with a lexical handle; low-precedence "or" so die can fire.
open my $fh, '<', $file or die "Could not open $file: $!";
my $word;
my $description;
my $origin;

while (my $line = <$fh>) {
    # Capture the heading text directly; the lookarounds are not needed.
    if ($line =~ m{<h2 class="featured">(.*)</h2>}) {
        $word = $1;
    }

    if ($line =~ m{<h4 class="related-posts">}) {
        print $line;
        # Only assign when the second match succeeds, so $1 is never stale.
        if ($line =~ m{<h4 class="related-posts"> <a href="\.\./tag/lacina/index\.html" rel="tag">(.*)</a></h4>}) {
            $origin = $1;
        }
    }
}
close $fh;

print "$word\n"   if defined $word;
print "$origin\n" if defined $origin;

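Now I want to grab several lines of text that are not necessarily inside a single tag, and I don't know how many lines there will be. All I know is that the lines sit between: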

<div class="post-content"> 

<p>text I want</p> 
<p>1.text I want</p> 
<p>2.text I want</p> 

<div class="box small arial">

Also, I would like to get rid of the <p> tags.

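I thought about reading a line, storing it in a scalar, then reading another line and comparing it with the most recently saved scalar. But how would I use that scalar to check whether I have everything I want?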

Source

2014-06-29 Lenny

Use the right tool for the job, not a regular expression.

use strict; 
use warnings; 
use feature 'say'; 
use HTML::TreeBuilder; 

my $tr = HTML::TreeBuilder->new_from_file('index.html'); 

for my $div ($tr->look_down(_tag => 'div', 'class' => 'post-content')) { 
    for my $t ($div->look_down(_tag => 'p')) { 
        say $t->as_text;
    } 
}

Output

text I want
1.text I want
2.text I want

Source

2014-06-29 18:37:47 hwnd

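Thanks! A simple and clear way to get what I need! – Lenny

One more thing: how can I use this script on a directory tree, so that it searches the subfolders and runs on every index.html file? – Lenny

You can use File::Find or grep to traverse the subfolders. Here is an [example](http://stackoverflow.com/questions/15303270/perl-finding-a-file-based-off-its-extension-through-all-subdirectories) – hwnd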
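As a minimal sketch of the File::Find suggestion above: the helper name `find_index_files` and the command-line handling are assumptions made for illustration, not part of the original answer. It only collects the paths; the per-file parsing from the answer would go inside the final loop.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: collect every file named index.html below $root.
sub find_index_files {
    my ($root) = @_;
    my @found;
    find(
        sub {
            # Inside the callback $_ is the bare file name and
            # $File::Find::name is the path including $root.
            push @found, $File::Find::name
                if $_ eq 'index.html' && -f $_;
        },
        $root,
    );
    my @sorted = sort @found;
    return @sorted;
}

# Start from the directory given on the command line, or the current one.
my $root = @ARGV ? $ARGV[0] : '.';
for my $path (find_index_files($root)) {
    print "$path\n";    # replace this with the parsing code for one file
}
```

Collecting the paths first and parsing afterwards keeps the directory walk separate from the HTML handling, which makes either part easier to change.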

Use the range operator to find the text between two patterns:

use strict; 
use warnings; 

while (<DATA>) {
    if (my $range = /<div class="post-content">/ .. /<div class="box small arial">/) {
        # The last line of a range carries an "E0" suffix; skip the end marker.
        next if $range =~ /E/;
        print;
    }
}

__DATA__ 
<html> 
<head><title>stuff</title></head> 
<body> 
<div class="post-content"> 
<p>text I want</p> 
<p>1.text I want</p> 
<p>2.text I want</p> 
</div> 
<div class="box small arial"> 
</div> 
</body> 
</html>

Output:

<div class="post-content"> 
<p>text I want</p> 
<p>1.text I want</p> 
<p>2.text I want</p> 
</div>
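The question also asks to drop the `<p>` tags. A rough sketch of how the same range-operator approach could do that is shown here; the helper name `lines_between` and the tag-stripping regex are assumptions for illustration, and the regex is deliberately crude, so it only suits markup as simple as this sample.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text found strictly between the two
# marker lines, with tags removed and blank results dropped.
sub lines_between {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        # Scalar-context ".." is the flip-flop: false outside the range,
        # a sequence number inside it, with "E0" appended on the last line.
        my $range = ($line =~ /<div class="post-content">/)
                 .. ($line =~ /<div class="box small arial">/);
        next unless $range;
        next if $range == 1 || $range =~ /E/;    # skip both marker lines

        (my $text = $line) =~ s/<[^>]*>//g;      # crude tag removal
        $text =~ s/^\s+|\s+$//g;                 # trim whitespace
        push @kept, $text if length $text;
    }
    return @kept;
}

my @html = split /\n/, <<'HTML';
<html>
<body>
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
<div class="box small arial">
</div>
</body>
</html>
HTML

my @text = lines_between(@html);
print "$_\n" for @text;
```

With the sample above this prints the three paragraph texts, one per line. Note that the flip-flop keeps its state between calls, so input that never reaches the end marker would leave the range open for the next call.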

However, the real answer is to parse HTML with an actual HTML parser. I would recommend Mojo::DOM. For a helpful 8-minute introductory video, check out Mojocast Episode 5.

use strict; 
use warnings; 

use Mojo::DOM; 

my $data = do {local $/; <DATA>}; 

my $dom = Mojo::DOM->new($data); 

for my $div ($dom->find('div[class=post-content]')->each) { 
    print $div->all_text(); 
} 

__DATA__ 
<html> 
<head><title>stuff</title></head> 
<body> 
<div class="post-content"> 
<p>text I want</p> 
<p>1.text I want</p> 
<p>2.text I want</p> 
</div> 
<div class="box small arial"> 
</div> 
</body> 
</html>

Output:

text I want 1.text I want 2.text I want

Source

2014-06-29 18:23:46 Miller

Thanks for the thorough explanation! I used Mojo::DOM, but knowing how to use the range operator will help me a lot in the future! – Lenny
