2014-02-21 56 views

Associating adjacent elements with Perl's Web::Scraper

#!/usr/bin/perl 
use strict; 
use Web::Scraper; 
use Data::Dumper; 

my $html = q[ 
<html> 
    <body> 
    <div class="mainContainer"> 
     <div class="when">February 20, 2014</div> 
     <div class="name">Name 1</div> 
     <div class="desc">Desc 1</div> 
     <div class="when">February 21, 2014</div> 
     <div class="name">Name 2</div> 
     <div class="desc">Desc 2</div> 
     <div class="name">Name 3</div> 
     <div class="desc">Desc 3</div> 
     <div class="when">February 22, 2014</div> 
     <div class="name">Name 4</div> 
     <div class="desc">Desc 4</div> 
    </div> 
    </body> 
</html> 
]; 

my $scraper = scraper { 
    process ".when", "events[]" => scraper { 
     my $when = $_->content(); 
     my $hash = {}; 
     $hash->{$when}->{name} = "NAME"; 
     $hash->{$when}->{desc} = "DESC"; 
     return $hash; 
    }; 
}; 

my $result = $scraper->scrape($html); 

print Dumper($result); 

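What I'm trying to do is associate each date with the details of its events. As you can see, the divs are not nested, so this is not trivial (at least for me). Also, each event consists of a name and a desc. I haven't found a way to use CSS selectors to associate adjacent elements into the structure I need. I think I will need a custom subroutine to do the association. I would like to get back something like the following: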

[ 
'February 20, 2014' => [ 
    { 
    'name' => 'Name 1', 
    'desc' => 'Desc 1' 
    } 
], 
'February 21, 2014' => [ 
    { 
    'name' => 'Name 2', 
    'desc' => 'Desc 2' 
    }, 
    { 
    'name' => 'Name 3', 
    'desc' => 'Desc 3' 
    } 
], 
'February 22, 2014' => [ 
    { 
    'name' => 'Name 4', 
    'desc' => 'Desc 4' 
    } 
] 
] 

Answer


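You would probably be better served by letting the scraper collect the raw data first and then post-processing it afterwards. So, something like: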

my $scraper = scraper { 
    process ".when", "dates[]" => "TEXT"; 
    process ".name", "names[]" => "TEXT"; 
    process ".desc", "desc[]" => "TEXT"; 
}; 

my $result = $scraper->scrape($html); 

# Here you would start processing these 

my @dates = @{ $result->{dates} }; 
my @names = @{ $result->{names} }; 
my @info = @{ $result->{desc} }; 
my %events; 

for (my $i = 0; $i < scalar @dates; $i++) { 
    my $date = $dates[$i]; 
    my $name = $names[$i]; 
    my $info = $info[$i]; 
    if (exists $events{$date}) { 
        push @{ $events{$date} }, { 'name' => $name, 'desc' => $info }; 
    } 
    else { 
        $events{$date} = [{ 'name' => $name, 'desc' => $info }]; 
    } 
} 

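%events will then hold the data you need. This all assumes that you still need this, and that every event date is followed by exactly one name and one description. Note that the sample HTML in the question does not satisfy that assumption (February 21, 2014 has two events), so pairing the three arrays by index would misalign from that point on. Also, I haven't tested this.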
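If a date can be followed by more than one name/desc pair, one possible variation is to scrape all child divs as a single flat list in document order and group them in a loop. The following is only a sketch under some assumptions: the helper `group_events` and the key `items` are names made up for illustration, the HTML is a shortened copy of the question's sample, and the selector assumes every relevant div is a direct child of `.mainContainer`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

my $html = q[
<html><body>
<div class="mainContainer">
  <div class="when">February 20, 2014</div>
  <div class="name">Name 1</div>
  <div class="desc">Desc 1</div>
  <div class="when">February 21, 2014</div>
  <div class="name">Name 2</div>
  <div class="desc">Desc 2</div>
  <div class="name">Name 3</div>
  <div class="desc">Desc 3</div>
</div>
</body></html>
];

# Collect every direct child div in document order, keeping class and text.
my $scraper = scraper {
    process ".mainContainer > div", "items[]" => { class => '@class', text => 'TEXT' };
};

# Hypothetical helper: a "when" starts a new date, a "name" starts a new
# event under the current date, and a "desc" completes the latest event.
sub group_events {
    my @items = @_;
    my (%events, $date);
    for my $item (@items) {
        my $class = $item->{class} // '';
        my $text  = $item->{text};
        if ($class eq 'when') {
            $date = $text;
            $events{$date} ||= [];
        }
        elsif (!defined $date) {
            next;    # ignore anything that appears before the first date
        }
        elsif ($class eq 'name') {
            push @{ $events{$date} }, { name => $text };
        }
        elsif ($class eq 'desc' && @{ $events{$date} }) {
            $events{$date}[-1]{desc} = $text;
        }
    }
    return \%events;
}

my $result = $scraper->scrape($html);
my $events = group_events(@{ $result->{items} || [] });
print Dumper($events);
```

Because the result is a plain hash, the dates come back in no particular order. If the order matters, it would probably be simplest to also push each new date onto a separate array inside the `when` branch.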