2012-09-26 16 views
0

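Collecting links from HTML and building a Perl hash

I'm trying to grab all the links from HTML files stored locally and build a hash from them. I use File::Find to get the HTML files, but I have left that code out here.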

  1. The first hash key is the title
  2. The second key is the mirror
  3. The third key is the part, and its value is the URL

$hash{$title}{$mirror}{$part}=$url; 
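
For illustration, here is a minimal sketch of how a three-level hash of this shape could be filled and walked. The helper name `add_link` is hypothetical and not part of the original script; the title, mirror names, and URLs are placeholder values borrowed from the sample page.

```perl
use strict;
use warnings;

my %hash;

# Hypothetical helper: store one URL under title -> mirror -> part.
# Perl creates the intermediate hash levels on assignment (autovivification).
sub add_link {
    my ($title, $mirror, $part, $url) = @_;
    $hash{$title}{$mirror}{$part} = $url;
}

add_link('First hash key', 'Mirror1', 'part_1', 'http://mirror1.com/rvvaq1hi');
add_link('First hash key', 'Mirror1', 'part_2', 'http://mirror1.com/w33h9ym2');
add_link('First hash key', 'Mirror2', 'part_1', 'http://mirror2.com/t2wx9603');

# Walk the structure in sorted order, one line per stored URL.
for my $title (sort keys %hash) {
    for my $mirror (sort keys %{ $hash{$title} }) {
        for my $part (sort keys %{ $hash{$title}{$mirror} }) {
            print "$title,$mirror,$part,$hash{$title}{$mirror}{$part}\n";
        }
    }
}
```

Note that two links with the same title, mirror, and part share one slot, so the later assignment replaces the earlier one.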

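I can get links that have a single part and a single mirror, but I can't yet get multiple parts: I'm stuck in a loop. My pattern matches the mirror URLs, but how do I capture the part when it exists, otherwise set $part = "part_1", and then move on to the next URL?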

#!/usr/bin/perl 

my $Html = qq(
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 
    <html> 
     <head> 
     <meta http-equiv="content-type" content="text/html; charset=windows-1250"> 
     <meta name="generator" content="PSPad editor, www.pspad.com"> 
     <title>First hash key</title> 
     </head> 
     <body> 
     <div> 
     <br><b>Multi Links</b><br><br><!--colorstart:#FF0000--> 

     <span style="color:#FF0000"><!--/colorstart--><b>Mirror 1</b><!--colorend--></span><!--/colorend--> 
     <br><a href="http://mirror1.com/rvvaq1hi" target="_blank"><b>Part 1</b></a> 
     <br><a href="http://mirror1.com/w33h9ym2" target="_blank"><b>Part 2</b></a> 
     <br><a href="http://mirror1.com/fdnppn15" target="_blank"><b>Part 3</b></a></div> 

     </div> 
     <div> 
     <br><b>Single link multiple mirrors</b><br> 
     <br><a href="http://mirror1.com/t2wx9603" target="_blank"><!--colorstart:#FF0000--><span style="color:#FF0000"><!--/colorstart--><b>Mirror 1</b><!--colorend--></span><!--/colorend--></a></div> 
     <br><a href="http://mirror2.com/t2wx9603" target="_blank"><!--colorstart:#FF0000--><span style="color:#FF0000"><!--/colorstart--><b>Mirror 2</b><!--colorend--></span><!--/colorend--></a></div>  

     </div> 

     </body> 
    </html> 
); 
my @html = split(/\n/, $Html);
my $TheMain;
my $Title;
my @Names = qw(Mirror1 Mirror2 Mirror3);
my %hash;

foreach my $line (@html)
{
    print "Da Line [$line]\n";
    if ($line =~ m{<title>(.*?)</title>})
    {
        $Title = $1;
        print "$Title\n";
    }
    $line =~ s/\"/'/g;    # double quotes to single
    $line =~ s{\n}{}g;    # remove \n
    $line =~ s{\s+}{ }g;  # collapse runs of whitespace

    $TheMain = $TheMain . $line;
}
print "$TheMain\n";
unless ($TheMain eq "") # unless empty enter the loop
{
    while ($TheMain =~ m{a href=(.*?)/a})
    {
        my $A = $1;
        print "the A $A\n"; ## stuck in a loop
        my ($url, $part);
        $A =~ s/<.*?color.*?>//ig;
        while ($A =~ m{\'(http.*?)\'.*?<b>(.*?)</b> }gi)
        {
            $url  = $1;
            $part = $2;
            if ($part =~ m/part/i)
            {
                $part =~ s/ /_/;
            }
            else
            {
                $part = "part_1";
            }
        }

        foreach my $mirror (@Names) # filters out unwanted links
        {
            if ($url =~ /$mirror/i)
            {
                $hash{$Title}{$mirror}{$part} = $url;
            }
        }
    }
}

for my $Title (sort keys %hash)
{
    for my $Host (sort keys %{$hash{$Title}})
    {
        for my $part (sort keys %{$hash{$Title}{$Host}})
        {
            my $url = $hash{$Title}{$Host}{$part};
            print "$Title,$url\n";
        }
    }
}
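
A likely reason for the endless loop in the script above is that the outer `while` match has no `/g` modifier, so every iteration matches the first link again. The following is only a sketch of one way around that, not a definitive fix: `collect_links` is a hypothetical helper, it assumes `href` is the first attribute of each `<a>` tag, and it takes the part from a "Part N" label in the link text, falling back to `part_1`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): returns a hash
# reference shaped like $hash{$title}{$mirror}{$part} = $url.
sub collect_links {
    my ($html, $mirrors) = @_;
    my %hash;

    my ($title) = $html =~ m{<title>(.*?)</title>}is;
    $title = 'untitled' unless defined $title;

    # With /g in scalar context, each iteration resumes after the previous
    # match, so the loop ends once every anchor has been visited.
    while ($html =~ m{<a\s+href=(["'])(http.*?)\1[^>]*>(.*?)</a>}gis) {
        my ($url, $label) = ($2, $3);

        # Use "Part N" from the link text when present, else default to part_1.
        my $part = $label =~ m{\bpart\s*(\d+)}i ? "part_$1" : 'part_1';

        # Keep only URLs that mention one of the wanted mirror names.
        for my $mirror (@$mirrors) {
            next unless $url =~ /\Q$mirror\E/i;
            $hash{$title}{$mirror}{$part} = $url;
        }
    }
    return \%hash;
}

# Usage with a reduced version of the sample page.
my $sample = <<'HTML';
<html><head><title>First hash key</title></head><body>
<br><a href="http://mirror1.com/rvvaq1hi" target="_blank"><b>Part 1</b></a>
<br><a href="http://mirror1.com/w33h9ym2" target="_blank"><b>Part 2</b></a>
<br><a href="http://mirror2.com/t2wx9603" target="_blank"><b>Mirror 2</b></a>
</body></html>
HTML

my $links = collect_links($sample, [qw(Mirror1 Mirror2)]);
for my $title (sort keys %$links) {
    for my $mirror (sort keys %{ $links->{$title} }) {
        for my $part (sort keys %{ $links->{$title}{$mirror} }) {
            print "$title,$mirror,$part,$links->{$title}{$mirror}{$part}\n";
        }
    }
}
```

A regex is used here only to stay close to the original approach; it will miss anchors whose attributes come in a different order or whose markup is malformed.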
+4

It is much better to parse and extract data from HTML with a dedicated HTML parser such as http://search.cpan.org/dist/HTML-Parser/Parser.pm – Bitwise

+0

http://stackoverflow.com/a/1732454/1671032 – PSIAlt

Answers