2015-09-07 104 views
0

Parsing an HTML file with Perl

I am trying to use Perl to extract the table inside this HTML file.

I have tried this:

my $te = HTML::TableExtract->new(); 
$te->parse_file($g_log); 
print "=====TE: $te ======\n"; 

The output is:

HTML::TableExtract=HASH(0x266f5f)

I tried iterating over $te and found nothing. Can anyone guide me on what to do next? I am new to this.

Here is the HTML file:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:math="http://exslt.org/math" 
      xmlns:testng="http://testng.org"> 
     <head xmlns=""> 
      <title>TestNG Results</title> 
      <meta http-equiv="content-type" content="text/html; charset=utf-8"></meta> 
      <meta http-equiv="pragma" content="no-cache"></meta> 
      <meta http-equiv="cache-control" content="max-age=0"></meta> 
      <meta http-equiv="cache-control" content="no-cache"></meta> 
      <meta http-equiv="cache-control" content="no-store"></meta> 
      <LINK rel="stylesheet" href="style.css"></LINK> 
      <script type="text/javascript" src="main.js"></script> 
     </head> 
     <body> 
      <h2>Test suites overview</h2> 


<table width="100%"> 
       <tr> 
        <td align="center" id="chart-container"><script type="text/javascript"> 
              renderSvgEmbedTag(600, 200); 
             </script></td> 
       </tr> 
       </table> 

    </body> 
    </html> 

Answers

2
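#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

my $filename = "testfile.html";
my $te = HTML::TableExtract->new();
$te->parse_file($filename);

# Walk every table that was found and print its rows as comma-separated values.
foreach my $ts ($te->tables) {
    print "Table found at ", join(',', $ts->coords), ":\n";
    foreach my $row ($ts->rows) {
        # Cells may be undef (for example, cells covered by a span), so default them to ''.
        print " ", join(',', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}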

Note that HTML::TableExtract can also be invoked in 'tree' mode, in which the resulting HTML and the extracted tables are encoded in HTML::Element tree structures.

use HTML::TableExtract 'tree';
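
As a rough illustration of tree mode, the sketch below returns the markup of each table found in a file. It assumes HTML::TreeBuilder and HTML::ElementTable are installed, which tree mode depends on. The helper name `table_markup` is made up for this example, and `testfile.html` is just the file name used in the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract qw(tree);   # tree mode: tables are kept as element trees

# Hypothetical helper: return the HTML markup of every table found in a file.
sub table_markup {
    my ($filename) = @_;
    my $te = HTML::TableExtract->new();
    $te->parse_file($filename);
    # In tree mode each table object exposes its element tree via tree().
    return map { $_->tree->as_HTML } $te->tables;
}

my $filename = 'testfile.html';    # assumed file name, as in the answer above
print "$_\n" for table_markup($filename);
```

Calling `as_text` instead of `as_HTML` on the same tree should give the text content only, which may be more convenient when the markup itself is not needed.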

+0

I tried this and the hash shows '_tables' => {} – Virus

+1

I tried the code as written and it works. Make sure the path to the HTML file is correct. –
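
One way to rule out a wrong path, as the comment above suggests, is to open the file explicitly before handing it to the parser, so that a missing file is reported rather than simply producing no tables. This is only a sketch: `count_tables` is a made-up helper name and `testfile.html` is the file name from the answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: return the number of tables in a file,
# or undef if the file cannot be opened ($! then holds the reason).
sub count_tables {
    my ($filename) = @_;
    open my $fh, '<', $filename or return;
    my $te = HTML::TableExtract->new();
    $te->parse_file($fh);          # parse_file also accepts an open filehandle
    close $fh;
    my @tables = $te->tables;
    return scalar @tables;
}

my $filename = 'testfile.html';    # assumed file name, as in the answer above
my $count = count_tables($filename);
if (defined $count) {
    print "$count table(s) found in $filename\n";
} else {
    print "Could not open $filename: $!\n";
}
```

If this reports that the file could not be opened, the empty `_tables` hash is explained by the path; if it reports zero tables for a file that does open, the cause lies elsewhere.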

+0

+1, good. @ChankeyPathak – User

1

I'm not sure what you are trying to get out of the table, but I would strongly recommend using Data::Dumper to look at the hash.

#!/usr/bin/perl 

use strict; 
use warnings; 
use HTML::TableExtract; 
use Data::Dumper; 

my $html = <<'EOT'; 
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:math="http://exslt.org/math" 
      xmlns:testng="http://testng.org"> 
     <head xmlns=""> 
      <title>TestNG Results</title> 
      <meta http-equiv="content-type" content="text/html; charset=utf-8"></meta> 
      <meta http-equiv="pragma" content="no-cache"></meta> 
      <meta http-equiv="cache-control" content="max-age=0"></meta> 
      <meta http-equiv="cache-control" content="no-cache"></meta> 
      <meta http-equiv="cache-control" content="no-store"></meta> 
      <LINK rel="stylesheet" href="style.css"></LINK> 
      <script type="text/javascript" src="main.js"></script> 
     </head> 
     <body> 
      <h2>Test suites overview</h2> 


<table width="100%"> 
       <tr> 
        <td align="center" id="chart-container"><script type="text/javascript"> 
              renderSvgEmbedTag(600, 200); 
             </script></td> 
       </tr> 
       </table> 

    </table> 
    </body> 
    </html> 
EOT 

my $te = HTML::TableExtract->new(); 
$te->parse($html); 

print Dumper($te); 
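
The difference between printing an object and dumping it can be seen without any HTML at all. The sketch below uses only core modules; `My::Parser` is a hypothetical stand-in class, not part of HTML::TableExtract, and its single `_tables` key is there purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Scalar::Util qw(blessed reftype);

# Hypothetical stand-in for a parser object: a hash reference blessed into a class.
my $obj = bless { _tables => {} }, 'My::Parser';

# Interpolating an object gives only "Class=HASH(0x...)", never its contents.
my $as_string = "$obj";

# blessed() reports the class name, reftype() the underlying data type.
my $class = blessed($obj);
my $type  = reftype($obj);

# Dumper shows what is actually stored; these settings keep the output compact and stable.
local $Data::Dumper::Sortkeys = 1;
local $Data::Dumper::Indent   = 1;
my $dump = Dumper($obj);

print "stringified: $as_string\n";
print "class: $class, type: $type\n";
print $dump;
```

This is why `print "$te"` in the question shows only a class name and an address: the object is a blessed hash reference, and its fields become visible only when dumped or accessed through the module's methods.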
+0

Thanks! I looked at the data with Dumper and I see this: '_tables' => {} – Virus