2013-02-25

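Extracting a few rows from HTML with HTML::TableExtract

I wrote a script that extracts all the row data from HTML <TR> tags. My HTML page has 30 <TR> tags, and my code picks out a particular row based on a counter. Say I need the data in the 5th <tr>...</tr>: then my condition is if (count == 5) { (go inside and get that data) }.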

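My problem is that this approach selects only one row at a time, and I need several. Suppose I need the data from rows 5, 6, and 14.

Could you help me sort this out?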

$te = new HTML::TableExtract(count => 0); 
$te->parse($content); 
# Examine all matching tables 
foreach $ts ($te->table_states) { 
    #print "Table (", join(',', $ts->coords), "):\n"; 
    $cnt = 1; 
    foreach $row($ts->rows) { 
     # print " ---- Printing Row $cnt ----\n"; 
     $PrintLine= join("\t", @$row); 
     @RowData=split(/\t/,$PrintLine); 
     $PrintLine =~ s/\r//ig; 
     $PrintLine =~ s/\t//ig; 
     $cnt = $cnt + 1; 
     # if ($PrintLine =~ /Site ID/ig || $PrintLine =~ /Site name/ig){print " Intrest $PrintLine $cnt =====================\n"}; 
     if ($cnt == 14) { 
      $arraycnt = 1; 
      my $SiteID=""; 
      my $SiteName=""; 
      foreach (@RowData) { 
       # print " Array element $arraycnt\n"; 
       chomp; 
       $_ =~ s/\r//ig; 
       $_ =~ s/[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]//ig; 
       if ($arraycnt== 17) { $SiteID= $_;} 
       if ($arraycnt== 39) { $SiteName= $_;} 
        $arraycnt = $arraycnt + 1; 
      } 
      #$PrintLineFinal = $BridgeCase."\t".$PrintLine; 
      $PrintLineFinal = $BridgeCase."\t".$SiteID."\t".$SiteName; 
      #print "$PrintLineFinal\n"; 
      print MYFILE2 "$PrintLineFinal\n";   
      last; 
     }  
    } 
} 
You would benefit from indenting the code properly. – 2013-02-25 20:11:46

Answers

0

A few suggestions:

Always use:

use strict; 
use warnings; 

This forces you to declare your variables with my. For example:

foreach my $ts ($te->table_states) { 
    my $cnt = 1; 

warnings alerts you to the most common silly mistakes. strict requires you to follow better practices in some situations, which prevents errors.

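In several places you maintain your own counter variable while working through an array. You don't need to do that. Instead, fetch the array element you want directly. For example, $array[3] gets the fourth element, because indexes start at zero.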

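Perl also supports array slices for picking out just the elements you need. @array[4,5,13] returns the fifth, sixth, and fourteenth elements of the array. You can use a slice to process only the rows you want instead of looping over all of them: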

my @rows = $ts->rows; 
foreach my $row (@rows[4,5,13]) #process only the 5th, 6th, and 14th rows. 
{ 
    ... 
} 

Here is a shortcut version of the same thing, using an anonymous array:

foreach my $row (@{[$ts->rows]}[4,5,13]) 

And perhaps you want to define the rows you want somewhere else in your code:

my @wanted_rows = (4,5,13); 
... 
foreach my $row (@{[$ts->rows]}[@wanted_rows]) 
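
To try the slice on its own, here is a minimal sketch that uses plain array references in place of the output of $ts->rows. The sample data and the names @rows and @valid are made up for illustration. The grep guard is an addition, not part of the code above: it drops indexes past the end of the table, which would otherwise yield undef in the slice.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $ts->rows: 14 rows, each an array reference.
my @rows = map { [ "row$_", "cell$_" ] } 1 .. 14;

# Zero-based indexes of the 5th, 6th, and 14th rows, plus one that does not exist.
my @wanted_rows = (4, 5, 13, 29);

# Keep only the indexes that exist, so the slice never yields undef.
my @valid = grep { $_ <= $#rows } @wanted_rows;

foreach my $row (@rows[@valid]) {
    print join("\t", @$row), "\n";
}
```

With this sample data the loop prints rows 5, 6, and 14 and silently skips index 29.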

This code is rather confusing:

$PrintLine= join("\t", @$row); 
@RowData=split(/\t/,$PrintLine); 
$PrintLine =~ s/\r//ig; 
$PrintLine =~ s/\t//ig; 

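First you join the array into a tab-separated string, then you split the string you just joined to get the array back. After that, you strip all the tabs out of the line anyway.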
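
The round trip is not just redundant; it can also change the data. A small sketch, assuming a made-up row whose last cells are empty, shows that split drops trailing empty fields, so the array that comes back can be shorter than the one that went in:

```perl
use strict;
use warnings;

# Hypothetical row whose last two cells are empty.
my @row = ('A1', 'Site ID', '', '');

my $line  = join("\t", @row);     # "A1\tSite ID\t\t"
my @again = split(/\t/, $line);   # split discards trailing empty fields

print scalar(@row), " cells before, ", scalar(@again), " cells after\n";
```

A cell that itself contains a tab would be split in two as well, which is another reason to work with @$row directly.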

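I suggest you get rid of all that code. Whenever you need the array, just use @$row rather than making a copy. If you need to print the array for debugging (which is all you seem to be doing with $PrintLine), you can print an array directly: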

print @$row; #print an array, nothing between each element. 
print "@$row"; #print an array with spaces between each element. 
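
For comparison, a short sketch of the same two forms next to an explicit join, using a made-up row. The variables $plain, $spaced, and $tabbed are only there to make the results easy to inspect:

```perl
use strict;
use warnings;

my $row = [ 'A1', 'Site ID', '1042' ];   # hypothetical row

my $plain  = join('', @$row);     # what print @$row emits with default settings
my $spaced = "@$row";             # interpolation joins the elements with a space by default
my $tabbed = join("\t", @$row);   # explicit separator, no round trip through split

print "$plain\n$spaced\n$tabbed\n";
```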

With these changes, your code would look like this:

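use strict; 
use warnings; 
use HTML::TableExtract; 

# $content, $BridgeCase, and the MYFILE2 filehandle are assumed to be set up earlier in the script. 
my @wanted_rows = (4,5,13); # zero-based: the 5th, 6th, and 14th rows 

my $te = HTML::TableExtract->new(count => 0); 

$te->parse($content); 
# Examine all matching tables 
foreach my $ts ($te->table_states) { 
    foreach my $row (@{[$ts->rows]}[@wanted_rows]) { 
     next unless defined $row; # skip wanted rows that this table does not have 

     # strip the unwanted bytes, carriage returns, and newlines from every cell 
     s/[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3\r\n]//ig for (@$row); 

     my $SiteID = $$row[16] // ''; #set to empty strings if not defined (// needs Perl 5.10 or later). 
     my $SiteName = $$row[38] // ''; 
     print MYFILE2 $BridgeCase."\t".$SiteID."\t".$SiteName."\n"; 
    } 
} 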
0

You can access the results like this:

foreach my $ts ($te->table_states) { 
    # the 14th row (indexes start at zero): 
    #my $row14 = $ts->rows->[13]; 
    # the 17th column of the 14th row: 
    #my $col17_of_row14 = $ts->rows->[13]->[16]; 
    my $SiteName = $ts->rows->[13]->[38]; 
    my $SiteID = $ts->rows->[13]->[16]; 
    my $PrintLineFinal = $BridgeCase."\t".$SiteID."\t".$SiteName; 
    print MYFILE2 "$PrintLineFinal\n";  
}
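
The same row-then-column indexing can be tried on plain data. This is only a sketch: @rows is a made-up array of array references standing in for the extracted table, and the // '' fallback (Perl 5.10 or later) is an addition that keeps a missing cell from producing an "uninitialized value" warning.

```perl
use strict;
use warnings;

# Hypothetical table: 14 rows of 3 cells each, labelled by position.
my @rows = map { my $r = $_; [ map { "r${r}c$_" } 1 .. 3 ] } 1 .. 14;

my $cell    = $rows[13][1];          # 14th row, 2nd column
my $missing = $rows[13][38] // '';   # column does not exist, so fall back to ''

print "cell: $cell, missing: '$missing'\n";
```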