2016-06-08 99 views

Splitting an HTML file using Perl

I am new to Perl, but I am trying to write a program that splits a single HTML file into multiple HTML files.

#!/usr/bin/env perl 

use strict; 
#use warnings; 

my @file_names; 

## Read the list of file names 
open(my $fh, "$ARGV[0]"); 
while (<$fh>) { 
    chomp; #remove new line character from the end of the line 
    push @file_names, $_; 
} 

my $counter = 0; 
my ($file_name, $fn); 

## Read the input file 
open($fh, "$ARGV[1]"); 
while (<$fh>) { 

    ## If this is an opening class, open the next output file, 
    ## and set $counter to 1. 

    if (/ class="bch_ha"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    ## If this is a closing class, print the line and set $counter back to 0 

    if (/\n<p sourcepage="(\d+)" class="bch_ha"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bcesu_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<p sourcepage="(\d+)" class="bcekt_tt"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bcekt_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<p sourcepage="(\d+)" class="bcepq_tt"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bcepq_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<p sourcepage="(\d+)" class="bcecs_tt"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bcecs_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<p sourcepage="(\d+)" class="bceex_tt"/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    if (/ class="bceex_tt"/) { 
     $counter = 1; 
     $file_name = shift(@file_names); 
     open($fn, ">", "$file_name"); 

     #print "<html>\n<body>"; 
    } 

    if (/\n<\/body>\n<\/html>/) { 
     $counter = 0; 
     print $fn $_; 
     close($fn); 
    } 

    ## Print into the corresponding file handle if $counter is 1 

    print $fn $_ if $counter == 1 
} 

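I need to add more options. The code should prompt for the delimiter to be entered manually, and the split files should go into a folder named chapterxx. Please help me with this.

Please find a sample of the HTML below.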

<!DOCTYPE html> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<meta charset="UTF-8" /> 
</head> 
<body> 
<p sourcepage="27" `class="bch_ha"`></p> 
<p sourcepage="26"  class="bopob_ct">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p> 
<p sourcepage="26"  class="bopob_cr">Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <i>Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p> 
<p sourcepage="26"  class="bch_nmword">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bch_nm">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bch_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="26" class="bopob_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <b>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX% </b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="26"  class="bopob_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p> 
<p sourcepage="26" class="bopob_lbfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26" class="bch_ha">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
<p sourcepage="26"  class="bopob_lblast">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b> </p> 
<p sourcepage="26"  class="bopcs_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="26"  class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="26"  class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%<span class="sup">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</sup>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
<p sourcepage="27" class="bch_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
</body> 
</html> 

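I only need to split the HTML at each class="bch_ha", up to the next class="bch_ha", and write that content to a new HTML file named reader_0.html. The file names should increment: reader_1.html, and so on.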

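You should not comment out 'use warnings'. Those messages indicate that something in the code is not quite right, and turning them off does not fix the problem! – Borodin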

This should be done with a proper HTML parser. Please show the original HTML so that we can help you. If it is online, a link is fine. – Borodin

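I can't share the HTML because it is confidential official material. I only need to split the HTML file into multiple files by class name, as you can see in the code above. But this should be dynamic: I need to create a directory named after the input file, and all the split HTML files need to go into that folder. –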

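Answer

Maybe this example will give you an idea of how to finish your program.

The focus of this example is how to split the file based on a delimiter.

Note: only the HTML body is saved.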

#!/usr/bin/env perl 
# test.pl 

use strict; 
use warnings; 

my $file = './htmlInput.html'; # input file 
my $delim = 'class="bch_ha"'; # delimiter 
my $dir = 'chapter' . time; # folder with unix timestamp 

# mkdir returns 1 if success 
if (mkdir($dir, 0755)) { 
    print "INFO: Created folder $dir to collect files.\n"; 
} else { 
    die "Can't make folder $dir: $!\n"; # $! holds the reason mkdir failed
} 

# reader_x.html, x = [0..] 
my $reader = 'reader_0.html'; 

my $fh2; 
my $cnt = 0; 
my $delim_first_time = 1; 
open(my $fh, "<", $file) or die "Can't open and read $file: $!"; # read file 
while (my $li = <$fh>) { 
    last if ($li =~ /<\/body>/); # quit the while loop 

    # \Q...\E quotes the delimiter so it is matched literally, not as a regex
    if ($delim_first_time && $li =~ /\Q$delim\E/) {
     open($fh2, ">", "./$dir/$reader") or die "Can't write to $reader : $!"; # write
     $delim_first_time = 0;
    } elsif ($li =~ /\Q$delim\E/) {
     close($fh2);
     $cnt++;
     $reader =~ s/[0-9]+/$cnt/; # reader_0.html -> reader_1.html
     open($fh2, ">", "./$dir/$reader") or die "Can't write to $reader : $!"; # write
    }
    print $fh2 $li if !$delim_first_time;
}
close($fh);
close($fh2) if defined $fh2; # undefined if the delimiter never matched

# output: 
# [~]$ ./test.pl 
# INFO: Created folder chapter1465642603 to collect files. 
# [~]$ ls chapter1465642603 
# reader_0.html reader_1.html 
# [~]$ cat chapter1465642603/reader_0.html 
# <p sourcepage="27" `class="bch_ha"`></p> 
# <p sourcepage="26"  class="bopob_ct">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</p> 
# <p sourcepage="26"  class="bopob_cr">Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <i>Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p> 
# <p sourcepage="26"  class="bch_nmword">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bch_nm">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bch_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="26" class="bopob_tt">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx% <b>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX% </b>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="26"  class="bopob_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</i></p> 
# <p sourcepage="26" class="bopob_lbfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26" class="bopob_lb">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# [~]$ 
# [~]$ cat chapter1465642603/reader_1.html 
# <p sourcepage="26" class="bch_ha">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b></p> 
# <p sourcepage="26"  class="bopob_lblast">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</b> </p> 
# <p sourcepage="26"  class="bopcs_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="26"  class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="26"  class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="27" class="bopcs_tx">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%<span class="sup">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</sup>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# <p sourcepage="27" class="bch_txfirst">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%</p> 
# [~]$
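
The question also asks for a delimiter entered by hand and for an output folder named after the input file, and the script above saves only the body fragments. The following is a minimal sketch of one way to cover those points, not a definitive implementation. The helper names `dir_for` and `split_html`, the default delimiter, and the minimal `<html>` wrapper written around each fragment are assumptions made for illustration. It is still line-based, so it only works when every delimiter sits on its own line, as in the sample; a real HTML parser would be more robust.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: derive the output folder name from the input file
# name, e.g. "books/chapter01.html" -> "chapter01".
sub dir_for {
    my ($file) = @_;
    my ($name) = fileparse($file, qr/\.[^.]*/);
    return $name;
}

# Hypothetical helper: start a new reader_N.html in $dir at every line that
# contains $delim, and stop at </body>. Returns the list of files written.
sub split_html {
    my ($file, $delim, $dir) = @_;
    make_path($dir);    # no-op if the folder exists, dies on failure

    open(my $in, '<', $file) or die "Can't read $file: $!";
    my ($out, @written);
    while (my $line = <$in>) {
        last if $line =~ m{</body>}i;

        if (index($line, $delim) >= 0) {    # literal match, not a regex
            if ($out) {
                print {$out} "</body>\n</html>\n";
                close($out) or die "Can't close output file: $!";
            }
            my $path = File::Spec->catfile(
                $dir, 'reader_' . scalar(@written) . '.html');
            open($out, '>', $path) or die "Can't write $path: $!";
            # Assumed wrapper so that each piece is a complete document.
            print {$out} "<!DOCTYPE html>\n<html>\n<head>\n",
                         "<meta charset=\"UTF-8\" />\n</head>\n<body>\n";
            push @written, $path;
        }
        print {$out} $line if $out;    # lines before the first delimiter are skipped
    }
    if ($out) {
        print {$out} "</body>\n</html>\n";
        close($out) or die "Can't close output file: $!";
    }
    close($in);
    return @written;
}

# Runs only when an input file is given on the command line.
if (my $file = shift @ARGV) {
    print 'Delimiter [class="bch_ha"]: ';
    my $delim = <STDIN>;
    $delim = '' unless defined $delim;
    chomp $delim;
    $delim = 'class="bch_ha"' if $delim eq '';    # assumed default

    my $dir   = dir_for($file);
    my @files = split_html($file, $delim, $dir);
    printf "Wrote %d file(s) to %s\n", scalar @files, $dir;
}
```

Saved as, say, `split.pl` (a hypothetical name), it would be run as `./split.pl chapter01.html`; pressing Enter at the prompt keeps the default delimiter, and the pieces land in a folder called `chapter01` in the current directory.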