Perl: splitting a text file into chunks

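I have a large text file made up of several thousand articles, and I am trying to split it into separate files, one per article, saved as article_1, article_2, and so on. Each article begins with a line containing the word /DOCUMENTS/. I am completely new to Perl, so any insight would be great (even a pointer to a good documentation site). Many thanks. This is what I have tried so far: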

#!/usr/bin/perl 
use warnings; 
use strict; 

my $id = 0; 
my $source = "2010_FTOL_GRbis.txt"; 
my $destination = "file$id.txt"; 

open IN, $source or die "can t read $source: $!\n"; 

while (<IN>) 
    { 
    { 
     open OUT, ">$destination" or die "can t write $destination: $!\n"; 
     if (/DOCUMENTS/) 
     { 
     close OUT ; 
     $id++; 
     } 
    } 
    } 
close IN;


2012-07-30 user1562471

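I tried re-indenting what you pasted, and with correct indentation I see a redundant pair of '{}'s. Are you sure you pasted the whole thing? Also, next time you paste code into a question or answer, please use the '{}' button. – ArjunShankar 2012-07-30 09:52:34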

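Have you looked at Programming Perl? It is the best book to start with!

I don't understand what you are doing. I assume you have a text with articles in it and want to get each article into a separate file.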

use warnings;
use strict;
use autodie;  # open and close die with a message on failure

my $id          = 0;
my $source      = "2010_FTOL_GRbis.txt";
my $destination = "file$id.txt";

open my $IN, '<', $source;
# open the first output file
open my $OUT, '>', $destination;

while (<$IN>) {
    chomp; # remove the \n at the end
    if ($_ eq '/DOCUMENTS/') { # not sure whether this is the marker you are looking for
        close $OUT;
        $id++;
        $destination = "file$id.txt";
        open $OUT, '>', $destination; # reuse the outer $OUT so the print below reaches the new file
    } else {
        print {$OUT} $_, "\n"; # print into the currently open file$id.txt
    }
}
close $OUT;
close $IN;


2012-07-30 10:02:05 gaussblurinc

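You can get rid of the first assignment to `my $destination`. Also, I believe the OP means that the string '/DOCUMENTS/' (like a piece of a path in a filesystem) is part of the header line of a new article, so you should say `if (m{/DOCUMENTS/}) {`. – simbabque 2012-07-30 11:19:37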

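You are right that 'DOCUMENT' is part of each article's header. But the script above does not work: the loop has no effect, and I just end up with a file0.txt containing all my articles. – user1562471 2012-07-30 17:24:48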
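Following the comments above, one way to get the loop working is to test each line with a regex match instead of string equality, and to switch a single output handle whenever a header line is seen. This is only a sketch under assumptions: the helper name `split_articles`, the `article_N.txt` naming, and the decision to keep the header line inside each article are illustrative choices, not something taken from the answer.

```perl
use strict;
use warnings;
use autodie;    # open and close die with a message on failure

# Copy the lines read from $in into "<prefix>_1.txt", "<prefix>_2.txt", ...
# A new file is started at every line that matches $marker.
# Returns the number of files written.
sub split_articles {
    my ($in, $prefix, $marker) = @_;
    my $count = 0;
    my $out;                       # current output handle, undef until the first header
    while (my $line = <$in>) {
        if ($line =~ $marker) {
            close $out if $out;
            $count++;
            open my $fh, '>', "${prefix}_$count.txt";
            $out = $fh;
        }
        # Lines before the first header belong to no article, so they are skipped.
        print {$out} $line if $out;
    }
    close $out if $out;
    return $count;
}

# Possible usage with the file name from the question:
# open my $in, '<', '2010_FTOL_GRbis.txt';
# my $n = split_articles($in, 'article', qr/DOCUMENTS/);
# close $in;
# print "wrote $n files\n";
```

Because the pattern is passed in, it can be tightened once the exact header format is known; for a header of the form '1 of 150 DOCUMENTS', something like `qr/^\s*\d+ of \d+ DOCUMENTS/` might be a safer choice than a bare `qr/DOCUMENTS/`.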

Assuming /DOCUMENTS/ appears on a line by itself, you can make that line the record separator.

use English     qw<$RS>;
use File::Slurp qw<write_file>;
my $id     = 0;
my $source = "2010_FTOL_GRbis.txt";

{   local $RS = "\n/DOCUMENTS/\n";
    open my $in, '<', $source or die "can't read $source: $!\n";
    while (<$in>) {
        chomp; # removes the trailing separator "\n/DOCUMENTS/\n"
        write_file('file' . (++$id) . '.txt', $_);
    }
    # $in is scoped to the surrounding braces (my "local block"),
    close $in; # so an explicit close is not strictly necessary
}

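Notes:

`use English` declares the global variable `$RS`. Its "messy name" is `$/`. See perldoc perlvar.
The line separator is the default record separator. That is, the standard unit of file reading is the record, which is a "line" only by default.
As you will find in the linked documentation, `$RS` takes only literal strings. So, going with the idea that the division between articles is '/DOCUMENTS/' on a line by itself, I specified newline + '/DOCUMENTS/' + newline. If the marker instead occurs as part of a path somewhere within a line, that particular value will not work as the record separator.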
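To make the notes above concrete, here is a small sketch of how the record separator changes what the `<...>` operator returns and what `chomp` removes. It reads from an in-memory string rather than a real file, and the helper name `read_records` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Read records from $in, treating $separator as the end-of-record marker.
# Returns the list of records with the trailing separator removed.
sub read_records {
    my ($in, $separator) = @_;
    local $/ = $separator;    # $/ is the variable that "use English" calls $RS
    my @records;
    while (my $record = <$in>) {
        chomp $record;        # chomp strips a trailing $/, whatever it is set to
        push @records, $record;
    }
    return @records;
}

# Demonstration on an in-memory string instead of a real file.
my $text = "first article\n/DOCUMENTS/\nsecond article\n/DOCUMENTS/\nthird article\n";
open my $in, '<', \$text or die "can't open in-memory file: $!";
my @records = read_records($in, "\n/DOCUMENTS/\n");
close $in;
print scalar(@records), " records\n";    # prints "3 records"
```

The last record keeps its final "\n", because `chomp` only removes the full separator string and the text does not end with one. The `local` keeps the change to `$/` confined to the function, so other reads in the program still work line by line.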


2012-07-30 13:00:04 Axeman

Thank you very much for your reply. Could you explain how the record separator works? Should I declare the variable RS first? – user1562471 2012-07-30 17:20:27

@user1562471, see the Notes section, which I just added. – Axeman 2012-07-30 18:23:02

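Thanks again. DOCUMENTS does not appear on a line by itself; it is part of an expression such as "1 of 150 DOCUMENTS", so it will not work as a record separator. But I will try to find another separator that is a whole line. – user1562471 2012-07-30 19:52:41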
