How do I read large files with different line separators?

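I have two very large XML files with different line endings. File A has CR LF at the end of each XML record. File B has only CR at the end of each XML record.

To read file B correctly, I need to set the built-in Perl variable $/ to "\r". But if I use the same script with file A, the script does not read each line of the file; it reads the whole thing as a single line.

How can I make the script compatible with text files that use various line-ending delimiters? In the code below, the script reads the XML data and then uses a regular expression to split records on a specific XML record end tag (such as </record>). Finally, it writes the requested records to a file.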

open my $file_handle, '+<', $inputFile or die $!;
local $/ = "\r";
while (my $line = <$file_handle>) { # read file line-by-line. Does not load whole file into memory.
    $current_line = $line;

    if ($spliceAmount > $recordCounter) { # if the splice amount hasn't been reached yet
        push(@setofRecords, $current_line); # start adding each line to the set of records array
        if ($current_line =~ m|$recordSeparator|) { # check for the node to splice on
            $recordCounter++; # if the record separator was found (end of that record) then increment the record counter
        }
    }
    # don't close the file because we need to read the last line

}
$current_line =~ /(\<\/\w+\>$)/;
$endTag = $1;
print "\n\n";
print "End Tag: $endTag \n\n";

close $file_handle;

2013-06-03 astra

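You are going to get punished for assuming that XML files even have line breaks in sensible places. –

This is meant to be distributed, so I don't want to solve this with modules. Does this mean I have to rewrite it in a language other than Perl that has better support for XML parsing? – astra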

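If the file isn't too big to hold in memory, you can slurp the whole thing into a scalar and split it into the proper lines yourself with a suitably flexible regular expression. For example,

local $/ = undef;
my $data = <$file_handle>;
my @lines = split /(?<=\n)|(?<=\r)(?!\n)/, $data;
foreach my $line (@lines) {
    ...
}

Using the zero-width look-behind assertions (?<=...) preserves the end-of-line characters, like the regular <> operator does; the (?!\n) keeps a CR LF pair from being split in the middle. If you were just going to strip them anyway, you can save yourself a step by passing /\r\n|\r|\n/ to split instead.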
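
If the file is too big to slurp, one possibility is to sniff the terminator from the first chunk of the file and hand it to $/, so the usual line-by-line loop keeps working. This is only a sketch, not part of the answer above: `detect_line_ending` and `for_each_line` are hypothetical names, the 64 KB sniff size is arbitrary, and it assumes each file sticks to one line-ending convention throughout. It uses core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper: guess a file's line terminator from its first chunk.
# Assumes the file uses one convention throughout (CR LF, CR, or LF).
sub detect_line_ending {
    my ($path, $sniff_bytes) = @_;
    $sniff_bytes ||= 65536;    # arbitrary sniff size

    # :raw stops any CR LF translation layer from hiding the real bytes.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $buf = '';
    my $got = read($fh, $buf, $sniff_bytes);
    die "Cannot read $path: $!" unless defined $got;

    # A CR at the very end of the chunk may be half of a CR LF pair,
    # so pull in one more byte before deciding.
    if ($buf =~ /\r\z/ and not eof($fh)) {
        defined read($fh, $buf, 1, length $buf)
            or die "Cannot read $path: $!";
    }
    close $fh;

    # First terminator found wins; CR LF is tried before a lone CR.
    return $1 if $buf =~ /(\r\n|\r|\n)/;
    return "\n";               # nothing found: fall back to LF
}

# Hypothetical helper: call $callback once per line, terminator removed,
# reading one line at a time instead of loading the whole file.
sub for_each_line {
    my ($path, $callback) = @_;
    my $eol = detect_line_ending($path);

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/ = $eol;
    while (my $line = <$fh>) {
        chomp $line;           # chomp removes whatever $/ currently holds
        $callback->($line);
    }
    close $fh;
}
```

Calling `for_each_line($inputFile, sub { my ($line) = @_; ... })` would then behave the same for file A and file B. A file that mixes conventions would need a different approach.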

2013-06-03 20:26:32 mob

Even though you might not strictly need one, in theory you should use an XML parser to parse XML. I'd recommend XML::LibXML, or perhaps starting with XML::Simple.

2013-06-03 20:29:33

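Yes, I could use that, but this script is meant to be shared, and I'd rather not require other people to download modules in order to run it. – astra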
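
For a core-only route that also copes with files mixing CR LF, CR, and LF, the input can be read in fixed-size chunks and cut into lines by hand. The following is a rough sketch under assumptions, not code from this thread: `make_line_reader` is a hypothetical name, the default chunk size is arbitrary, and the handle is assumed to be opened with the :raw layer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an iterator that yields one line per call,
# without its terminator, for CR LF, CR, or LF endings (even mixed).
# Returns undef once the input is exhausted.
sub make_line_reader {
    my ($fh, $chunk_size) = @_;
    $chunk_size ||= 65536;     # arbitrary chunk size
    my $buf = '';
    my $eof = 0;

    return sub {
        while (1) {
            # A lone CR at the very end of the buffer is ambiguous (its LF
            # may be in the next chunk), so it only counts once we hit EOF.
            if ($buf =~ /\r\n|\n|\r(?!\z)/ or ($eof and $buf =~ /\r\z/)) {
                my $line = substr($buf, 0, $-[0]);   # text before the terminator
                substr($buf, 0, $+[0]) = '';         # drop line plus terminator
                return $line;
            }
            if ($eof) {
                return unless length $buf;           # nothing left
                my $line = $buf;                     # final unterminated line
                $buf = '';
                return $line;
            }
            # Append the next chunk to whatever is left in the buffer.
            my $got = read($fh, $buf, $chunk_size, length $buf);
            die "read failed: $!" unless defined $got;
            $eof = 1 if $got == 0;
        }
    };
}
```

A caller would loop with `while (defined(my $line = $next->())) { ... }`, testing with `defined` so that empty lines are not mistaken for the end of input. Only one chunk plus any partial line is held in memory at a time.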
