Reading a very large single-line txt file and splitting it

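I have the following problem: I have a file that is nearly 500 MB in size. Its text is all on a single line. The text is separated by a virtual line ending called ROW_DEL, which appears in the text like this: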

this is a line ROW_DEL and this is a line

Now I need to split this file into its rows, so that I get a file like this:

this is a line 
and this is a line

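The problem is that even when I open it with the Windows text editor, the editor breaks because the file is too big.

Is it possible to split this file the way I described with C#, Java or Python? What would be the best solution that does not overkill my CPU?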

Source

2013-05-16 gurehbgui

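Can't you use 'sed' or some other scripting tool? – harsh

Why do you call ROW_DEL a virtual line ending? Is ROW_DEL a literal sequence of characters in the file? I think your problem is easy to solve, but this point bothers me. – eyquem

You could try reading the file in fixed-size chunks; see the "read" entry in the StreamReader documentation (http://docs.python.org/release/2.4/lib/stream-reader-objects.html) –

Actually 500 MB of text is not that big, it's just that Notepad sucks. You probably don't have sed available since you are on Windows, but at least try the naive solution in Python; I think it will work fine: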

import os 
with open('infile.txt') as f_in, open('outfile.txt', 'w') as f_out: 
    f_out.write(f_in.read().replace('ROW_DEL ', os.linesep))
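
For readers on Python 3, the same read-everything approach can be sketched as below. This is a variant under stated assumptions, not the answer's own code: the function name `split_whole_file`, the UTF-8 encoding and the default separator without a trailing space are all hypothetical choices. It writes '\n' and relies on text mode translating that to the platform line ending on output.

```python
def split_whole_file(src_path, dst_path, sep="ROW_DEL"):
    """Read the whole file into memory and replace every sep with a newline.

    Assumptions (not from the thread): the file is UTF-8 text and fits in RAM.
    """
    # newline="" keeps any line endings already present in the input untouched.
    with open(src_path, encoding="utf-8", newline="") as f_in, \
         open(dst_path, "w", encoding="utf-8") as f_out:
        # In text mode, "\n" is translated to os.linesep when written.
        f_out.write(f_in.read().replace(sep, "\n"))
```

The whole file is held in memory once as the input string and once as the result, so for a 500 MB file this needs on the order of a gigabyte or more of free RAM.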

Source

2013-05-16 09:39:47 wim

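+1 Not sure why this was downvoted; it works instantly and answers the question. Maybe 'ROW_DEL ' should be 'ROW_DEL' – jamylak

It ran in <3 sec and all the work was done :) Thank you very much – gurehbgui

I think your answer could be fatal when reading larger files. If you read character by character, filling a vector whose length equals that of the searched substring, popping the front, pushing to the back and comparing, you get a more time-consuming but safer method. –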

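Read this file in chunks, for example using StreamReader.ReadBlock in C#. You can set the maximum number of characters to read there.

For each chunk read, you can replace ROW_DEL with \r\n and append it to a new file.

Just remember to increase the current index by the number of characters you have just read.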

Source

2013-05-16 09:28:40

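What if ROW_DEL is split across two chunks? – I4V

Correct. In that case, just check whether the last letters of your chunk are part of ROW_DEL and read more characters if needed. You have full control over how much you read, so it shouldn't be a problem. –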
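
The chunked idea discussed above, including the split-separator case, can be sketched in Python 3 roughly as follows. This is only an illustration: the name `split_rows`, its parameters and the 1 MiB default chunk size are assumptions, and it writes to a new file rather than using C#'s StreamReader. Instead of reading extra characters on demand, it holds back the last len(sep) - 1 bytes of each round, since only those could be the start of a separator that continues in the next chunk.

```python
def split_rows(src_path, dst_path, sep=b"ROW_DEL", eol=b"\r\n", chunk_size=1 << 20):
    """Copy src_path to dst_path, replacing every sep with eol, chunk by chunk."""
    if not sep:
        raise ValueError("sep must not be empty")
    keep = len(sep) - 1  # longest separator prefix that can dangle at a chunk end
    carry = b""          # bytes held back from the previous round
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk_size)
            if not block:
                break
            pieces = (carry + block).split(sep)
            tail = pieces.pop()        # data after the last complete separator
            for piece in pieces:       # every popped piece was followed by sep
                dst.write(piece)
                dst.write(eol)
            cut = max(len(tail) - keep, 0)
            dst.write(tail[:cut])      # too far from the end to start a later sep
            carry = tail[cut:]         # may begin a separator; decide next round
        dst.write(carry)               # whatever is left at end of file
```

Memory use stays around one chunk regardless of file size, and a separator cut by a chunk boundary is rejoined through `carry` before it is searched for.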

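Here is my solution.
Easy in principle (ŁukaszW.pl gave it), but not easy to code if one wants to take care of the special cases (which ŁukaszW.pl didn't).

The special cases are when the separator ROW_DEL is split across two read chunks (as I4V pointed out) and, more subtly, when there are two consecutive ROW_DEL of which the second is split across two read chunks.

Since ROW_DEL is longer than any possible newline ('\r', '\n', '\r\n'), it can be replaced in place in the file by the newline used by the OS. That's why I chose to rewrite the file onto itself.
For this I use mode 'r+', which doesn't create a new file.
It is also absolutely mandatory to use binary mode 'b'.

The principle is to read a chunk (in real life its size would be 262144, for example) plus x additional characters, where x is the length of the separator minus 1.
Then check whether the separator is present at the end of the chunk + the x characters.
Depending on whether it is present or not, the chunk is shortened or not before the ROW_DEL transformation is performed, and then it is rewritten.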

Bare code:

# -*- coding: utf-8 -*-
text = ('The hospital roommate of a man infected ROW_DEL'
     'with novel coronavirus (NCoV)ROW_DEL' 
     '—a SARS-related virus first identified ROW_DELROW_DEL' 
     'last year and already linked to 18 deaths—ROW_DEL' 
     'has contracted the illness himself, ROW_DEL' 
     'intensifying concerns about the ROW_DEL' 
     "virus's ability to spread ROW_DEL" 
     'from person to person.') 

with open('eessaa.txt','w') as f: 
    f.write(text) 

with open('eessaa.txt','rb') as f: 
    ch = f.read() 
    print ch.replace('ROW_DEL','ROW_DEL\n') 
    print '\nlength of the text : %d chars\n' % len(text) 

#========================================== 

from os.path import getsize 
from os import fsync,linesep 

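def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
    # a chunk must be able to hold at least one whole separator
    if chunk_length<len(sep):
        print 'Length of second argument, %d , is '\
              'the minimum value for the third argument'\
              % len(sep)
        return

    x = len(sep)-1   # number of characters read ahead after each chunk
    x2 = 2*x         # size of the window straddling the chunk boundary
    file_length = getsize(whichfile)
    # two handles on the same file: fR reads ahead, fW overwrites behind it
    with open(whichfile,'rb+') as fR,\
         open(whichfile,'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()
            ahead = fR.read(x)
            twelve = chunk[-x:] + ahead
            ptw = fR.tell()

            if ptw>=file_length:
                # last pass: chunk + ahead is everything that remains,
                # so the characters read ahead must be written too
                pt = x
                m = ''
                y = (chunk+ahead).replace(sep,OSeol)
            elif sep in twelve:
                # sep straddles the boundary: stop the chunk where sep starts
                pt = twelve.find(sep)
                m = ("\n !! %r is "
                     "at position %d in twelve !!" % (sep,pt))
                y = chunk[0:-x+pt].replace(sep,OSeol)
            else:
                pt = x
                m = ''
                y = chunk.replace(sep,OSeol)

            pos = fW.tell()
            fW.write(y)
            fW.flush()
            fsync(fW.fileno())

            if fR.tell()<file_length:
                # go back to the end of what has been written from this chunk
                fR.seek(-x2+pt,1)
            else:
                # the output is shorter than the input: cut off the leftover
                fW.truncate()
                break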

rewrite('eessaa.txt','ROW_DEL',14) 

with open('eessaa.txt','rb') as f: 
    ch = f.read() 
    print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1)) 
    print '\nlength of the text : %d chars\n' % len(ch)

To follow the execution, here is another version of the code that prints messages all along:

# -*- coding: utf-8 -*-
text = ('The hospital roommate of a man infected ROW_DEL'
     'with novel coronavirus (NCoV)ROW_DEL' 
     '—a SARS-related virus first identified ROW_DELROW_DEL' 
     'last year and already linked to 18 deaths—ROW_DEL' 
     'has contracted the illness himself, ROW_DEL' 
     'intensifying concerns about the ROW_DEL' 
     "virus's ability to spread ROW_DEL" 
     'from person to person.') 

with open('eessaa.txt','w') as f: 
    f.write(text) 

with open('eessaa.txt','rb') as f: 
    ch = f.read() 
    print ch.replace('ROW_DEL','ROW_DEL\n') 
    print '\nlength of the text : %d chars\n' % len(text) 

#========================================== 

from os.path import getsize 
from os import fsync,linesep 

def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
    # a chunk must be able to hold at least one whole separator
    if chunk_length<len(sep):
        print 'Length of second argument, %d , is '\
              'the minimum value for the third argument'\
              % len(sep)
        return

    x = len(sep)-1   # number of characters read ahead after each chunk
    x2 = 2*x         # size of the window straddling the chunk boundary
    file_length = getsize(whichfile)
    # two handles on the same file: fR reads ahead, fW overwrites behind it
    with open(whichfile,'rb+') as fR,\
         open(whichfile,'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()
            ahead = fR.read(x)
            twelve = chunk[-x:] + ahead
            ptw = fR.tell()

            if ptw>=file_length:
                # last pass: chunk + ahead is everything that remains,
                # so the characters read ahead must be written too
                pt = x
                m = ''
                y = (chunk+ahead).replace(sep,OSeol)
            elif sep in twelve:
                # sep straddles the boundary: stop the chunk where sep starts
                pt = twelve.find(sep)
                m = ("\n !! %r is "
                     "at position %d in twelve !!" % (sep,pt))
                y = chunk[0:-x+pt].replace(sep,OSeol)
            else:
                pt = x
                m = ''
                y = chunk.replace(sep,OSeol)
            print ('chunk == %r %d chars\n'
                   ' -> fR now at position %d\n'
                   'twelve == %r %d chars %s\n'
                   ' -> fR now at position %d'
                   % (chunk ,len(chunk),  pch,
                      twelve,len(twelve),m, ptw))

            pos = fW.tell()
            fW.write(y)
            fW.flush()
            fsync(fW.fileno())
            print ('   %r %d long\n'
                   ' has been written from position %d\n'
                   ' => fW now at position %d'
                   % (y,len(y),pos,fW.tell()))

            if fR.tell()<file_length:
                # go back to the end of what has been written from this chunk
                fR.seek(-x2+pt,1)
                print ' -> fR moved %d characters back to position %d'\
                      % (x2-pt,fR.tell())
            else:
                print (" => fR is at position %d == file's size\n"
                       ' File has thoroughly been read'
                       % fR.tell())
                # the output is shorter than the input: cut off the leftover
                fW.truncate()
                break

            raw_input('\npress any key to continue')


rewrite('eessaa.txt','ROW_DEL',14) 

with open('eessaa.txt','rb') as f: 
    ch = f.read() 
    print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1)) 
    print '\nlength of the text : %d chars\n' % len(ch)

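There is some subtlety in the treatment at the ends of the chunks, needed to detect whether ROW_DEL straddles two chunks and whether there are two consecutive ROW_DEL. That is why it took me a long time to post my solution: I was finally obliged to write fR.seek(-x2+pt,1) and not just fR.seek(-2*x,1) or fR.seek(-x,1) depending on whether sep straddles or not (2*x is x2 in the code; for ROW_DEL, x and x2 are 6 and 12). Anybody interested in this point can examine it by changing the code in the branches that depend on whether 'ROW_DEL' is in twelve or not.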
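
For readers on Python 3, the same look-ahead window idea can be sketched on bytes. This is a hedged port, not eyquem's code: the names `rewrite_in_place`, `window` and `keep` are hypothetical, it uses an absolute seek to the point where the next read should start instead of the relative fR.seek(-x2+pt,1), and it handles the final pass by processing the chunk and the look-ahead together. It assumes the replacement is no longer than the separator, so the write position can never overtake the read position.

```python
import os

def rewrite_in_place(path, sep=b"ROW_DEL", chunk_length=262144,
                     eol=os.linesep.encode()):
    """Replace sep with eol inside the file itself, without a second file."""
    if len(eol) > len(sep):
        raise ValueError("eol must not be longer than sep for in-place rewriting")
    if chunk_length < len(sep):
        raise ValueError("chunk_length must be at least len(sep)")
    x = len(sep) - 1                      # look-ahead size
    file_length = os.path.getsize(path)
    with open(path, "rb") as reader, open(path, "r+b") as writer:
        while True:
            start = reader.tell()
            chunk = reader.read(chunk_length)
            peek = reader.read(x)
            if reader.tell() >= file_length:
                # Last pass: everything that remains is in memory.
                writer.write((chunk + peek).replace(sep, eol))
                writer.truncate()         # drop the now-unused tail of the file
                break
            # Window of 2*x bytes centred on the chunk boundary.
            window = (chunk[-x:] + peek) if x else b""
            pt = window.find(sep)
            if pt == -1:
                pt = x                    # no straddling separator: keep whole chunk
            keep = len(chunk) - x + pt    # bytes of this chunk that are safe to emit
            writer.write(chunk[:keep].replace(sep, eol))
            writer.flush()
            reader.seek(start + keep)     # resume where the emitted part ended
```

With x = 6 and a separator found at pt = 2 in the window, `keep` is len(chunk) - 4, so the next read starts exactly on the 'R' of the straddling ROW_DEL, which is the position the relative seek by -(12 - 2) from the end of the look-ahead also reaches.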

Source

2013-05-16 15:42:24 eyquem
