2014-09-22 131 views
0

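Read a file line by line, but in reverse (last line first, then the one before last, etc.)

I want to remove trailing blank lines from a file, if there are any. Currently I do this by reading the file into memory, removing the blank lines there, and overwriting the file. The file is large, though (more than 30,000 lines, and the lines are long), so this takes 2-3 seconds.

So I want to read the file line by line, but backwards, until I reach the first non-blank line. That is, I start with the last line, then the one before it, and so on. Then I would truncate the file instead of overwriting it.

What is the best way to read it in reverse? Right now I'm thinking of reading a 64k block, looping through the string in reverse, character by character, until I have a line, and then, when I run out of the 64k, reading another 64k block and prepending it, and so on.

I assume there is no standard function or library for reading a file in reverse order?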

How many blank lines do you expect? Tens of thousands? Each one is probably just a single newline character, so I think even 64k bytes may be overkill. – Blckknght 2014-09-22 08:51:46

It might be, but it is still a pretty drastic optimization compared to reading everything into memory. – sashoalm 2014-09-22 08:53:09

There is no built-in function to do this, but I had to write a class for it. I'll see if I can get permission to post it. – 2014-09-22 08:59:19

Answers

2

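Here is some code, a modified version of something I found elsewhere (probably here on Stack Overflow, in fact...). I've extracted the two key methods that handle reading backwards.

The reversed_blocks iterator reads the file backwards in blocks of whatever size you like. The reversed_lines iterator splits each block into lines and saves the first line of the block, since it may be incomplete. If the next block (the one that precedes it in the file) ends with a newline, the saved line is returned as a complete line; if not, the saved partial line is appended to the last line of the new block, completing the line that was split across the block boundary.

All of the state is maintained by Python's iterator mechanism, so we don't have to store it anywhere ourselves. This also means you can read several files backwards at the same time if you need to, because the state is bound to each iterator.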

import os

def reversed_lines(self, file):
    "Generate the lines of file in reverse order."
    # The file must be opened in binary mode ('rb'): arbitrary seek offsets are
    # only well defined for binary files, so the lines are yielded as bytes.
    newline_chars = (b'\r', b'\n')
    # None means "no saved line yet"; an empty tail is a real (blank) line.
    tail = None
    for block in self.reversed_blocks(file):
        if block:
            # First split the whole block into lines and reverse the list
            reversed_lines = block.splitlines()
            reversed_lines.reverse()

            if tail is not None:
                # If the last char of the block is not a newline, then the last line
                # crosses a block boundary, and the tail (possible partial line from
                # the previous block) should be added to it.
                if block[-1:] not in newline_chars:
                    reversed_lines[0] = reversed_lines[0] + tail

                # Otherwise, the block ended on a line boundary, and the tail is a
                # complete line itself. (A '\r\n' pair split across two blocks is
                # seen as two line endings here.)
                else:
                    reversed_lines.insert(0, tail)

            # Within the current block, we can't tell if the first line is complete
            # or not, so we extract it and save it for the next go-round with a new
            # block. We yield instead of returning so all the internal state of this
            # iteration is preserved (how many lines returned, current tail, etc.).
            tail = reversed_lines.pop()

            for reversed_line in reversed_lines:
                yield reversed_line

    # We're out of blocks now; if there's a tail left over from the last block we read,
    # it's the very first line in the file. Yield that and we're done.
    if tail is not None:
        yield tail

def reversed_blocks(self, file, blocksize=4096):
    "Generate blocks of file's contents in reverse order."

    # Jump to the end of the file, and save the file offset.
    file.seek(0, os.SEEK_END)
    here = file.tell()

    # When the file offset reaches zero, we've read the whole file.
    while 0 < here:
        # Compute how far back we can step; either there's at least one
        # full block left, or we've gotten close enough to the start that
        # we'll read the whole file.
        delta = min(blocksize, here)

        # Back up to there and read the block; we yield it so that the
        # variable containing the file offset is retained.
        file.seek(here - delta, os.SEEK_SET)
        yield file.read(delta)

        # Move the pointer back by the amount we just handed out. If we've
        # read the last block, "here" will now be zero.
        here -= delta

reversed_lines is an iterator, so you run it in a loop:

for line in self.reversed_lines(fh): 
    do_something_with_the_line(line) 

The comments may be superfluous, but they were useful to me while I was working out how the iterators do their job.
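To connect this back to the question, here is a minimal standalone sketch of the same idea written as plain functions instead of methods. It assumes the file is opened in binary mode and that a "blank" line is one containing only whitespace; the helper name count_trailing_blank_lines is made up for illustration and is not part of the original class.

```python
import os
from itertools import takewhile

def reversed_blocks(f, blocksize=4096):
    """Yield blocks of a binary file, last block first."""
    f.seek(0, os.SEEK_END)
    here = f.tell()
    while here > 0:
        delta = min(blocksize, here)
        f.seek(here - delta, os.SEEK_SET)
        yield f.read(delta)
        here -= delta

def reversed_lines(f, blocksize=4096):
    """Yield the lines of a binary file (without line endings), last line first."""
    tail = None  # first line of the previously read block, possibly partial
    for block in reversed_blocks(f, blocksize):
        lines = block.splitlines()
        lines.reverse()
        if tail is not None:
            if block[-1:] in (b"\r", b"\n"):
                lines.insert(0, tail)   # block ends on a line boundary
            else:
                lines[0] += tail        # line was split across two blocks
        tail = lines.pop()
        for line in lines:
            yield line
    if tail is not None:
        yield tail

def count_trailing_blank_lines(f, blocksize=4096):
    """Hypothetical helper: count blank lines at the end of f.

    takewhile stops at the first non-blank line, so only the blocks
    needed to reach that line are ever read.
    """
    blank = takewhile(lambda line: not line.strip(), reversed_lines(f, blocksize))
    return sum(1 for _ in blank)
```

Called with a file opened as open(path, "rb"), count_trailing_blank_lines reads backwards only until the first non-blank line turns up, which is the point of the generator approach: the rest of the file is never touched.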

0
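import os

# Open read/write in binary mode so the file can be truncated in place
with open(filename, "r+b") as f:
    size = os.stat(filename).st_size
    # Read at most the last 4096 bytes (the file may be shorter than that)
    f.seek(max(0, size - 4096))
    block = f.read(4096)
    # Find amount to truncate
    f.truncate(...)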
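The "find amount to truncate" step is left open above. One possible way to work it out is sketched below, under some assumptions that are not in the original: a blank line is taken to mean one made up only of CR/LF bytes, the single line ending after the last non-empty line is kept, and the function name strip_trailing_blank_lines and its blocksize parameter are hypothetical. It keeps stepping back one block at a time in case the run of blank lines is longer than a single block.

```python
import os

def strip_trailing_blank_lines(path, blocksize=4096):
    """Truncate the file in place after its last non-empty line.

    Returns the number of bytes removed.
    """
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        here = size
        keep = 0  # offset just past the last byte that is not CR or LF
        while here > 0:
            delta = min(blocksize, here)
            f.seek(here - delta)
            stripped = f.read(delta).rstrip(b"\r\n")
            if stripped:
                keep = here - delta + len(stripped)
                break
            here -= delta
        if keep > 0:
            # Keep the one line ending that terminates the last non-empty line.
            f.seek(keep)
            ending = f.read(2)
            if ending.startswith(b"\r\n"):
                keep += 2
            elif ending[:1] in (b"\n", b"\r"):
                keep += 1
        f.truncate(keep)
        return size - keep
```

For a file containing b"a\nb\n\n\n" this leaves b"a\nb\n" on disk. Only the tail of the file is read, and nothing is rewritten, so the cost depends on the number of trailing blank lines rather than on the size of the file.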

By the way, you can use 'f.seek(-4096, 2)'. – sashoalm 2014-09-22 09:02:07

So you do know how to read a file from the end? Or did I misunderstand your question? You can easily get the amount to truncate by computing '4096 - len(block.rstrip())'. – filmor 2014-09-22 10:45:46

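This gives you the blocks in reverse, but not the lines. See my iterator-based answer for a nice trick for keeping track of the block and line offsets, so that you don't have to worry about maintaining them yourself. – 2014-09-22 18:14:36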