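Python - Can I add a UTF-8 BOM to a file without opening it?

How can I add a UTF-8 BOM to a text file without calling open()?

In theory, we only need to add the UTF-8 BOM at the beginning of the file, so we shouldn't need to read in all of the content?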

2016-08-23 minion

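Adding content to the beginning of a file involves rewriting the whole file; you can only append to the end of a file, not insert content somewhere in the middle. And you cannot modify a file without opening it. So no, what you want is not possible. – dhke

@dhke 'Without opening it' is indeed inaccurate. I have many large files, around 1 gigabyte each. What is the best way to add a UTF-8 BOM? – minion

@minion: There is no way to avoid reading and writing the full 1 GB. Your only options are a temporary file (atomic and safe, but with higher temporary disk space requirements) or modifying the file in place (usually slower, and it may corrupt the data if interrupted midway, but it needs only minimal extra disk space). – ShadowRanger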
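To illustrate the point made in these comments, here is a minimal sketch showing that writing at offset 0 overwrites the existing bytes rather than inserting before them. The helper name `overwrite_at_start` is hypothetical and exists only for this demonstration; it works on a scratch file, not on your data.

```python
import os
import tempfile

def overwrite_at_start(data: bytes, prefix: bytes) -> bytes:
    """Write `prefix` at offset 0 of a scratch file holding `data`
    and return the resulting file contents."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        # 'r+b' opens for update without truncating; the write starts at offset 0
        with open(path, 'r+b') as f:
            f.write(prefix)
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)

# The first three bytes of the text are replaced, not shifted:
print(overwrite_at_start(b'hello world', b'\xef\xbb\xbf'))
# → b'\xef\xbb\xbflo world'
```

This is why every approach in this thread has to move the existing data.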

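You need to read the data, because you need to shift all of it to make room for the BOM. Files can't just have arbitrary data prepended. Doing it in place is harder than writing a new file containing the BOM followed by the original data and then replacing the original file, so the simplest solution is usually something like this: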

import os
import shutil

from os.path import dirname, realpath
from tempfile import NamedTemporaryFile

infile = ...

# Create the tempfile in the same directory so os.replace stays on one filesystem
indir = dirname(realpath(infile))
# delete=False: we rename or remove the tempfile ourselves
with NamedTemporaryFile(dir=indir, mode='w', encoding='utf-8-sig',
                        newline='', delete=False) as tf:
    try:
        # Open original file as UTF-8; newline='' on both sides
        # keeps the original line endings untouched
        with open(infile, encoding='utf-8', newline='') as f:
            # Copy from one file to the other by blocks
            # (avoids memory use of slurping whole file at once)
            shutil.copyfileobj(f, tf)

        # Optional: Replicate metadata of original file
        tf.flush()
        shutil.copystat(infile, tf.name)  # Replicate permissions of original file
    except BaseException:
        # Copy failed (e.g. input was not valid UTF-8): discard the tempfile
        tf.close()
        os.remove(tf.name)
        raise

# Tempfile is closed now; atomically replace original file with BOM marked file
os.replace(tf.name, infile)

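As a side effect, this also verifies that the input file really is UTF-8, and the original file never exists in an inconsistent state; it is either the old data or the new data, never an intermediate working copy.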
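When processing many files, it may be worth skipping those that already start with a BOM, since checking only requires reading the first three bytes. A minimal sketch of such a check follows; the helper name `has_utf8_bom` is an assumption for illustration and is not part of the answer above.

```python
import codecs

def has_utf8_bom(path):
    """Return True if the file at `path` starts with the 3-byte UTF-8 BOM."""
    with open(path, 'rb') as f:
        # Reads at most len(BOM) bytes; a shorter file simply compares unequal
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

A caller could run this first and only do the full rewrite when it returns False.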

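If your files are large and disk space is limited (so you can't have two copies on disk at once), then in-place mutation might be acceptable. The easiest way to do this is the mmap module, which greatly simplifies moving the data around compared to using in-place file object operations: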

import codecs 
import mmap 

# Open file for read and write and then immediately map the whole file for write 
with open(infile, 'r+b') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm: 
    origsize = mm.size() 
    bomlen = len(codecs.BOM_UTF8) 
    # Allocate additional space for BOM 
    mm.resize(origsize+bomlen) 

    # Copy file contents down to make room for BOM 
    # This reads and writes the whole file, and is unavoidable 
    mm.move(bomlen, 0, origsize) 

    # Insert the BOM before the shifted data 
    mm[:bomlen] = codecs.BOM_UTF8


2016-08-23 07:09:38 ShadowRanger

If you need to update in place, something like

import codecs
import resource  # Unix-only; used here just for the page size

def add_bom(fname, bom=None, buf_size=None):
    bom = bom or codecs.BOM_UTF8
    # blocks must be at least as long as the BOM, see the loop below
    buf_size = max(buf_size or resource.getpagesize(), len(bom))
    with open(fname, 'rb') as in_fd, open(fname, 'rb+') as out_fd:
        # we cannot just read until EOF, because we
        # will be writing to that very same file, extending it.
        out_fd.seek(0, 2)
        nbytes = out_fd.tell()
        out_fd.seek(0)
        pending = in_fd.read(min(buf_size, nbytes))
        if pending.startswith(bom):
            # don't write the BOM if it's already there
            return
        out_fd.write(bom)
        nbytes -= len(pending)
        while pending:
            # read the next block *before* writing the current one: the
            # write lands len(bom) bytes past the block's original offset
            # and must never clobber data that has not been read yet
            upcoming = in_fd.read(min(buf_size, nbytes))
            nbytes -= len(upcoming)
            out_fd.write(pending)
            pending = upcoming

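should do it.

As you can see, in-place updates are a bit finicky, because you are

1. writing data to the very place you have just read from. Reads must always stay ahead of writes, otherwise you will overwrite data that has not been processed yet.
2. extending the file you are reading from, so reading until EOF does not work.

In addition, this can leave the file in an inconsistent state. If at all possible, the approach of copying to a temporary file and then moving the temporary file over the original is strongly preferred.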


2016-08-23 07:52:32 dhke

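I added an alternative in-place solution to [my answer](http://stackoverflow.com/a/39094533/364696) that uses `mmap` to simplify the work. I find it easier than trying to use file object operations. – ShadowRanger

@ShadowRanger Nice, I considered that too. If you look at the source code of `cp`, you'll find it also uses chunked `mmap` to avoid VM thrashing on large files. It's not much of a problem on a 64-bit OS, but with a 3+ GB file you won't be able to `mmap` it on a 32-bit machine. – dhke

I'd assume it uses chunked `mmap` to avoid virtual memory address space limits rather than thrashing, but yes, it won't scale on 32-bit machines beyond (usually) about 1.5 GB. Solution: it's 2016, run a 64-bit OS and Python install. :-) – ShadowRanger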
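For cases where mapping the whole file is not possible, one alternative (not taken from either answer, just a sketch under stated assumptions) is to shift the data in fixed-size chunks with ordinary file operations, working from the end of the file toward the start so that no unread data is overwritten. The function name `prepend_bom_in_place` and the `chunk_size` parameter are hypothetical. Like the other in-place approaches, this is not atomic and can corrupt the file if interrupted.

```python
import codecs
import os

def prepend_bom_in_place(path, chunk_size=1 << 20):
    """Shift the file contents right by len(BOM) bytes, back to front,
    then write the BOM into the gap. Returns False if a BOM was present."""
    bom = codecs.BOM_UTF8
    with open(path, 'r+b') as f:
        if f.read(len(bom)) == bom:
            return False  # already has a BOM; nothing to do
        pos = f.seek(0, os.SEEK_END)  # seek returns the new absolute offset
        while pos > 0:
            start = max(0, pos - chunk_size)
            f.seek(start)
            chunk = f.read(pos - start)
            # Everything at or after `pos` has already been moved, so
            # writing up to pos + len(bom) cannot clobber unread data
            f.seek(start + len(bom))
            f.write(chunk)
            pos = start
        f.seek(0)
        f.write(bom)
    return True
```

Memory use is bounded by `chunk_size`, and no address space beyond that is needed, at the cost of more seeks than the `mmap` version.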
