Removing a sequence of characters from a large binary file using Python

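I would like to trim long sequences of the same value from a binary file in Python. A simple way of doing it is to read in the whole file and use re.sub to replace the unwanted sequence. This will of course not work on large binary files. Can it be done in something like numpy?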

2008-10-21 bluegray

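If you don't have the memory to do open("big.file").read(), then numpy won't really help. It uses the same memory as Python variables do (if you have 1GB of RAM, you can only load 1GB of data into numpy).

The solution is simple: read the file in chunks. Open it with f = open("big.file", "rb"), then do a series of f.read(500) calls, remove the sequence from each chunk and write the result to another file object. This is pretty much how you do file reading and writing in C.

The problem then is that you can miss the pattern you are replacing when it is split across two chunks. For example: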

target_seq = "567"
input_file = "1234567890"  # stands for the contents of the file

input_file.read(5) # reads 12345, doesn't contain 567
input_file.read(5) # reads 67890, doesn't contain 567

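The obvious solution is to start at the first character in the file, check len(target_seq) characters, then move forward one character and check again.

For example (pseudocode!):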

seek_start = 0
chunk_size = len(target_seq)

while True:
    input_file.seek(seek_start, 0)  # whence=0 means seek from start of file (0 + offset)
    cur_data = input_file.read(chunk_size)  # first pass reads 123
    if not cur_data:
        break  # end of file
    if cur_data == target_seq:
        # Found it! Write the replacement and skip past the match
        out_file.write(b"replacement_string")
        seek_start += chunk_size
    else:
        # Not it: only the first byte is known to be safe, shove it in the new file
        out_file.write(cur_data[:1])
        seek_start += 1

It's not exactly the most efficient way, but it will work, and it doesn't require keeping a copy of the file in memory (or two).
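A middle ground between the naive chunked read and the one-character window could look like the sketch below. It is only an illustration, not code from this thread: the name replace_stream and the chunk_size parameter are made up here, and the idea is to hold back the last len(old) - 1 bytes of each chunk so that a match split across two reads is still seen.

```python
import io

def replace_stream(src, dst, old, new, chunk_size=64 * 1024):
    """Copy src to dst, replacing each occurrence of old (bytes) with new."""
    if not old:
        raise ValueError("old must not be empty")
    keep = len(old) - 1  # longest piece of a match that can dangle at a chunk end
    buf = b""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pos = 0
        # Scan left to right so results agree with bytes.replace on the whole file
        while True:
            hit = buf.find(old, pos)
            if hit == -1:
                break
            dst.write(buf[pos:hit])
            dst.write(new)
            pos = hit + len(old)
        # Bytes before safe_end cannot start a match; hold back the rest
        safe_end = max(pos, len(buf) - keep)
        dst.write(buf[pos:safe_end])
        buf = buf[safe_end:]
    dst.write(buf)  # whatever was held back at end of file

# The example above: "567" is split across two 5-byte reads
src = io.BytesIO(b"1234567890")
dst = io.BytesIO()
replace_stream(src, dst, b"567", b"", chunk_size=5)
print(dst.getvalue())  # b'1234890'
```

With real files, src and dst would be opened with "rb" and "wb". Memory use stays at roughly one chunk plus len(old) bytes.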


2008-10-21 13:18:37 dbr

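Thanks, this helps a lot. I was hoping numpy would do some automatic memory management for large files - I'm not that familiar with it. – bluegray 2008-10-24 12:00:08

If two copies fit in memory, then you can easily make a copy. The second copy is the compressed version. Sure, you can use numpy, but you can also use the array package. Additionally, you can treat your big binary object as a string of bytes and manipulate it directly.

It sounds like your file may be really large, and you can't fit two copies into memory. (You didn't provide a lot of details, so this is just a guess.) You'll have to do your compression in chunks: read in a chunk, do some processing on that chunk and write it out. Again, numpy, array or a simple string of bytes will work fine.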
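For the case where both copies do fit in memory, a minimal sketch with a plain bytes object might look like this. The function name collapse_runs and the min_run threshold are assumptions made for illustration; the question does not say how long a run has to be before it counts as "long".

```python
import re

def collapse_runs(data, min_run=2):
    """Return data with every run of min_run or more identical bytes cut to one byte."""
    if min_run < 2:
        raise ValueError("min_run must be at least 2")
    # (.) captures one byte; \1{n,} requires at least n more copies of it
    pattern = re.compile(rb"(.)\1{%d,}" % (min_run - 1), re.DOTALL)
    return pattern.sub(lambda m: m.group(1), data)

# In practice data would come from open(path, "rb").read()
print(collapse_runs(b"1122233333444555"))             # b'12345'
print(collapse_runs(b"1122233333444555", min_run=3))  # b'112345'
```

re.DOTALL matters here because without it the dot would not match a newline byte, which is common in binary data.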


2008-10-21 10:33:57

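You need to make your question more precise. Do you know the values you want to trim ahead of time?

Assuming you do, I would probably search for the matching sections using subprocess to run "fgrep -o -b <search string>", and then change the relevant sections of the file using the Python file object's seek, read and write methods.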
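As a rough sketch of the second half of that idea: once the byte offsets of the matches are known, the file can be copied while skipping each matched range. The helper names below (parse_offsets, copy_span, copy_without_matches) are hypothetical, and the "offset:match" line format is an assumption about what grep prints with -o -b; check it against your grep before relying on it. Note also that grep works line by line, so this route may not suit every binary file.

```python
import io

def parse_offsets(grep_output):
    """Extract integer byte offsets from lines assumed to look like b'4:567'."""
    offsets = []
    for line in grep_output.splitlines():
        if line:
            offsets.append(int(line.split(b":", 1)[0]))
    return offsets

def copy_span(src, dst, start, count, chunk_size):
    """Copy count bytes of src, beginning at start, into dst."""
    src.seek(start)
    while count > 0:
        chunk = src.read(min(chunk_size, count))
        if not chunk:
            break
        dst.write(chunk)
        count -= len(chunk)

def copy_without_matches(src, dst, offsets, match_len, chunk_size=64 * 1024):
    """Copy src to dst, leaving out match_len bytes at each offset."""
    pos = 0
    for start in sorted(offsets):
        if start < pos:
            continue  # overlaps a range that was already skipped
        copy_span(src, dst, pos, start - pos, chunk_size)
        pos = start + match_len
    # Copy everything after the last match
    src.seek(pos)
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)

# Hand-written stand-in for the search tool's output
fake_output = b"4:567\n10:567\n"
src = io.BytesIO(b"1234567890567")
dst = io.BytesIO()
copy_without_matches(src, dst, parse_offsets(fake_output), match_len=3)
print(dst.getvalue())  # b'1234890'
```

To replace rather than delete, dst.write(replacement) could be added after each copy_span call.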


2008-10-21 12:48:00

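dbr's solution is a good idea but a bit overly complicated. All you really have to do is rewind the file pointer by the length of the sequence you are searching for before you read the next chunk.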

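def ReplaceSequence(inFilename, outFilename, oldSeq, newSeq):
    # oldSeq and newSeq must be bytes, since both files are opened in binary mode.
    # Caveat: this assumes len(newSeq) == len(oldSeq), and a match that only
    # partly falls in the rewound region can be overwritten by the next write.
    inputFile = open(inFilename, "rb")
    outputFile = open(outFilename, "wb")

    chunk = 1024

    while True:
        data = inputFile.read(chunk)
        dataSize = len(data)
        outputFile.write(data.replace(oldSeq, newSeq))

        # A short read means the end of the file was reached
        if dataSize < chunk:
            break

        # Rewind both files so a match split across chunks is read again
        inputFile.seek(-len(oldSeq), 1)
        outputFile.seek(-len(oldSeq), 1)

    inputFile.close()
    outputFile.close()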


2009-06-17 17:04:42

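This generator-based version keeps exactly one character of the file content in memory at a time.

Note that I am taking your question title quite literally: you want to reduce runs of the same character to a single character. For replacing patterns in general, this does not work: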

import io

def gen_chars(stream):
    # Yield the stream one byte at a time until it is exhausted
    while True:
        ch = stream.read(1)
        if ch:
            yield ch
        else:
            break

def gen_unique_chars(stream):
    # Yield a byte only when it differs from the one before it
    lastchar = b''
    for char in gen_chars(stream):
        if char != lastchar:
            yield char
        lastchar = char

def remove_seq(infile, outfile):
    for ch in gen_unique_chars(infile):
        outfile.write(ch)

# Represents a file open for reading
infile = io.BytesIO(b"1122233333444555")

# Represents a file open for writing
outfile = io.BytesIO()

# Will print b'12345'
remove_seq(infile, outfile)
outfile.seek(0)
print(outfile.read())
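Calling read(1) for every byte is slow on large files. One possible variant, sketched here with made-up names (collapse_runs_stream, chunk_size), reads larger chunks, squeezes runs inside each chunk with a regular expression, and remembers the last byte of the previous chunk so that a run spanning a chunk boundary is still collapsed.

```python
import io
import re

# One byte followed by one or more copies of itself
_RUN = re.compile(rb"(.)\1+", re.DOTALL)

def collapse_runs_stream(src, dst, chunk_size=64 * 1024):
    """Copy src to dst, reducing each run of identical bytes to a single byte."""
    last = b""  # last byte of the previous chunk
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        squeezed = _RUN.sub(lambda m: m.group(1), chunk)
        # If this chunk continues the previous run, its first byte is a repeat
        if squeezed[:1] == last:
            squeezed = squeezed[1:]
        dst.write(squeezed)
        last = chunk[-1:]

infile = io.BytesIO(b"1122233333444555")
outfile = io.BytesIO()
collapse_runs_stream(infile, outfile, chunk_size=3)
print(outfile.getvalue())  # b'12345'
```

Only one chunk is held in memory at a time, so the chunk size can be tuned to the available RAM.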


2009-06-17 17:23:38 Triptych

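AJMayorga's suggestion is fine unless the replacement string has a different size, or the sequence being replaced sits at the end of a chunk.

I fixed it like this: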

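def ReplaceSequence(inFilename, outFilename, oldSeq, newSeq):
    # oldSeq and newSeq must be bytes, since both files are opened in binary mode
    inputFile = open(inFilename, "rb")
    outputFile = open(outFilename, "wb")

    chunk = 1024
    oldSeqLen = len(oldSeq)

    while True:
        data = inputFile.read(chunk)
        dataSize = len(data)

        # How far to rewind: the bytes left after the last match, at most oldSeqLen
        # (with no match, rfind returns -1 and the cap normally applies)
        seekLen = dataSize - data.rfind(oldSeq) - oldSeqLen
        if seekLen > oldSeqLen:
            seekLen = oldSeqLen

        data = data.replace(oldSeq, newSeq)
        outputFile.write(data)

        # A short read means the end of the file was reached
        if dataSize < chunk:
            break

        # Re-read the unmatched tail together with the next chunk
        inputFile.seek(-seekLen, 1)
        outputFile.seek(-seekLen, 1)

    inputFile.close()
    outputFile.close()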


2012-11-28 18:30:10 edasx
