Here is a Python script you can use to split a large file with subprocess:
"""
Splits the file into the same directory and
deletes the original file
"""
import subprocess
import sys
import os
SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2' # subprocess expects a string, i.e. 2 = aa, ab, ac etc..
if __name__ == "__main__":
file_path = sys.argv[1]
# i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
os.path.dirname(file_path) + '/'])
# Remove the original file once done splitting
try:
os.remove(file_path)
except OSError:
pass
You can then call it externally:
import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))
You can also import subprocess and run it directly from within your program.
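For example, a minimal sketch of doing the same thing in-process (assuming local_file_path holds the path of the file to split and the split command is available, as above):

import os
import subprocess

# Same invocation as the script above, just called from inside your program
subprocess.call(["split", "-a", "2", "-l", "5000", local_file_path,
                 os.path.dirname(local_file_path) + '/'])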
The problem with this approach is high memory usage: subprocess creates a fork with a memory footprint the same size as your process, so if your process is already using a lot of memory, it doubles for the duration of the call. The same thing happens with os.system.
Here is another, pure Python way of doing it. I haven't tested it on huge files and it will be slower, but it is leaner on memory:
CHUNK_SIZE = 5000

def yield_csv_rows(reader, chunk_size):
    """
    Opens file to ingest, reads each line to return list of rows
    Expects the header is already removed
    Replacement for ingest_csv
    :param reader: DictReader
    :param chunk_size: int, chunk size
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk
import unicodecsv

with open(local_file_path, 'rb') as f:
    # Keep the header line so its column names can be used as the field names
    header = f.readline().strip().replace('"', '')
    reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
    chunks = yield_csv_rows(reader, CHUNK_SIZE)
    for chunk in chunks:
        if not chunk:
            break
        # Do something with your chunk here
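If it helps, here is a minimal sketch of one thing that loop body could do (write_chunk and the chunk_NNNN.csv naming are only illustrative, not part of the original answer): write every chunk out as its own CSV file next to the source file.

import os
import unicodecsv

def write_chunk(chunk, index, fieldnames, out_dir):
    # Illustrative helper: dump one chunk of rows to chunk_0000.csv, chunk_0001.csv, ...
    out_path = os.path.join(out_dir, 'chunk_{:04d}.csv'.format(index))
    with open(out_path, 'wb') as out:
        writer = unicodecsv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(chunk)

Inside the loop above you would then call something like write_chunk(chunk, chunk_index, header.split(','), os.path.dirname(local_file_path)), incrementing chunk_index yourself.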
Unwelcome suggestion: get a better text editor. :-) If you're on Windows, EmEditor is the one I know of that can edit huge files seamlessly without loading them fully into memory. – bobince 2008-11-15 13:00:35