2013-12-07 47 views
0

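Finding the largest file inside a tarball

I have a gzip-compressed tarball containing 13,000 files. How can I extract only the largest file in it from a Python program?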

I have tried reading through the tarball and checking the extracted length of each file, but that takes far too long. Is there a better way to do this?

Original code (added to the question for completeness, even though an answer has already been selected):

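import tarfile

# `filename` is the path to the gzip-compressed tarball.
# tarfile.open() detects the compression; the bare TarFile constructor does not.
archive = tarfile.open(filename)
members = archive.getmembers()
sizes = []
for member in members:
    sizes.append(member.size)
largest = max(sizes)
# sizes.index() returns a position, so look the TarInfo up in `members`.
largest_info = members[sizes.index(largest)]
print(largest_info.name)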

How do you expect to find the largest file without looking at every file in the tarball?

Answers

3

Have you looked at the documentation?

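import tarfile

# tarfile.open() handles gzip-compressed archives transparently;
# tarfile.TarFile(...) only reads uncompressed ones.
archive = tarfile.open('/path/to/my/tarfile.tar')
max_size = 0
max_name = None
# member.size comes from the header, so nothing is extracted here.
for member in archive.getmembers():
    if member.size > max_size:
        max_size = member.size
        max_name = member.name

print(max_size)
print(max_name)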

Wouldn't the built-in `max` function be better? `max(archive.getmembers(), key=operator.itemgetter('size'))` – mgilson


I get `TypeError: 'TarInfo' object is not subscriptable`.


`max(archive.getmembers(), key=operator.attrgetter('size'))` seems to work fine. – Alphadelta14
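
For anyone who wants to try the `attrgetter` variant from the comment above without a real archive, here is a minimal self-contained sketch. The in-memory tarball, its file names, and the helper names `largest_member` and `build_demo_tarball` are assumptions made up for illustration; the `isfile()` filter is an extra precaution that the comments did not mention, so directories and links are ignored.

```python
import io
import operator
import tarfile


def largest_member(archive):
    """Return the TarInfo of the largest regular file in an open archive."""
    # Skip directories, links and other special members.
    regular_files = [m for m in archive.getmembers() if m.isfile()]
    # attrgetter reads the .size attribute; itemgetter would try member['size'].
    return max(regular_files, key=operator.attrgetter('size'))


def build_demo_tarball():
    """Build a small gzip-compressed tarball in memory (illustrative data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
        for name, payload in [('a.txt', b'x' * 10),
                              ('b.txt', b'x' * 500),
                              ('c.txt', b'x' * 50)]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


with tarfile.open(fileobj=build_demo_tarball(), mode='r:gz') as archive:
    biggest = largest_member(archive)
    print(biggest.name, biggest.size)
```

With a real archive you would replace the `fileobj=...` argument with the path to your tarball. Note that `max` raises `ValueError` if the archive contains no regular files.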

2

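The answer is that you have to look through the whole archive to find the largest member. This is because the TAR format is meant for archiving, and so it has no table of contents (TOC):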

The possible reason for not using a centralized location of information is that tar was originally meant for tapes, which are bad at random access anyway: if the Table Of Contents (TOC) were at the start of the archive, creating it would mean to first calculate all the positions of all files, which needs doubled work, a big cache, or rewinding the tape after writing everything to write the TOC

Meanwhile, working code has already been provided for you.
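
To make the point about the missing table of contents concrete, here is a rough sketch of what "look through the whole archive" amounts to in Python: one pass over the member headers to find the largest regular file, then a read of that single member's contents. The helper name `read_largest_file` and the in-memory demo archive are assumptions for illustration only. Keep in mind that with a gzip-compressed tarball the stream still has to be decompressed sequentially to reach each header, so the scan cannot be skipped, only kept cheap by not extracting anything else.

```python
import io
import tarfile


def read_largest_file(archive):
    """Scan the headers once, then read only the largest regular file."""
    biggest = None
    # Iterating a TarFile yields one TarInfo per header, in archive order.
    for member in archive:
        if member.isfile() and (biggest is None or member.size > biggest.size):
            biggest = member
    if biggest is None:
        return None, b''
    # extractfile() returns a file-like object for just this member.
    with archive.extractfile(biggest) as handle:
        return biggest.name, handle.read()


# Hypothetical demo data: a tiny gzip-compressed tarball built in memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
    for name, payload in [('small.txt', b'a' * 16),
                          ('big.bin', b'b' * 4096),
                          ('mid.txt', b'c' * 256)]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

with tarfile.open(fileobj=buffer, mode='r:gz') as archive:
    name, data = read_largest_file(archive)
print(name, len(data))
```

To write the member to disk instead of reading it into memory, `archive.extract(biggest, path=...)` could be used in place of `extractfile()`.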