2015-10-16
1

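Problem sorting a large number of files in Python

I have a directory containing more than 10,000 files, all with the same extension. They all have the same form, for example: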

20150921(1)_0001.sgy 
20150921(1)_0002.sgy 
20150921(1)_0003.sgy 
20150921(1)_0004.sgy 
... 
20150921(1)_13290.sgy 

The code I am currently using is:

from os import listdir

files = listdir('full data')
files.sort()

But it returns a list like this:

20150921(1)_0001.sgy 
... 
20150921(1)_0998.sgy 
20150921(1)_0999.sgy 
20150921(1)_1000.sgy 
20150921(1)_10000.sgy 
20150921(1)_10001.sgy 
20150921(1)_10002.sgy 
20150921(1)_10003.sgy 
20150921(1)_10004.sgy 
20150921(1)_10005.sgy 
20150921(1)_10006.sgy 
20150921(1)_10007.sgy 
20150921(1)_10008.sgy 
20150921(1)_10009.sgy 
20150921(1)_1001.sgy 
20150921(1)_10010.sgy 

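The problem only appears when there are more than 1000 files; the sort does not seem to order the files correctly once their numbers reach 10000. Can anyone see a workaround?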

+0

I think the cause of the slowness is not Python's sort; it is the file system. –

+0

@EugeneSoldatov Slow? I don't think the OP mentioned speed anywhere in this question. – SethMMorton

Answers

4

This is called a natural sort. You can use the natsort package to do it:

from natsort import natsorted 
import pprint 

files = ['20150921(1)_0001.sgy', 
'20150921(1)_0102.sgy', 
'20150921(1)_0011.sgy', 
'20150921(1)_0003.sgy', 
'20150921(1)_0004.sgy', 
'20150921(1)_0010.sgy', 
'20150921(1)_1001.sgy', 
'20150921(1)_0012.sgy', 
'20150921(1)_0101.sgy', 
'20150921(1)_1003.sgy', 
'20150921(1)_0103.sgy', 
'20150921(1)_10002.sgy', 
'20150921(1)_1002.sgy', 
'20150921(1)_10001.sgy', 
'20150921(1)_0002.sgy', 
] 

pprint.pprint(natsorted(files)) 

This outputs:

['20150921(1)_0001.sgy', 
'20150921(1)_0002.sgy', 
'20150921(1)_0003.sgy', 
'20150921(1)_0004.sgy', 
'20150921(1)_0010.sgy', 
'20150921(1)_0011.sgy', 
'20150921(1)_0012.sgy', 
'20150921(1)_0101.sgy', 
'20150921(1)_0102.sgy', 
'20150921(1)_0103.sgy', 
'20150921(1)_1001.sgy', 
'20150921(1)_1002.sgy', 
'20150921(1)_1003.sgy', 
'20150921(1)_10001.sgy', 
'20150921(1)_10002.sgy'] 
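
If installing a package is not an option, the same idea can be approximated with the standard library alone. The following is a minimal sketch, not natsort's actual implementation: the helper name `natural_key` is hypothetical, and it assumes names made of plain text and unsigned integers (no signs, decimals, or locale handling).

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so digit runs compare as numbers."""
    # A capturing group makes re.split keep the digit runs, so text chunks sit
    # at even indexes and digit chunks at odd indexes of the result.
    parts = re.split(r'(\d+)', name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

files = ['20150921(1)_1001.sgy',
         '20150921(1)_10000.sgy',
         '20150921(1)_0999.sgy',
         '20150921(1)_1000.sgy']

natural = sorted(files, key=natural_key)
print(natural)
```

Because every key alternates string, integer, string in the same positions, Python never has to compare a string with an integer while sorting.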
+1

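If you eventually need to sort full paths (i.e. file names with their directories), adding the 'PATH' algorithm modifier will help you get the result you expect: 'from natsort import natsorted, ns; natsorted(files, alg=ns.PATH)' – SethMMorton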

+1

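@Jingh To understand why this works (and why the default sort does not), read the first few paragraphs of the 'natsort' documentation, which explain what is going on. – SethMMorton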

0
import os

sorted_filenames = sorted(os.listdir('full data'), key=lambda s: int(s.rsplit('.', 1)[0].split('_', 1)[1]))
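
The one-liner above can be unpacked into a named key function to make each step visible. This is only a sketch: the name `sequence_number` and the sample list are illustrative, and it assumes every name contains an underscore followed by digits and then an extension.

```python
def sequence_number(name):
    """Return the integer between the first underscore and the extension."""
    stem = name.rsplit('.', 1)[0]       # drop the extension: '20150921(1)_0001'
    return int(stem.split('_', 1)[1])   # keep what follows the first '_': 1

names = ['20150921(1)_10000.sgy',
         '20150921(1)_0002.sgy',
         '20150921(1)_1001.sgy']

ordered = sorted(names, key=sequence_number)
print(ordered)
```

Note that this key ignores everything before the underscore, so files with different date prefixes would be interleaved by their trailing number alone.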
0

They are being sorted alphabetically. If you want to sort them numerically, you need to parse them first:

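import os
import re

def filename_to_tuple(name):
    # Parse e.g. '20150921(1)_0001.sgy' into the tuple (20150921, 1, 1).
    match = re.match(r'(\d+)\((\d+)\)_(\d+)\.sgy', name)
    if not match:
        raise ValueError("Filename doesn't match expected pattern")
    # Convert each captured group to int so tuples compare numerically.
    return tuple(int(i) for i in match.groups())

sorted_files = sorted(os.listdir('full data'), key=filename_to_tuple)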
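
To see why alphabetical ordering produces the sequence shown in the question, it may help to look at how strings are compared: character by character, using code points. The short sketch below uses three of the names from the question; nothing else is assumed.

```python
names = ['20150921(1)_1001.sgy',
         '20150921(1)_10000.sgy',
         '20150921(1)_1000.sgy']

default_order = sorted(names)
print(default_order)

# '_1000.sgy' and '_10000.sgy' first differ where one has '.' and the other '0'.
# '.' has a lower code point than '0', so the 1000 file sorts first.
print(ord('.'), ord('0'))

# '_10000.sgy' and '_1001.sgy' first differ at '0' versus '1',
# so 10000 sorts ahead of 1001 even though it is the larger number.
print('10000' < '1001')
```

Converting the digits to integers before comparing, as the key functions in the answers do, removes this effect.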