2013-10-24 88 views
2

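Regex to include files with no extension and all extensions except some (png or jpg)

I want to come up with a regex that filters out one or more specific file types (extensions) while scanning a root folder with os.walk. My folder structure (the one to be searched) looks like this. Note the files that have no extension.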

Directory: D:\Projects\5 Codes Cleaned\2012 

SG 
|---SG.zip 
|---SOIL-Average.jpg 
|---SWAT-Average.jpg 
|---Test 
1500_LT_Capped_2012 -P 
1500_LT_Capped_2012 -P.npy 
1500_LT_Capped_2012 -Sg 
1500_LT_Capped_2012 -Sg.npy 
1500_LT_Capped_2012 -So 
1500_LT_Capped_2012 -So.npy 
1500_LT_Capped_2012 -Sw 
1500_LT_Capped_2012 -Sw.npy 
PRESSURE-Average.png 
SGAS-Average.png 
SOIL-Average.png 
SWAT-Average.png 

Or, in list format:

[u'D:\\Projects\\5 Codes Cleaned\\2012\\1500_LT_Capped_2012 -P', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\1500_LT_Capped_2012 -P.npy', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\1500_LT_Capped_2012 -Sg', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\1500_LT_Capped_2012 -Sg.npy', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\1500_LT_Capped_2012 -So', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\1500_LT_Capped_2012 -So.npy', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\1500_LT_Capped_2012 -Sw', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\1500_LT_Capped_2012 -Sw.npy', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\PRESSURE-Average.png', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\SGAS-Average.png', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\SOIL-Average.png', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\SWAT-Average.png', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\SG\\SG.zip', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\SG\\SOIL-Average.jpg', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\SG\\SWAT-Average.jpg', 
u'D:\\Projects\\5 Codes Cleaned\\2012\\SG\\Test'] 

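I looked through some threads here to get ideas, but I would like to know whether there is a simpler way to do this. So far I have tried the following patterns to filter the os.walk results: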

regex = "^.*(?<!\.png)(?<!\.npy)$"  
# The only working one but tends to get messy 
# as more file types are to be excluded! 

&

regex = "^(.+?)(?:\.(?:png|jpg))*$" 
# Does not filter out jpg or png...list all files 

&

regex = '^.*\.(?!jpg$|png$)[^.]+$'  
# Filters out png & jpg but Does not include No-Extensions ! 

&

regex = '^.*\.*(?!.jpg$|.png$)'  
# Does not filter out png & jpg file 
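The behaviour noted in the comments on these four patterns can be checked with a small harness. This is only a sketch: the sample names are taken from the listing above, while the labels and the helper `kept` are hypothetical names introduced here for illustration.

```python
import re

# Base names taken from the folder listing in the question.
names = ["1500_LT_Capped_2012 -P", "1500_LT_Capped_2012 -P.npy",
         "SOIL-Average.png", "SWAT-Average.jpg", "SG.zip", "Test"]

# The four attempted patterns, written as raw strings.
patterns = {
    "lookbehind": r"^.*(?<!\.png)(?<!\.npy)$",
    "optional-group": r"^(.+?)(?:\.(?:png|jpg))*$",
    "lookahead-after-dot": r"^.*\.(?!jpg$|png$)[^.]+$",
    "misplaced-lookahead": r"^.*\.*(?!.jpg$|.png$)",
}

def kept(pattern):
    """Return the names that the pattern matches, i.e. the ones that would be kept."""
    return [n for n in names if re.match(pattern, n)]

for label, pattern in patterns.items():
    print(label, kept(pattern))
```

The second and fourth patterns keep every name because the part that mentions the extension is optional or is tested at a position where it can never match. The third one requires a literal dot, which is why names without an extension are dropped.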

Answers

2

Why not just use os.path.splitext and a list comprehension?

import os
disallowed_types = ['.png', '.jpg']  # os.path.splitext keeps the leading dot

allowed = [x for x in allfiles if os.path.splitext(x)[1] not in disallowed_types] 
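As a rough sketch, the same idea could be applied directly while walking the tree. The generator name `walk_excluding` and the case-insensitive comparison are assumptions added here, not part of the answer. Keep in mind that os.path.splitext returns the extension including its leading dot, and an empty string when there is no extension.

```python
import os

def walk_excluding(root, disallowed_exts=(".png", ".jpg")):
    """Yield full paths under root whose extension is not in disallowed_exts."""
    # Assumption: compare extensions case-insensitively (".JPG" == ".jpg").
    disallowed = {ext.lower() for ext in disallowed_exts}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext("a.png") gives ("a", ".png"); splitext("Test") gives ("Test", "")
            ext = os.path.splitext(name)[1].lower()
            if ext not in disallowed:
                yield os.path.join(dirpath, name)
```

Files without an extension pass through naturally, because their extension is the empty string, which is never in the disallowed set.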

But if you must use a regex, this seems to work, although inverted (it matches the files to exclude):

regex = '[^.]*?\.+(jpg$|png$)' 

That way, if a name matches this pattern, it is a JPG or PNG and should not be included; otherwise it is safe to include in the list.
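A minimal sketch of that inverted use follows. The pattern here is a simplified variant of the one above (it only anchors on the trailing extension), and `re.IGNORECASE`, `EXCLUDE`, and `keep` are assumptions introduced for illustration.

```python
import re

# Matches names that END in .jpg or .png, i.e. the ones to drop.
# Simplified from the answer's pattern; case-insensitivity is an added assumption.
EXCLUDE = re.compile(r"\.(?:jpg|png)$", re.IGNORECASE)

def keep(names):
    """Return the names that do NOT match the exclusion pattern."""
    return [n for n in names if not EXCLUDE.search(n)]

files = ["SG.zip", "SOIL-Average.jpg", "Test", "SWAT-Average.png",
         "1500_LT_Capped_2012 -P.npy"]
print(keep(files))
```

Using `search` rather than `match` avoids having to describe everything that precedes the extension, which also keeps the pattern working on full paths that contain dots in directory names.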

+1

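This is a good approach for this particular task. However, I am trying to write a general-purpose pattern filter for os.walk that I can reuse across different scripts, so I am leaning towards a regex. – Moe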

0

Why not simply:

>>> extensions = "png jpg npy".split() 
>>> regex = "^.*%s$" 
>>> regex%"".join("(?<!\.%s)"%i for i in extensions) 
'^.*(?<!\\.png)(?<!\\.jpg)(?<!\\.npy)$' 

Then:

import re
compiled = re.compile(regex % "".join(r"(?<!\.%s)" % i for i in extensions))
a = [...]  # list of file names or paths

# Files that DO end in one of the listed extensions ('png', 'jpg', 'npy');
# the pattern does not match them:
[x for x in a if not compiled.match(x)]

# All OTHER files (no extension, .zip, ...); the pattern matches them:
[x for x in a if compiled.match(x)]
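For a reusable filter, one possible sketch builds the pattern from a list of extensions and applies it inside os.walk. It swaps the chained lookbehinds for a single negative lookahead, which is a different technique from the one in this answer. The function names `make_exclusion_regex` and `filtered_walk`, the use of `re.escape`, and the case-insensitive flag are all assumptions made for illustration.

```python
import os
import re

def make_exclusion_regex(extensions):
    """Compile a pattern matching any name that does NOT end in one of the extensions."""
    # re.escape guards against extensions containing regex metacharacters.
    alternation = "|".join(re.escape(ext) for ext in extensions)
    # Single negative lookahead: reject the name if it ends in .<ext>.
    return re.compile(r"^(?!.*\.(?:%s)$).*$" % alternation, re.IGNORECASE)

def filtered_walk(root, pattern):
    """Yield full paths under root whose file name matches the compiled pattern."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if pattern.match(name):
                yield os.path.join(dirpath, name)

pattern = make_exclusion_regex(["png", "jpg", "npy"])
print(bool(pattern.match("Test")), bool(pattern.match("SOIL-Average.png")))
```

Matching against the base name rather than the full path means dots in directory names cannot affect the result, and adding another excluded type only requires extending the list.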