2017-10-19 143 views
2

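I have a DataFrame of paths. The task is to get each folder's last modification time into a separate column using something like datetime.fromtimestamp(os.path.getmtime('PATH_HERE')), vectorised in pandas rather than with a loop.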

import pandas as pd
import numpy as np
import os
from datetime import datetime


df1 = pd.DataFrame({'Path' : ['C:\\Path1' ,'C:\\Path2', 'C:\\Path3']}) 

# For an MCVE, use the commented-out code below. WARNING: this WILL create directories on your machine.
# for path in df1['Path']:
#     os.mkdir(os.path.join(r'PUT_YOUR_PATH_HERE', os.path.basename(path)))

I can get the last modification time of each folder with the code below, but it is a slow loop if I have a lot of folders:

for each_path in df1['Path']: 
    df1.loc[df1['Path'] == each_path, 'Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(each_path)) 

How can I vectorise this process to speed it up? os.path.getmtime cannot accept a Series. I'm looking for something like:

df1['Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(df1['Path']))

'df1['Path'].apply(lambda x: datetime.fromtimestamp(os.path.getmtime(x)))'?? – Dark

If 'os.path.getmtime' can't accept a Series, then broadcasting can't be done, so I don't think you can get a vectorised solution. – Dark
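
To make the point in the two comments above concrete, here is a minimal sketch. It assumes three throwaway directories created with tempfile as stand-ins for the question's folders (the names Path1 to Path3 are only placeholders). It shows that os.path.getmtime rejects a whole Series, and that Series.apply works but still calls the function once per row.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd

# Hypothetical stand-in for the question's folders: three throwaway directories.
with tempfile.TemporaryDirectory() as root:
    paths = [os.path.join(root, name) for name in ("Path1", "Path2", "Path3")]
    for p in paths:
        os.mkdir(p)

    df1 = pd.DataFrame({"Path": paths})

    # os.path.getmtime expects a single path, so passing the whole column fails.
    try:
        os.path.getmtime(df1["Path"])
        series_accepted = True
    except TypeError:
        series_accepted = False

    # Series.apply still calls the function once per row: it is a tidier loop,
    # not a vectorised operation.
    df1["Last Modification Time"] = df1["Path"].apply(
        lambda p: datetime.fromtimestamp(os.path.getmtime(p))
    )

print(series_accepted)
print(df1)
```

Because each row needs its own filesystem lookup, any speed-up here comes from reducing pandas overhead around the call, not from avoiding the per-path call itself.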

@Bharathshetty, the apply method *was* faster in my short test. About 300 ms per loop. Unfortunately, as I feared, a vectorised solution does not seem to be possible – MattR

Answers

0

I'm going to assume 100 paths and compare 3 methods. I think method 3 is preferable.

# Data initialisation: 
paths100 = ['a_whatever_path_here'] * 100 
df = pd.DataFrame(columns=['paths', 'time']) 
df['paths'] = paths100 


def fun1():
    # Naive for loop. High readability, slow.
    for path in df['paths']:
        mask = df['paths'] == path
        df.loc[mask, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))


def fun2():
    # Naive for loop optimised. Medium readability, medium speed.
    for i, path in enumerate(df['paths']):
        df.loc[i, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))


def fun3():
    # List comprehension. High readability, high speed.
    df['time'] = [datetime.fromtimestamp(os.path.getmtime(path)) for path in df['paths']]


%timeit fun1()
>>> 164 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit fun2()
>>> 11.6 ms ± 67.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit fun3()
>>> 13.1 ns ± 0.0327 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
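
The timings above can be re-measured outside IPython with the standard timeit module. The sketch below is not the original benchmark. It assumes 100 real, empty directories created under a temporary folder, and the helper names (mtime, loc_loop, list_comprehension) are made up for illustration. Re-measuring seems worthwhile because a figure in the nanosecond range is much smaller than 100 filesystem lookups would normally take.

```python
import os
import tempfile
import timeit
from datetime import datetime

import pandas as pd


def mtime(path):
    # Last modification time of one path as a datetime (local time).
    return datetime.fromtimestamp(os.path.getmtime(path))


def loc_loop(paths):
    # Row-by-row .loc assignment, in the spirit of fun2.
    df = pd.DataFrame({"paths": paths})
    df["time"] = pd.Series([None] * len(df), dtype="object")
    for i, path in enumerate(df["paths"]):
        df.loc[i, "time"] = mtime(path)
    return df


def list_comprehension(paths):
    # Build the whole column in a single assignment, in the spirit of fun3.
    df = pd.DataFrame({"paths": paths})
    df["time"] = [mtime(path) for path in df["paths"]]
    return df


with tempfile.TemporaryDirectory() as root:
    # Hypothetical test data: 100 real (empty) directories.
    paths = []
    for n in range(100):
        p = os.path.join(root, f"dir_{n}")
        os.mkdir(p)
        paths.append(p)

    slow = loc_loop(paths)
    fast = list_comprehension(paths)

    # The lambdas *call* the functions, so the filesystem work is what gets timed.
    t_loop = timeit.timeit(lambda: loc_loop(paths), number=5) / 5
    t_comp = timeit.timeit(lambda: list_comprehension(paths), number=5) / 5

print(f"loc loop:           {t_loop * 1e3:.2f} ms per run")
print(f"list comprehension: {t_comp * 1e3:.2f} ms per run")
```

Absolute numbers depend on the machine and filesystem, so only the relative ordering is worth reading from the output.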
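#3 works for me, and it was the fastest in my testing – MattR

Interestingly, I used the same kind of logic to test other functions similar to this problem; #3 is only faster for *this* specific scenario. The apply method that @Bharath shetty mentioned in the comments was the fastest in other scenarios – MattR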

0

You can use groupby transform (which lets you make the expensive call only once per group):

g = df1.groupby("Path")["Path"]
# getmtime returns seconds since the epoch, so pass unit="s"
s = pd.to_datetime(g.transform(lambda x: os.path.getmtime(x.name)), unit="s")
df1["Last Modification Time"] = s # putting this on two lines so it looks nicer...
This only saves time when the Path column has duplicates... –
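
To illustrate the comment above, here is a small sketch of the "one call per distinct path" idea. It swaps in a plain dictionary lookup plus Series.map in place of groupby transform. The counting wrapper counted_getmtime and the two temporary directories are assumptions made only for this demonstration.

```python
import os
import tempfile

import pandas as pd

calls = 0


def counted_getmtime(path):
    # Wrapper that counts how often the filesystem is queried (illustration only).
    global calls
    calls += 1
    return os.path.getmtime(path)


with tempfile.TemporaryDirectory() as root:
    a = os.path.join(root, "a")
    b = os.path.join(root, "b")
    os.mkdir(a)
    os.mkdir(b)

    # Six rows, but only two distinct paths.
    df1 = pd.DataFrame({"Path": [a, b, a, a, b, a]})

    # Look each distinct path up once, then map the results back onto every row.
    lookup = {p: counted_getmtime(p) for p in df1["Path"].unique()}
    df1["Last Modification Time"] = pd.to_datetime(df1["Path"].map(lookup), unit="s")

print(calls)  # 2
```

With six rows and two distinct paths, the filesystem is queried twice. If every path is unique, the number of calls equals the number of rows and nothing is saved.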

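I won't have duplicate paths, but this will certainly be handy for other code problems. As a side note: add 'datetime.fromtimestamp()' around 'os.path.getmtime', otherwise the values are incorrect – MattR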
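
The side note above comes down to how pd.to_datetime reads bare numbers: without a unit it treats them as nanoseconds since the epoch, whereas os.path.getmtime returns seconds. A minimal sketch, using a hard-coded example timestamp rather than a real file:

```python
import pandas as pd

# A Unix timestamp in seconds, the kind of value os.path.getmtime returns.
seconds = pd.Series([1508371200.0])  # 2017-10-19 00:00:00 UTC

as_nanoseconds = pd.to_datetime(seconds)        # no unit: read as nanoseconds
as_seconds = pd.to_datetime(seconds, unit="s")  # explicit unit: read as seconds

print(as_nanoseconds.iloc[0])  # a moment in 1970
print(as_seconds.iloc[0])
```

Note that unit="s" yields UTC-based timestamps, while datetime.fromtimestamp converts to local time, so the two approaches can differ by the local UTC offset.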

@AndyHayden because the OP has a unique path for each folder – Dark