2016-03-04 43 views

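Iterating through a Pandas DataFrame, using conditions, and adding a column

I have purchase data and want to label each record with a new column that describes the time of day of the purchase. To do this, I use the hour from each purchase's timestamp column.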

The labels should work like this:

hour 4 - 7 => 'morning' 
hour 8 - 11 => 'before midday' 
... 

I have already extracted the hour from the timestamp. I now have a DataFrame with 50 million records that looks like this:

user_id timestamp    hour 
0 11  2015-08-21 06:42:44 6 
1 11  2015-08-20 13:38:58 13 
2 11  2015-08-20 13:37:47 13 
3 11  2015-08-21 06:59:05 6 
4 11  2015-08-20 13:15:21 13 

My current approach is to call .iterrows() six times, once for each condition:

for index, row in basket_times[(basket_times['hour'] >= 4) & (basket_times['hour'] < 8)].iterrows(): 
    basket_times['periode'] = 'morning' 

then:

for index, row in basket_times[(basket_times['hour'] >= 8) & (basket_times['hour'] < 12)].iterrows(): 
    basket_times['periode'] = 'before midday' 

and so on.

However, just one of the six loops over the 50 million records has already taken an hour. Is there a better way?

Answers

1

You can define a function that maps each hour range to the string you want, and then use map:

def get_periode(hour):
    if 4 <= hour <= 7:
        return 'morning'
    elif 8 <= hour <= 11:
        return 'before midday'

basket_times['periode'] = basket_times['hour'].map(get_periode)
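To see what the map approach does with hours that fall outside the defined ranges, here is a minimal sketch on a small made-up DataFrame (the sample hours are an assumption, not the asker's data). Hours that match no branch come back as missing values, since the function returns None for them.

```python
import pandas as pd

def get_periode(hour):
    """Return the label for an hour of the day, or None if no range matches."""
    if 4 <= hour <= 7:
        return 'morning'
    elif 8 <= hour <= 11:
        return 'before midday'
    return None

# Made-up sample hours; 13 and 2 are deliberately outside both ranges.
basket_times = pd.DataFrame({'hour': [6, 13, 9, 6, 2]})
basket_times['periode'] = basket_times['hour'].map(get_periode)
print(basket_times)
```

If every hour should get a label, the remaining ranges need their own branches in the function.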

Works perfectly! I also found out that my approach did not work at all.

0

You can try using loc with boolean masks. I changed the df for testing:

print(basket_times)
    user_id   timestamp hour 
0  11 2015-08-21 06:42:44  6 
1  11 2015-08-20 13:38:58 13 
2  11 2015-08-20 09:37:47  9 
3  11 2015-08-21 06:59:05  6 
4  11 2015-08-20 13:15:21 13 

# create boolean masks (hour 11 counts as 'before midday', as in the question)
morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
aftermidday = (basket_times['hour'] >= 12) & (basket_times['hour'] < 15)
print(morning)
0  True 
1 False 
2 False 
3  True 
4 False 
Name: hour, dtype: bool 

print(beforemidday)
0 False 
1 False 
2  True 
3 False 
4 False 
Name: hour, dtype: bool 
print(aftermidday)
0 False 
1  True 
2 False 
3 False 
4  True 
Name: hour, dtype: bool 
basket_times.loc[morning, 'periode'] = 'morning' 
basket_times.loc[beforemidday, 'periode'] = 'before midday' 
basket_times.loc[aftermidday, 'periode'] = 'after midday' 
print(basket_times)
    user_id   timestamp hour  periode 
0  11 2015-08-21 06:42:44  6  morning 
1  11 2015-08-20 13:38:58 13 after midday 
2  11 2015-08-20 09:37:47  9 before midday 
3  11 2015-08-21 06:59:05  6  morning 
4  11 2015-08-20 13:15:21 13 after midday 
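The same masks can also be applied in a single call with numpy.select, which is a different technique from the loc assignments above but rests on the same boolean conditions. This is only a sketch: the 'unknown' default label and the upper bound of 15 for 'after midday' are assumptions, since the question does not spell out the remaining ranges.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same hours as the test frame above.
basket_times = pd.DataFrame({'hour': [6, 13, 9, 6, 13]})
hour = basket_times['hour']

# One boolean condition per label, evaluated on the whole column at once.
conditions = [
    (hour >= 4) & (hour < 8),
    (hour >= 8) & (hour < 12),
    (hour >= 12) & (hour < 15),
]
labels = ['morning', 'before midday', 'after midday']

# Rows matching none of the conditions get the assumed default label.
basket_times['periode'] = np.select(conditions, labels, default='unknown')
print(basket_times)
```

With six or more periods this keeps the conditions and labels side by side in two lists instead of one loc line per period.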

Timings - len(df) = 500k

In [87]: %timeit a(df) 
10 loops, best of 3: 34 ms per loop 

In [88]: %timeit b(df1) 
1 loops, best of 3: 490 ms per loop 
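The %timeit lines are IPython magics. Outside IPython, a comparable measurement could be sketched with the standard-library timeit module. The function names label_with_masks and label_with_map are hypothetical stand-ins for the two approaches, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up test data: 500k rows whose hours all fall in a labelled range.
df = pd.DataFrame({'hour': np.tile([6, 10, 9, 6, 10], 100000)})

def label_with_masks(frame):
    """Vectorised labelling with boolean masks and loc."""
    morning = (frame['hour'] >= 4) & (frame['hour'] < 8)
    beforemidday = (frame['hour'] >= 8) & (frame['hour'] < 12)
    frame.loc[morning, 'periode'] = 'morning'
    frame.loc[beforemidday, 'periode'] = 'before midday'
    return frame

def label_with_map(frame):
    """Per-element labelling with a Python function and map."""
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'
    frame['periode'] = frame['hour'].map(get_periode)
    return frame

# Each run works on a fresh copy, so the copy cost is included in both timings.
t_masks = timeit.timeit(lambda: label_with_masks(df.copy()), number=3) / 3
t_map = timeit.timeit(lambda: label_with_map(df.copy()), number=3) / 3
print('masks: %.3f s per run, map: %.3f s per run' % (t_masks, t_map))
```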

Code for testing:

import pandas as pd 
import io 

temp=u"""user_id;timestamp;hour 
11;2015-08-21 06:42:44;6 
11;2015-08-20 10:38:58;10 
11;2015-08-20 09:37:47;9 
11;2015-08-21 06:59:05;6 
11;2015-08-20 10:15:21;10""" 
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None, parse_dates=[1])
df = pd.concat([df]*100000).reset_index(drop=True)
print(df.shape)
# (500000, 3)
df1 = df.copy() 

# approach a: boolean masks with loc
def a(basket_times):
    morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
    beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
    basket_times.loc[morning, 'periode'] = 'morning'
    basket_times.loc[beforemidday, 'periode'] = 'before midday'
    return basket_times

# approach b: Python function applied with map
def b(basket_times):
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'

    basket_times['periode'] = basket_times['hour'].map(get_periode)
    return basket_times

print(a(df))
print(b(df1))
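Because the periods are contiguous hour ranges, binning with pandas.cut is another option worth trying. This is a swapped-in technique, not what either answer above uses, and the bin edges below (in particular the upper edge of 15) are assumptions based on the ranges shown in this thread.

```python
import pandas as pd

# Made-up sample; hour 2 lies outside all bins on purpose.
basket_times = pd.DataFrame({'hour': [6, 13, 9, 6, 2]})

# right=False makes each bin closed on the left: [4, 8), [8, 12), [12, 15).
bins = [4, 8, 12, 15]
labels = ['morning', 'before midday', 'after midday']

basket_times['periode'] = pd.cut(
    basket_times['hour'], bins=bins, labels=labels, right=False
)
print(basket_times)
```

The result is a categorical column, and hours outside every bin become NaN, so the full set of six periods would need to be covered by the bin edges.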