2017-02-15 72 views

How to add a new categorical column in pandas

I have a DataFrame like this, with 10M rows:

     probe 
time      
2016-01-01 00:05:00 3 
2016-01-01 00:05:00 1 
2016-01-01 00:05:00 5 
2016-01-01 00:05:00 5 
2016-01-01 00:05:00 4 
2016-01-01 00:05:00 2 
2016-01-01 00:05:00 5 
2016-01-01 00:05:00 6 
2016-01-01 00:05:00 3 
2016-01-01 00:05:00 4 
2016-01-01 00:05:00 5 
2016-01-01 00:05:00 2 
2016-01-01 00:05:00 3 
2016-01-01 00:05:00 3 
2016-01-01 00:05:00 5 
Name: probe, dtype: uint8 

I want to add a categorical column based on the value of probe:

def categorize_R(x): 
    return "inner" if x['probe'] in (1, 4) else "outer" 

data['category_R'] = pandas.Categorical(data.apply(categorize_R, axis=1)) 

This is very slow. Computing the mask directly, like this:

mask_inner = (data['probe'] == 1) | (data['probe'] == 4) 

is fairly fast, but I don't know how to add a column of categorical type.

Answer

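I think you need numpy.where, with between to create the mask (note that between(1, 4) is inclusive on both ends, so it selects probe values 1, 2, 3 and 4):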

import numpy as np
import pandas as pd

mask = data.probe.between(1,4) 
data['category_R'] = pd.Categorical(np.where(mask, 'inner', 'outer')) 
print (data) 
        probe category_R 
time         
2016-01-01 00:05:00  3  inner 
2016-01-01 00:05:00  1  inner 
2016-01-01 00:05:00  5  outer 
2016-01-01 00:05:00  5  outer 
2016-01-01 00:05:00  4  inner 
2016-01-01 00:05:00  2  inner 
2016-01-01 00:05:00  5  outer 
2016-01-01 00:05:00  6  outer 
2016-01-01 00:05:00  3  inner 
2016-01-01 00:05:00  4  inner 
2016-01-01 00:05:00  5  outer 
2016-01-01 00:05:00  2  inner 
2016-01-01 00:05:00  3  inner 
2016-01-01 00:05:00  3  inner 
2016-01-01 00:05:00  5  outer 
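
Because between(1, 4) selects the whole range from 1 to 4, a membership test may be closer to what the question asks for. A minimal sketch, assuming the mask should be true only for probe values 1 and 4; the small sample frame and the use of Series.isin are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

# Small sample frame standing in for the 10M-row data (values are illustrative).
data = pd.DataFrame({'probe': np.array([3, 1, 5, 5, 4, 2], dtype='uint8')})

# Vectorised membership test: True only where probe is exactly 1 or 4.
mask = data['probe'].isin([1, 4])

# np.where builds an array of string labels, which Categorical then encodes.
data['category_R'] = pd.Categorical(np.where(mask, 'inner', 'outer'))
```

For a larger set of values, such as 1, 3 and 4, only the list passed to isin needs to change.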

Another solution is to use Categorical.from_codes; also check the docs on object creation - In [28]:

mask = (data['probe']==1) | (data['probe']==3) | (data['probe']==4) 
data['category_R'] = pd.Categorical(np.where(mask, 'inner', 'outer')) 
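# from_codes expects integer codes (0 = 'outer', 1 = 'inner'); recent pandas rejects a boolean mask
data['category_R1'] = pd.Categorical.from_codes(mask.astype('int8'), ['outer','inner']) 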
print (data) 
        probe category_R category_R1 
time            
2016-01-01 00:05:00  3  inner  inner 
2016-01-01 00:05:00  1  inner  inner 
2016-01-01 00:05:00  5  outer  outer 
2016-01-01 00:05:00  5  outer  outer 
2016-01-01 00:05:00  4  inner  inner 
2016-01-01 00:05:00  2  outer  outer 
2016-01-01 00:05:00  5  outer  outer 
2016-01-01 00:05:00  6  outer  outer 
2016-01-01 00:05:00  3  inner  inner 
2016-01-01 00:05:00  4  inner  inner 
2016-01-01 00:05:00  5  outer  outer 
2016-01-01 00:05:00  2  outer  outer 
2016-01-01 00:05:00  3  inner  inner 
2016-01-01 00:05:00  3  inner  inner 
2016-01-01 00:05:00  5  outer  outer 
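
The codes passed to from_codes are positions in the category list, so a boolean mask maps naturally to 0 ('outer') and 1 ('inner'). A minimal sketch on an illustrative sample frame, which casts the mask to integers (an assumption made here because recent pandas versions require integer codes) and checks that both approaches yield the same labels:

```python
import numpy as np
import pandas as pd

# Illustrative sample; the real data has 10M rows.
data = pd.DataFrame({'probe': np.array([3, 1, 5, 4, 2, 6], dtype='uint8')})
mask = data['probe'].isin([1, 3, 4])

# Approach 1: build string labels first, then let Categorical encode them.
via_where = pd.Categorical(np.where(mask, 'inner', 'outer'))

# Approach 2: skip the strings and pass integer codes directly.
# Code 0 refers to categories[0] ('outer'), code 1 to categories[1] ('inner').
codes = mask.to_numpy().astype('int8')
via_codes = pd.Categorical.from_codes(codes, categories=['outer', 'inner'])

# Both approaches should produce the same label for every row.
same_labels = list(via_where) == list(via_codes)
```

The two results order their categories differently (via_where sorts them alphabetically, via_codes keeps the order given), which matters only if the underlying codes are used later.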

Timings

In [181]: %timeit pd.Categorical(np.where(mask, 'inner', 'outer')) 
1000 loops, best of 3: 196 µs per loop 

In [182]: %timeit pd.Categorical.from_codes(mask, ['outer','inner']) 
10000 loops, best of 3: 139 µs per loop 
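
The timings above come from an IPython session. To repeat the comparison outside IPython on data closer in size to the 10M-row case, a rough sketch using the standard timeit module may help; time_both, the row count and the random data are hypothetical, and the results will vary with the machine and the pandas version:

```python
import timeit

import numpy as np
import pandas as pd


def time_both(n_rows=1_000_000, repeats=3):
    """Return best-of-`repeats` seconds for each approach on random data."""
    rng = np.random.default_rng(0)
    # Random probe values in 1..6, mimicking the uint8 column in the question.
    probe = pd.Series(rng.integers(1, 7, size=n_rows, dtype='uint8'))
    mask = probe.isin([1, 3, 4]).to_numpy()
    # The integer cast is done once up front and is not included in the timing.
    codes = mask.astype('int8')

    t_where = min(timeit.repeat(
        lambda: pd.Categorical(np.where(mask, 'inner', 'outer')),
        number=1, repeat=repeats))
    t_codes = min(timeit.repeat(
        lambda: pd.Categorical.from_codes(codes, categories=['outer', 'inner']),
        number=1, repeat=repeats))
    return t_where, t_codes
```

Calling time_both() returns a pair of durations in seconds, the first for the numpy.where approach and the second for from_codes.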

We're close. The thing is, I need to do something more complex, like `(x['probe'] == 1) | (x['probe'] == 3) | (x['probe'] == 4)` –

Also: avoid creating an intermediate series of strings and create the categorical directly. –

`numpy.where` is very fast and its output is a numpy array; you can also change the mask as needed - check the last edit. – jezrael