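Vectorizing a pandas calculation
I am trying to compute group averages inside a cross-validation scheme, but this iterative approach is very slow because my DataFrame contains more than 1 million rows. Is it possible to vectorize this computation? Thanks.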
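import pandas as pd
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
from sklearn.model_selection import KFold

data = np.column_stack([np.arange(1, 101), np.random.randint(1, 11, 100), np.random.randint(1, 101, 100)])
df = pd.DataFrame(data, columns=['id', 'group', 'total'])
kf = KFold(n_splits=3, shuffle=True)
f = {'total': ['mean']}
df['fold'] = 0
df['group_average'] = 0.0  # float column, since it will hold means

for train_index, test_index in kf.split(df):
    # .ix was removed in pandas 1.0; .loc is equivalent here because df has the default RangeIndex
    df.loc[train_index, 'fold'] = 0
    df.loc[test_index, 'fold'] = 1
    # Mean of 'total' per group, computed on the training rows only
    aux = df.loc[df.fold == 0, :].groupby(['group'])
    aux2 = aux.agg(f)
    aux2.reset_index(inplace=True)
    aux2.columns = ['group', 'group_average']
    # Slow part: look up the group mean one test row at a time
    for i, row in df.loc[df.fold == 1, :].iterrows():
        new = aux2.loc[(aux2.group == row.group), 'group_average']
        if new.empty:
            new = 0  # group never appears in the training rows
        else:
            new = new.values[0]
        df.loc[i, 'group_average'] = new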
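Could you provide sample input and output data so that we can run your code? – Khris
@Khris Sorry, I edited the code; you should be able to run it now. –
I tried applying a lambda function, but it was even slower. – Khris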
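One possible way to remove the per-row loop, offered as a sketch rather than a definitive answer: keep the loop over the folds (only three iterations) and replace the inner `iterrows` lookup with `Series.map`, which looks up every test row's group mean in a single call. The function name `out_of_fold_group_mean` and the `n_splits` and `seed` parameters are made up for this illustration, and the code assumes the `group` and `total` columns from the question and a scikit-learn version that provides `sklearn.model_selection.KFold`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_group_mean(df, n_splits=3, seed=0):
    """For each row, return the mean of 'total' for its group,
    computed only on the training rows of the fold in which the row is held out.
    (Hypothetical helper; name and parameters are illustrative.)"""
    result = pd.Series(0.0, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_pos, test_pos in kf.split(df):
        train = df.iloc[train_pos]
        test = df.iloc[test_pos]
        # One groupby per fold: Series indexed by group, values are the means
        means = train.groupby('group')['total'].mean()
        # Vectorized lookup: groups missing from the training rows become NaN,
        # which is replaced by 0 to mirror the fallback in the question
        result.iloc[test_pos] = test['group'].map(means).fillna(0.0).to_numpy()
    return result


# Example data shaped like the DataFrame in the question
data = np.column_stack([np.arange(1, 101),
                        np.random.randint(1, 11, 100),
                        np.random.randint(1, 101, 100)])
df = pd.DataFrame(data, columns=['id', 'group', 'total'])
df['group_average'] = out_of_fold_group_mean(df)
```

Because the remaining work per fold is one `groupby` and one `map`, the cost should grow with the number of folds rather than with the number of rows processed in Python. Actual timings on a DataFrame with more than 1 million rows would need to be measured.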