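Optimizing pandas code: replacing 'iterrows' and other ideas

I'm new to pandas. I wrote some code that I would like to optimize, but I don't know how. I know that 'apply' and pandas vectorization are faster than 'iterrows', but I don't know how to use them to achieve the same result. 'iterrows' behaves much like a 'for' loop, so it was easy for me to pick up and I got used to it. Here is my code: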
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from scipy.spatial.distance import euclidean
data = pd.read_csv(r'C:\temp\train.txt')
def group_df(df, num):
    # Label every row with a fold number 0..num-1, in contiguous blocks
    ln = len(df)
    rang = np.arange(ln)
    splt = np.array_split(rang, num)
    lst = []
    finel_lst = []
    for i, x in enumerate(splt):
        lst.append([i for x in range(len(x))])
    for k in lst:
        for j in k:
            finel_lst.append(j)
    df['group'] = finel_lst
    return df
def KNN(dafra, folds, K, fi, target):
    df = group_df(dafra, folds)
    avarge_e = []
    for i in range(folds):
        # Fold i is the test set, all other folds form the training set
        train = pd.DataFrame(df.loc[df['group'] != i])
        test = pd.DataFrame(df.loc[df['group'] == i])
        test.loc[:, 'pred_price'] = np.nan
        test.loc[:, 'rmse'] = np.nan
        train.loc[:, 'dis'] = np.nan
        train = train.reset_index()
        test = test.reset_index()
        for index, row in test.iterrows():
            for index2, row2 in train.iterrows():
                # Single .loc call: train.loc[index2]['dis'] = ... would write to a copy
                train.loc[index2, 'dis'] = euclidean(row2[fi], row[fi])
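As you can see, there are two nested 'iterrows' loops, plus one small 'for' loop at the top. The idea of this code is to compute the Euclidean distance between each row of the test set and each row of the training set. Because the test fold changes on each pass of the outer 'for' loop, the distances end up being computed for every row of the original DataFrame.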
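One possible way to replace the two nested 'iterrows' loops is to compute the whole test-by-train distance matrix in a single call. The sketch below uses scipy's cdist for that. The helper name pairwise_distances and the toy values are made up for illustration, and it assumes every column in fi is numeric with no missing values (a column such as LotFrontage may contain NaN in the real data and would need handling first).

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def pairwise_distances(test, train, fi):
    """Euclidean distances: one row per test row, one column per train row.

    Hypothetical helper, not part of the original code.
    Assumes the columns listed in `fi` are numeric and contain no NaN.
    """
    dist = cdist(
        test[fi].to_numpy(dtype=float),
        train[fi].to_numpy(dtype=float),
        metric="euclidean",
    )
    return pd.DataFrame(dist, index=test.index, columns=train.index)


# Toy data with invented values, only to show the shapes involved
train = pd.DataFrame({"LotArea": [0.0, 3.0], "LotFrontage": [0.0, 4.0]})
test = pd.DataFrame({"LotArea": [0.0], "LotFrontage": [0.0]})
fi = ["LotArea", "LotFrontage"]

dist = pairwise_distances(test, train, fi)

# Positions of the K nearest training rows for every test row
K = 1
nearest = np.argsort(dist.to_numpy(), axis=1)[:, :K]
```

Here dist has one row per test row, so the K nearest neighbours of every test row can be read off with a single argsort instead of overwriting a 'dis' column inside the loop.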
Here is the beginning of the data:
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1

  LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature MiscVal  \
0         Lvl    AllPub  ...         0    NaN   NaN         NaN       0
1         Lvl    AllPub  ...         0    NaN   NaN         NaN       0
2         Lvl    AllPub  ...         0    NaN   NaN         NaN       0

   MoSold  YrSold SaleType SaleCondition  SalePrice
0       2    2008       WD        Normal     208500
1       5    2007       WD        Normal     181500
2       9    2008       WD        Normal     223500

[3 rows x 81 columns]
Any ideas for optimizing this code would be welcome. Thank you.
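For the small 'for' loop at the top, one option is to build the fold labels with np.repeat instead of nested list appends. This is only a sketch: add_fold_labels is a hypothetical name, the demo frame is invented, and unlike group_df it returns a copy rather than modifying the DataFrame that was passed in.

```python
import numpy as np
import pandas as pd


def add_fold_labels(df, num):
    """Return a copy of df with a 'group' column holding fold labels 0..num-1.

    Hypothetical replacement for group_df. The block sizes come from
    np.array_split, so the labels match the ones the loops produce.
    """
    sizes = [len(part) for part in np.array_split(np.arange(len(df)), num)]
    out = df.copy()
    out["group"] = np.repeat(np.arange(num), sizes)
    return out


# Invented demo frame: 10 rows split into 3 folds
demo = pd.DataFrame({"Id": range(1, 11)})
labelled = add_fold_labels(demo, 3)
```

With 10 rows and 3 folds, np.array_split yields blocks of 4, 3 and 3 rows, so the first four rows get label 0 and the remaining rows get labels 1 and 2 in blocks of three.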
Can you provide some test data?
You mean the first rows of my DataFrame?
Yes, and the result you want to achieve.