Why doesn't this implementation of a multiprocessing pool work?
Here is the code I am using:
import pandas as pd
import sys, multiprocessing

train_data_file = '/home/simon/ali_bigdata/train_data_user_2.0.csv'
user_list_file = '/home/simon/ali_bigdata/user_list.txt'

def feature_extract(list_file, feature_extract_func):
    tmp_list = [line.strip() for line in open(list_file)]
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    results_list = pool.map(feature_extract_func, tmp_list)
    for tmp in results_list:
        for i in tmp:
            print i,"\t",
        print "\n"
    pool.close()
    pool.join()

def user_feature(tmp_user_id):
    sys.stderr.write("process user " + tmp_user_id + " ...\n")
    try:
        tmp_user_df = df_user.loc[int(tmp_user_id)]
    except KeyError:
        return [tmp_user_id, 0, 0, 0.0]
    else:
        if type(tmp_user_df) == pd.core.series.Series:
            tmp_user_click = 1
        else:
            (tmp_user_click, suck) = tmp_user_df.shape
        tmp_user_buy_df = tmp_user_df.loc[tmp_user_df['behavior_type'] == 4]
        if type(tmp_user_buy_df) == pd.core.frame.DataFrame:
            tmp_user_buy = 1
        else:
            (tmp_user_buy, suck) = tmp_user_buy_df.shape
        return [tmp_user_id, tmp_user_click, tmp_user_buy, 0.0 if tmp_user_click == 0 else float(tmp_user_buy)/tmp_user_click]

df = pd.read_csv(train_data_file, header=0)
df_user = df.set_index(['user_id'])
feature_extract(user_list_file, user_feature)
The error I get is:
process user 102761946 ...
process user 110858443 ...
process user 131681429 ...
Traceback (most recent call last):
  File "extract_feature_2.0.py", line 53, in <module>
    feature_extract(user_list_file, user_feature)
  File "extract_feature_2.0.py", line 13, in feature_extract
    results_list = pool.map(feature_extract_func, tmp_list)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
KeyError: 'the label [False] is not in the [index]'
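It happens after the program has been running for a while.
So what does this error mean, and how can I run this map function with multiprocessing?
Here is the input data format: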
user_id,item_id,behavior_type,user_geohash,item_category,date,time
99512554,37320317,3,94gn6nd,9232,2014-11-26,20
9909811,266982489,1,,3475,2014-12-02,23
98692568,27121464,1,94h63np,5201,2014-11-19,13
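On Python 2.7, `pool.map` re-raises the worker's exception in the parent process, which is why the traceback above ends in `pool.py` and never shows the failing line inside `user_feature`. One way to find the real location is to catch the exception inside the worker and send back the formatted traceback. The sketch below is written for Python 3 and assumes a hypothetical `risky_worker` in place of `user_feature`; `traced` and `run_pool` are illustrative names too, not part of the original script.

```python
import multiprocessing
import traceback


def risky_worker(item):
    # Hypothetical stand-in for user_feature: fails on one particular input.
    if item == "bad":
        raise KeyError("the label [False] is not in the [index]")
    return [item, len(item)]


def traced(item):
    # Run the worker and return (item, result, traceback_text).
    # Exactly one of result / traceback_text is None.
    try:
        return (item, risky_worker(item), None)
    except Exception:
        # format_exc() is evaluated inside the worker process, so the text
        # names the line that actually raised.
        return (item, None, traceback.format_exc())


def run_pool(items, processes=2):
    # Map traced() over items in a pool and always release the workers.
    pool = multiprocessing.Pool(processes)
    try:
        return pool.map(traced, items)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    for item, result, tb in run_pool(["a", "bad", "ccc"]):
        if tb is None:
            print(item, result)
        else:
            print("worker failed on", repr(item))
            print(tb)
```

Running the same function over the list in a plain `for` loop, without the pool, is another quick way to get an ordinary traceback and the offending input.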
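With the help of the debugging approach, I finally fixed it! Thanks :) – 2015-04-07 07:04:37
@simon_xia: What exactly was the problem? – mhawke 2015-04-07 07:37:58
@mhawke It was caused by the return value of `df_user.loc[int(tmp_user_id)]`, which can be a Series when only one row matches. In that case the statement `tmp_user_buy_df = tmp_user_df.loc[tmp_user_df['behavior_type'] == 4]` breaks. – 2015-04-11 01:11:35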
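The cause described in the last comment can be reproduced on a tiny frame. With a scalar label, `.loc` returns a DataFrame when several rows match but a Series when exactly one row matches. On that Series, `tmp_user_df['behavior_type'] == 4` is a single boolean, so `.loc` treats `False` as a label to look up, which would explain the `KeyError` about the label `[False]`. A minimal sketch of one possible fix, assuming toy data in the question's column layout: passing a list of labels to `.loc` keeps the result a DataFrame. The rewritten `user_feature` below is an illustration in Python 3, not the author's final code.

```python
import pandas as pd

# Toy data in the question's layout; user 9909811 has exactly one row.
df = pd.DataFrame({
    "user_id": [99512554, 99512554, 9909811],
    "item_id": [37320317, 27121464, 266982489],
    "behavior_type": [3, 4, 1],
})
df_user = df.set_index("user_id")

many = df_user.loc[99512554]        # two matching rows: DataFrame
single = df_user.loc[9909811]       # one matching row: Series
always_df = df_user.loc[[9909811]]  # list of labels: DataFrame, even for one row


def user_feature(user_id):
    # Return [user_id, row_count, buy_count, buy_ratio] for one user.
    key = int(user_id)
    if key not in df_user.index:
        return [user_id, 0, 0, 0.0]
    rows = df_user.loc[[key]]  # always a DataFrame
    clicks = len(rows)         # the question counts every row as a click
    buys = int((rows["behavior_type"] == 4).sum())
    return [user_id, clicks, buys, float(buys) / clicks]


if __name__ == "__main__":
    print(type(many).__name__, type(single).__name__, type(always_df).__name__)
    print(user_feature("99512554"))
    print(user_feature("9909811"))
```

Because the result is always a DataFrame, the `type(...) ==` checks from the question are no longer needed.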