2016-11-25 89 views

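Pandas: using fillna with a DataFrame as the value parameter

I have a pandas DataFrame with program, dataset, algorithm, and result fields, where result is the runtime of a program run with a particular algorithm on a particular dataset. Some results are missing. I want to fill each missing result with the result that a reference program, program-A, produced for the same dataset and algorithm.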

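I'm happy to take any suggestions on how to improve my code. My specific question, though, is why I can't pass a DataFrame as the value parameter of fillna and instead have to convert it to a dict. (The documentation says value : scalar, dict, Series, or DataFrame.)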

import numpy
import pandas

col = ['program', 'dataset', 'algorithm', 'result'] 
df = pandas.DataFrame(
    [['program-A', 'dataset-X', 'algorithm-i', 1], 
    ['program-A', 'dataset-X', 'algorithm-j', 2], 
    ['program-A', 'dataset-Y', 'algorithm-i', 3], 
    ['program-A', 'dataset-Y', 'algorithm-j', 4], 
    ['program-B', 'dataset-X', 'algorithm-j', numpy.nan] 
    ], columns=col) 

df['algorithm_dataset'] = df['algorithm'] + "_" + df['dataset'] 

# build a dict from {algorithm+dataset} to result 
dfg = df.loc[df['program'] == 'program-A'][['algorithm_dataset', 
              'result']] 
dfg = dfg.set_index('algorithm_dataset') 
dfg_dict = dfg.to_dict()['result'] 

df = df.set_index('algorithm_dataset') 
# df['result'] = df['result'].fillna(value=dfg) 
# what's above doesn't work: 
# ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'> 
# so instead: 
df['result'] = df['result'].fillna(value=dfg_dict) 
df = df.reset_index() 

print(df) 

Versions:

$ port installed | grep pandas 
    py27-pandas @0.19.1_0 (active) 
$ python --version 
Python 2.7.12 

Answer


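If you need fillna on a single column (a Series), you can pass a Series instead of a dict. The snippets below start from df and dfg as they are before set_index is called on them: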

ser = dfg.set_index('algorithm_dataset')['result'] 
print (ser) 
algorithm_dataset 
algorithm-i_dataset-X 1.0 
algorithm-j_dataset-X 2.0 
algorithm-i_dataset-Y 3.0 
algorithm-j_dataset-Y 4.0 
Name: result, dtype: float64 

df = df.set_index('algorithm_dataset') 
df['result1'] = df['result'].fillna(value=ser) 
print (df) 
         program dataset algorithm result result1 
algorithm_dataset               
algorithm-i_dataset-X program-A dataset-X algorithm-i  1.0  1.0 
algorithm-j_dataset-X program-A dataset-X algorithm-j  2.0  2.0 
algorithm-i_dataset-Y program-A dataset-Y algorithm-i  3.0  3.0 
algorithm-j_dataset-Y program-A dataset-Y algorithm-j  4.0  4.0 
algorithm-j_dataset-X program-B dataset-X algorithm-j  NaN  2.0 

df['result'] = df['result'].fillna(value=ser) 
print (df) 
         program dataset algorithm result 
algorithm_dataset            
algorithm-i_dataset-X program-A dataset-X algorithm-i  1.0 
algorithm-j_dataset-X program-A dataset-X algorithm-j  2.0 
algorithm-i_dataset-Y program-A dataset-Y algorithm-i  3.0 
algorithm-j_dataset-Y program-A dataset-Y algorithm-j  4.0 
algorithm-j_dataset-X program-B dataset-X algorithm-j  2.0 
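
As for why the original call failed: df['result'] is a Series, and Series.fillna takes a scalar, dict, or Series as its fill value, while a DataFrame fill value is only accepted by DataFrame.fillna. The following is a minimal sketch of that behaviour, not taken from the original post. The names target and reference and the key-i/key-j labels are made up for illustration, and the exact exception type is an assumption that varies by version (the question reports ValueError on pandas 0.19, and newer releases appear to raise TypeError), so the sketch catches both.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data: one label per row.
target = pd.Series([1.0, np.nan], index=["key-i", "key-j"], name="result")
reference = pd.DataFrame({"result": [10.0, 20.0]}, index=["key-i", "key-j"])

# A Series does not accept a DataFrame as its fill value.
try:
    target.fillna(reference)
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Selecting the column gives a Series, which fillna aligns on index labels.
filled = target.fillna(reference["result"])

print(rejected)
print(filled)
```

Only the missing entry is replaced. Values that are already present in target are left untouched, whatever reference holds for the same label.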

If you need fillna on a DataFrame, you first have to create another DataFrame with the same index and the same columns. Then it works:

dfg = df.loc[df['program'] == 'program-A'][['algorithm_dataset', 
              'result']] 

dfg = dfg.set_index('algorithm_dataset')['result'].to_frame() 
print (dfg) 
         result 
algorithm_dataset    
algorithm-i_dataset-X  1.0 
algorithm-j_dataset-X  2.0 
algorithm-i_dataset-Y  3.0 
algorithm-j_dataset-Y  4.0 

df = df.set_index('algorithm_dataset') 
df = df.drop(['program','dataset','algorithm'], axis=1) 
print (df) 
         result 
algorithm_dataset    
algorithm-i_dataset-X  1.0 
algorithm-j_dataset-X  2.0 
algorithm-i_dataset-Y  3.0 
algorithm-j_dataset-Y  4.0 
algorithm-j_dataset-X  NaN 

dfg = dfg.reindex(df.index) 
print (dfg) 
         result 
algorithm_dataset    
algorithm-i_dataset-X  1.0 
algorithm-j_dataset-X  2.0 
algorithm-i_dataset-Y  3.0 
algorithm-j_dataset-Y  4.0 
algorithm-j_dataset-X  2.0 

df = df.fillna(dfg) 
print (df) 
         result 
algorithm_dataset    
algorithm-i_dataset-X  1.0 
algorithm-j_dataset-X  2.0 
algorithm-i_dataset-Y  3.0 
algorithm-j_dataset-Y  4.0 
algorithm-j_dataset-X  2.0 
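
Since the question also asks for general suggestions, here is one possible alternative that avoids both the concatenated algorithm_dataset key and the index juggling. It is a different technique from the one shown in this answer: a left merge on the dataset and algorithm columns, followed by a plain Series.fillna. The column name result_ref is a hypothetical name chosen for this sketch, and the approach assumes program-A has at most one result per (dataset, algorithm) pair, since duplicates would multiply rows in the merge.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["program-A", "dataset-X", "algorithm-i", 1],
     ["program-A", "dataset-X", "algorithm-j", 2],
     ["program-A", "dataset-Y", "algorithm-i", 3],
     ["program-A", "dataset-Y", "algorithm-j", 4],
     ["program-B", "dataset-X", "algorithm-j", np.nan]],
    columns=["program", "dataset", "algorithm", "result"],
)

keys = ["dataset", "algorithm"]

# Reference results from program-A, one row per (dataset, algorithm) pair.
# "result_ref" is a hypothetical column name used only in this sketch.
ref = (df.loc[df["program"] == "program-A", keys + ["result"]]
         .rename(columns={"result": "result_ref"}))

# A left merge keeps every row of df and attaches the matching reference value.
merged = df.merge(ref, on=keys, how="left")

# Both columns live in the same frame, so they share one index and line up row by row.
merged["result"] = merged["result"].fillna(merged["result_ref"])
merged = merged.drop(columns="result_ref")

print(merged)
```

Rows whose (dataset, algorithm) pair has no program-A result simply keep their NaN, because the left merge leaves result_ref missing for them.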

Thank you for the detailed, helpful answer. – jowens