2014-07-20 63 views

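Conditional NaN fill

I have a fairly large DataFrame with 11 columns in which I want to replace NaN values with zero and convert the non-null numbers to integers, but only in rows where another set of columns is not entirely NaN; rows where that other set is all NaN should be left alone. I do this in the following way, but with only 8,000 observations it takes a very long time to finish (although it does finish). I think it took close to 20 minutes: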

# lt: columns to clean up; ht: columns that decide whether a row is cleaned.
lt = ['lost_time_a', 'lost_time_b', 'lost_time_c', 'lost_time_d', 'lost_time_e', 'lost_time_f', 'lost_time_g',
      'lost_time_h', 'lost_time_i', 'lost_time_j', 'ttl']
ht = ['hour1', 'hour2', 'hour3', 'hour4', 'hour5', 'hour6', 'hour7', 'hour8', 'hour9', 'hour10', 'hour11',
      'hour12', 'hour13', 'hour14', 'hour15']

for row in FinalDF.index:
    # Only touch rows in which at least one hour column has a value.
    if not all([pd.isnull(FinalDF.loc[row, col]) for col in ht]):
        for Col_ in lt:
            val = FinalDF.loc[row, Col_]
            if pd.isnull(val):
                FinalDF.loc[row, Col_] = 0
            else:
                FinalDF.loc[row, Col_] = int(val)

Any help is appreciated.

Here is some test data for you folks:

import pandas as pd 
import numpy as np 
from numpy import nan as NA 
FinalDF = pd.DataFrame({'hour1' : [NA, NA, NA, 70, 60], 
        'hour2' : [100, 50, NA, 120, 100], 
        'hour3' : [120, 80, NA, 130, 100], 
        'hour4' : [140, 90, NA, 120, 70], 
        'hour5' : [130, 200, NA, NA, NA], 
        'hour6' : [NA, NA, NA, 70, 60], 
        'hour7' : [100, 50, NA, 120, 100], 
        'hour8' : [120, 80, NA, 130, 100], 
        'hour9' : [140, 90, NA, 120, 70],
        'hour10' :[130, 200, NA, NA, NA], 
        'hour11' : [NA, NA, NA, 70, 60], 
        'hour12' : [100, 50, NA, 120, 100], 
        'hour13' : [120, 80, NA, 130, 100], 
        'hour14' : [140, 90, NA, 120, 70], 
        'hour15' : [130, 200, NA, NA, NA], 
        'lost_time_a' : [NA, NA, NA, NA, NA], 
        'lost_time_b' : [NA, 1.0, NA, NA, 4.1], 
        'lost_time_c' : [NA, NA, NA, NA, 10.1], 
        'lost_time_d' : [1, 2.3, NA, NA, 1], 
        'lost_time_e' : [NA, NA, NA, NA, NA], 
        'lost_time_f' : [NA, 1.0, NA, NA, 4.1], 
        'lost_time_g' : [NA, NA, NA, NA, 10.1], 
        'lost_time_h' : [1, 2.3, NA, NA, 1], 
        'lost_time_i' : [NA, NA, NA, NA, NA], 
        'lost_time_j' : [NA, 1.0, NA, NA, 4.1], 
        'ttl'   : [NA, NA, NA, NA, NA]}) 

Partial output (the lost-time variables):

Out[18]: 
    lost_time_a lost_time_b lost_time_c lost_time_d lost_time_e 
0   0   0   0   1   0 
1   0   1   0   2   0 
2   NaN   NaN   NaN   NaN   NaN 
3   0   0   0   0   0 
4   0   4   10   1   0 

Can you make a self-contained example that people can copy and paste to test? – DSM


Added test data relevant to the posted code snippet. –

Answers

1

Untested, but I think this does what you want? cond is a boolean Series that is True when all the columns in ht are null.

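# Rows where every ht column is null; computed once, outside the loop.
cond = pd.isnull(FinalDF[ht]).all(axis=1)
for c in lt:
    # Leave all-null rows untouched; elsewhere fill NaN with 0 and truncate to int.
    FinalDF[c] = np.where(cond, FinalDF[c], FinalDF[c].fillna(0).astype(int))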
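
For what it's worth, the per-column loop can probably be collapsed into a single NumPy call by broadcasting the row mask across the columns. A minimal sketch, assuming a small made-up frame (the values and the names all_null, filled and values are illustrative, not from the question), and following the question's loop in leaving a row untouched when every ht column is null:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data.
df = pd.DataFrame({
    "hour1": [np.nan, 70.0, np.nan],
    "hour2": [100.0, 120.0, np.nan],
    "lost_time_a": [np.nan, np.nan, np.nan],
    "lost_time_b": [1.0, 4.1, np.nan],
})
ht = ["hour1", "hour2"]
lt = ["lost_time_a", "lost_time_b"]

# True for rows in which every ht column is null (only the last row here).
all_null = df[ht].isnull().all(axis=1)

# NaN -> 0 and truncation to int, computed for every row up front.
filled = df[lt].fillna(0).astype(int)

# Broadcast the row mask over the lt columns: keep the original values in
# all-null rows and take the filled values everywhere else.
values = np.where(all_null.to_numpy()[:, None], df[lt].to_numpy(), filled.to_numpy())
df[lt] = values
```

Because the untouched rows still hold NaN, the lt columns stay floating point; the filled values are whole numbers stored as floats.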

2

I think this produces the same result as your code:

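def fix(df, ht, lt):
    # Work on a copy so the caller's frame is left unchanged.
    df = df.copy()
    # Rows in which at least one ht column is non-null.
    to_fix = ~df[ht].isnull().all(axis=1)
    # Fill NaN with 0 and truncate to int, only in those rows and the lt columns.
    df.loc[to_fix, lt] = df.loc[to_fix, lt].fillna(0).astype(int)
    return df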

(Obviously, you can drop the copy if you are happy to modify the frame in place.)
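
As a rough illustration of that in-place option, here is a sketch that assumes a hypothetical helper name, fix_inplace, and a made-up three-row frame; neither appears in the question:

```python
import numpy as np
import pandas as pd

def fix_inplace(df, ht, lt):
    """Hypothetical in-place variant of fix(): modifies df, returns None."""
    # Rows in which at least one ht column is non-null.
    rows = ~df[ht].isnull().all(axis=1)
    df.loc[rows, lt] = df.loc[rows, lt].fillna(0).astype(int)

# Made-up data: only the middle row has a non-null hour value.
demo = pd.DataFrame({
    "hour1": [np.nan, 70.0, np.nan],
    "lost_time_a": [2.3, np.nan, np.nan],
})
fix_inplace(demo, ["hour1"], ["lost_time_a"])
```

Only the middle row is rewritten here; the first row keeps its 2.3 because its hour value is null, just as the loop in the question would leave it.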

>>> df.iloc[:,-5:] 
    lost_time_g lost_time_h lost_time_i lost_time_j ttl 
0   NaN   1.0   NaN   NaN NaN 
1   NaN   2.3   NaN   1.0 NaN 
2   NaN   NaN   NaN   NaN NaN 
3   NaN   NaN   NaN   NaN NaN 
4   10.1   1.0   NaN   4.1 NaN 
>>> fix(df, ht, lt).iloc[:, -5:] 
    lost_time_g lost_time_h lost_time_i lost_time_j ttl 
0   0   1   0   0 0 
1   0   2   0   1 0 
2   NaN   NaN   NaN   NaN NaN 
3   0   0   0   0 0 
4   10   1   0   4 0 
>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(orig(df, ht, lt), fix(df, ht, lt)) 
>>>
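
The session above never defines orig; presumably it is the question's loop wrapped in a function. A self-contained sketch of the same check under that assumption, using a made-up four-row frame and assert_frame_equal from pandas.testing:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

def orig(df, ht, lt):
    """Assumed wrapper: the question's row-by-row loop, applied to a copy."""
    df = df.copy()
    for row in df.index:
        # Skip rows in which every ht column is null.
        if not df.loc[row, ht].isnull().all():
            for col in lt:
                val = df.loc[row, col]
                df.loc[row, col] = 0 if pd.isnull(val) else int(val)
    return df

def fix(df, ht, lt):
    """Vectorised version, as in the answer above."""
    df = df.copy()
    rows = ~df[ht].isnull().all(axis=1)
    df.loc[rows, lt] = df.loc[rows, lt].fillna(0).astype(int)
    return df

# Made-up data; the third row has no hour values and must stay NaN.
sample = pd.DataFrame({
    "hour1": [np.nan, 50.0, np.nan, 60.0],
    "hour2": [100.0, np.nan, np.nan, 100.0],
    "lost_time_a": [np.nan, 1.0, np.nan, 4.1],
    "lost_time_b": [1.0, 2.3, np.nan, np.nan],
})
ht = ["hour1", "hour2"]
lt = ["lost_time_a", "lost_time_b"]

# Raises AssertionError if the two approaches disagree.
assert_frame_equal(orig(sample, ht, lt), fix(sample, ht, lt))
```

The loop makes one scalar .loc lookup and assignment per cell, while fix does a single masked assignment over all the lt columns at once.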