2017-03-12 47 views

Python/Pandas: How to consolidate repeated rows with NaN in different columns?

There must be a better way to do this, please help me.

Below is an excerpt of some data I have to clean. It contains several kinds of "repeated" rows (not all rows are repeated):

df =

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ... 
-------+------------+------------+-------------+--------------+----- 
    100 | ABC  | Paid  |   NaN |  34200 | 
    100 | ABC  | Paid  |   724 |  34200 | 
    200 | DEF  | Write Off |   611 |   9800 | 
    200 | DEF  | Write Off |   611 |   NaN | 
    300 | GHI  | Paid  |   NaN |  247112 | 
    300 | GHI  | Paid  |   799 |   NaN | 
    400 | JKL  | Paid  |   NaN |   NaN | 
    500 | MNO  | Paid  |   444 |   NaN | 

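So I have the following kinds of repeated cases:

  1. A NaN and a valid value in the column CreditScore (LoanID = 100)
  2. A NaN and a valid value in the column AnnualIncome (LoanID = 200)
  3. A NaN and a valid value in the column CreditScore, and a NaN and a valid value in the column AnnualIncome (LoanID = 300)
  4. LoanID 400 and 500 are "normal" cases

What I want, obviously, is a DataFrame without the duplicates, like this: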

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ... 
-------+------------+------------+-------------+--------------+----- 
    100 | ABC  | Paid  |   724 |  34200 | 
    200 | DEF  | Write Off |   611 |   9800 | 
    300 | GHI  | Paid  |   799 |  247112 | 
    400 | JKL  | Paid  |   NaN |   NaN | 
    500 | MNO  | Paid  |   444 |   NaN | 

This is how I solved it:

# Get the repeated keys (LoanIDs that appear more than once):
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]

# Now we get the valid number (we overwrite the NaNs)
for i in rep.keys():
    df.loc[df['LoanID'] == i, 'CreditScore'] = df[df['LoanID'] == i]['CreditScore'].max()
    df.loc[df['LoanID'] == i, 'AnnualIncome'] = df[df['LoanID'] == i]['AnnualIncome'].max()

# Drop duplicates
df.drop_duplicates(inplace=True)

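This works and does exactly what I need. The problem is that this DataFrame has several hundred thousand records, so this method takes "forever". There must be some way to do this better, right?

Answers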


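Grouping by LoanID, filling the missing values forward and backward within each group, and then dropping duplicates seems to work: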

df.groupby('LoanID').apply(lambda x: \
          x.ffill().\
          bfill().\
          drop_duplicates()).\
        reset_index(drop=True).\
        set_index('LoanID')
#  CustomerID LoanStatus CreditScore AnnualIncome 
#LoanID                
#100   ABC  Paid  724.0  34200.0  
#200   DEF Write Off  611.0  9800.0  
#300   GHI  Paid  799.0  247112.0  
#400   JKL  Paid   NaN   NaN  
#500   MNO  Paid  444.0   NaN  
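
With several hundred thousand rows, the per-group lambda above may still be slow. As a possible alternative (not part of the original answer), here is a minimal sketch using `GroupBy.first()`, which takes the first non-null value of each column within a group. The DataFrame below is rebuilt by hand from the question's table, and the name `result` is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan, 247112, np.nan,
                     np.nan, np.nan],
})

# first() skips NaN, so each LoanID collapses to one row holding the first
# valid value found in every column (NaN only if the whole group is NaN).
result = df.groupby("LoanID").first()
print(result)
```

One difference to keep in mind: if a LoanID has two different valid values in the same column, `first()` silently keeps only the first one, whereas the fill-and-drop-duplicates approach would leave both rows in place. This sketch assumes that repeated rows differ only by NaN, as in the excerpt.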