There must be a better way to do this, please help. Python/Pandas: how to consolidate duplicate rows with NaNs in different columns?
Below is an excerpt of some data I have to clean, which contains several kinds of "duplicate" rows (not all rows are duplicated):
DF =
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | NaN | 34200 |
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
200 | DEF | Write Off | 611 | NaN |
300 | GHI | Paid | NaN | 247112 |
300 | GHI | Paid | 799 | NaN |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
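For reproducibility, the table above can be constructed like this (a sketch with the trailing columns omitted):

```python
import numpy as np
import pandas as pd

# Sample data matching the excerpt above (extra columns dropped)
df = pd.DataFrame({
    'LoanID':       [100, 100, 200, 200, 300, 300, 400, 500],
    'CustomerID':   ['ABC', 'ABC', 'DEF', 'DEF', 'GHI', 'GHI', 'JKL', 'MNO'],
    'LoanStatus':   ['Paid', 'Paid', 'Write Off', 'Write Off',
                     'Paid', 'Paid', 'Paid', 'Paid'],
    'CreditScore':  [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    'AnnualIncome': [34200, 34200, 9800, np.nan, 247112, np.nan, np.nan, np.nan],
})
print(df)
```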
So I have the following kinds of duplicates:
- NaN and a valid value in column CreditScore (LoanID = 100)
- NaN and a valid value in column AnnualIncome (LoanID = 200)
- NaN and a valid value in CreditScore, plus NaN and a valid value in AnnualIncome (LoanID = 300)
- LoanID 400 and 500 are "normal", non-duplicated cases
So obviously what I want is a dataframe with no duplicates, like this one:
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
300 | GHI | Paid | 799 | 247112 |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
Here is how I solved it:
# Get the repeated keys (any LoanID appearing more than once):
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]
# Now overwrite the NaNs with the valid value for each repeated key
for i in rep.keys():
    df.loc[df['LoanID'] == i, 'CreditScore'] = df.loc[df['LoanID'] == i, 'CreditScore'].max()
    df.loc[df['LoanID'] == i, 'AnnualIncome'] = df.loc[df['LoanID'] == i, 'AnnualIncome'].max()
# Drop the now-identical duplicate rows
df.drop_duplicates(inplace=True)
This works and gives exactly what I need. The problem is that the dataframe has several hundred thousand records, so this approach takes "forever". There must be a better way to do this, right?
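For comparison, here is a vectorized sketch of the kind of thing I'm after (sample data rebuilt inline; this assumes LoanID is the only grouping key). `GroupBy.first()` returns the first non-null value per column within each group, so a single pass collapses all the NaN/value duplicate pairs without a Python-level loop:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'LoanID':       [100, 100, 200, 200, 300, 300, 400, 500],
    'CustomerID':   ['ABC', 'ABC', 'DEF', 'DEF', 'GHI', 'GHI', 'JKL', 'MNO'],
    'LoanStatus':   ['Paid', 'Paid', 'Write Off', 'Write Off',
                     'Paid', 'Paid', 'Paid', 'Paid'],
    'CreditScore':  [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    'AnnualIncome': [34200, 34200, 9800, np.nan, 247112, np.nan, np.nan, np.nan],
})

# first() skips NaN within each group, so each LoanID keeps the valid
# CreditScore/AnnualIncome where one exists, and NaN only where no row
# in the group has a value (e.g. LoanID 400).
deduped = df.groupby('LoanID', as_index=False).first()
print(deduped)
```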