Merge and update DataFrames based on a subset of their columns

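I would like to know whether there is a faster way to replace the two for loops below, assuming the DataFrames are very large. In my real case, each DataFrame has 200 rows and 25 columns.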

import numpy as np
import pandas as pd

# Note: np.array stores every value as a string here, because the rows mix str and int.
data_df1 = np.array([['Name','Unit','Attribute','Date'],['a','A',1,2014],['b','B',2,2015],['c','C',3,2016],
       ['d','D',4,2017],['e','E',5,2018]])
data_df2 = np.array([['Name','Unit','Date'],['a','F',2019],['b','G',2020],['e','H',2021],
       ['f','I',2022]])
df1 = pd.DataFrame(data=data_df1)
print('df1:')
print(df1)
df2 = pd.DataFrame(data=data_df2)
print('df2:')
print(df2)
# Rows/columns of df1 to overwrite, and the rows/columns of df2 to copy from
row_df1 = [1,2,5]
col_df1 = [1,3]
row_df2 = [1,2,3]
col_df2 = [1,2]
for i in range(len(row_df1)):
    for j in range(len(col_df1)):
        # DataFrame.set_value was removed in pandas 1.0; .at is the scalar setter
        df1.at[row_df1[i], col_df1[j]] = df2.loc[row_df2[i], col_df2[j]]
print('df1 after operation:')
print(df1)

Expected output:

df1: 
     0  1   2  3 
0 Name Unit Attribute Date 
1  a  A   1 2014 
2  b  B   2 2015 
3  c  C   3 2016 
4  d  D   4 2017 
5  e  E   5 2018 
df2: 
     0  1  2 
0 Name Unit Date 
1  a  F 2019 
2  b  G 2020 
3  e  H 2021 
4  f  I 2022 
df1 after operation: 
     0  1   2  3 
0 Name Unit Attribute Date 
1  a  F   1 2019 
2  b  G   2 2020 
3  c  C   3 2016 
4  d  D   4 2017 
5  e  H   5 2021

I tried:

df1.loc[[1,2,5],[1,3]] = df2.loc[[1,2,3],[1,2]] 
print('df1:') 
print(df1) 
print('df2:') 
print(df2)

But the result is as follows, with unexpected NaN values.

df1: 
     0  1   2  3 
0 Name Unit Attribute Date 
1  a  F   1 NaN 
2  b  G   2 NaN 
3  c  C   3 2016 
4  d  D   4 2017 
5  e NaN   5 NaN 
df2: 
     0  1  2 
0 Name Unit Date 
1  a  F 2019 
2  b  G 2020 
3  e  H 2021 
4  f  I 2022

Thanks in advance for any help.

Source

2017-09-15 John

Some cleanup first:

def clean_df(df): 
    df.columns = df.iloc[0] 
    df.columns.name = None   
    df = df.iloc[1:].reset_index() 

    return df 

df1 = clean_df(df1) 
df1 
    index Name Unit Attribute Date 
0  1 a A   1 2014 
1  2 b B   2 2015 
2  3 c C   3 2016 
3  4 d D   4 2017 
4  5 e E   5 2018 

df2 = clean_df(df2) 
df2  
    index Name Unit Date 
0  1 a F 2019 
1  2 b G 2020 
2  3 e H 2021 
3  4 f I 2022

Use merge and specify on='Name', so that the other columns are not treated as join keys.

cols = ['Name', 'Unit_y', 'Attribute', 'Date_y'] 
df1 = (df1.merge(df2, how='left', on='Name')[cols]
          .rename(columns=lambda x: x.split('_')[0])
          .fillna(df1))

df1 
    Name Unit Attribute Date 
0 a F   1 2019 
1 b G   2 2020 
2 c C   3 2016 
3 d D   4 2017 
4 e H   5 2021
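The key-based idea above can also be sketched with DataFrame.update, which is not part of the original answer. This is a minimal sketch that assumes Name is unique in both frames and that the frames are built with proper column names (rather than the header-in-row-0 layout from the question). The variable name `updated` is only illustrative.

```python
import pandas as pd

# Frames equivalent to the cleaned df1/df2 above, built directly for the sketch
df1 = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e"],
    "Unit": ["A", "B", "C", "D", "E"],
    "Attribute": [1, 2, 3, 4, 5],
    "Date": [2014, 2015, 2016, 2017, 2018],
})
df2 = pd.DataFrame({
    "Name": ["a", "b", "e", "f"],
    "Unit": ["F", "G", "H", "I"],
    "Date": [2019, 2020, 2021, 2022],
})

# Index both frames by the key; update() then aligns on index and column labels
updated = df1.set_index("Name")
updated.update(df2.set_index("Name"))  # modifies `updated` in place; 'f' is ignored
updated = updated.reset_index()

print(updated)
```

Rows of df2 whose Name does not occur in df1 (here 'f') are dropped silently, and columns that exist only in df1 (Attribute) are left untouched. Depending on the pandas version, numeric columns touched by update may come back as floats, so cast them back if the dtype matters.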

Source

2017-09-15 13:30:37

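@John I have already shown you how to get your output. –

@John If you are going to insist and say you got the wrong answer, that is because of your data, not my problem. I should note that this is the second time you have done this, refusing to acknowledge the effort put into answering your question and dealing with the broken data that came with it. –

@COLDSPEED I really appreciate your help. After using df1.T.reset_index().T, the result I see in the notebook is just a frame without the first row of index labels 0, 1, 2, 3; that is, 'Name', 'Unit', etc. are what df1.columns.values returns. – John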

Another approach, using merge together with ffill and drop_duplicates:

new_df = (df1.merge(df2,on=[0],how='outer').T.set_index(0).sort_index()
             .ffill().reset_index().drop_duplicates(0,keep='last').T.dropna())

 
      0  2  3  5 
0 Attribute Date Name Unit 
1   1 2019  a  F 
2   2 2020  b  G 
3   3 2016  c  C 
4   4 2017  d  D 
5   5 2021  e  H

Explanation

df1.merge(df2,on=[0],how='outer').T.set_index(0).sort_index()

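Transposing the merged DataFrame gives a frame in which the duplicated fields sit in adjacent rows, so that ffill can be applied to fill the NaN values.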

 
      1  2  3  4  5  6 
0            
Attribute  1  2  3  4  5 NaN 
Date  2014 2015 2016 2017 2018 NaN 
Date  2019 2020 NaN NaN 2021 2022 
Name   a  b  c  d  e  f 
Unit   A  B  C  D  E NaN 
Unit   F  G NaN NaN  H  I

.ffill().reset_index().drop_duplicates(0,keep='last')

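This fills each NaN with the value from the previous row. After reset_index, drop_duplicates with subset 0 and keep='last' retains, for each duplicated field, only the last row, which is the fully filled one.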

 
     0  1  2  3  4  5  6 
0 Attribute  1  2  3  4  5 NaN 
2  Date 2019 2020 2016 2017 2021 2022 
3  Name  a  b  c  d  e  f 
5  Unit  F  G  C  D  H  I

.T.dropna()

This transposes the DataFrame back and drops the rows that still contain NaN values, which gives the desired output.
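The ffill and drop_duplicates step can be seen in isolation on a small frame. This is a sketch only: the frame `t` and its column names `key`, `a`, and `c` are made up for illustration and mimic two of the duplicated fields from the transposed table above.

```python
import numpy as np
import pandas as pd

# Each field appears twice: first the old values, then the new ones (NaN = no update)
t = pd.DataFrame({
    "key": ["Date", "Date", "Unit", "Unit"],
    "a":   ["2014", "2019", "A", "F"],
    "c":   ["2016", np.nan, "C", np.nan],
})

# ffill copies the old value down into the NaN of the "new" row
filled = t.ffill()

# keep='last' retains the second (new, now fully filled) row of each pair
result = filled.drop_duplicates(subset="key", keep="last")

print(result)
```

Note that this relies on the old row sitting directly above the new row for every field, which is what sort_index arranges in the answer.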

Source

2017-09-15 13:33:43 Dark

I also found that the code below does what I want, and it is much faster than the two for loops.

df1.loc[[1,2,5],[1,3]] = df2.loc[[1,2,3],[1,2]].values
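The reason .values makes a difference is label alignment: when the right-hand side of a .loc assignment is a DataFrame, pandas matches its row and column labels against the target, and anything without a matching label becomes NaN. A plain array carries no labels, so it is assigned by position. The sketch below reproduces both behaviours on small stand-in frames; `left`, `right`, `aligned`, and `positional` are illustrative names, and the values are strings as in the question's data.

```python
import pandas as pd

# Stand-ins for the selected parts of df1 (rows 1,2,5; cols 1,3) and df2 (rows 1,2,3; cols 1,2)
left = pd.DataFrame({1: ["A", "B", "E"], 3: ["2014", "2015", "2018"]}, index=[1, 2, 5])
right = pd.DataFrame({1: ["F", "G", "H"], 2: ["2019", "2020", "2021"]}, index=[1, 2, 3])

# DataFrame on the right: aligned by label. Column 3 and row 5 have no match in `right`.
aligned = left.copy()
aligned.loc[[1, 2, 5], [1, 3]] = right.loc[[1, 2, 3], [1, 2]]

# Array on the right: assigned by position, shapes must match (3 x 2 here).
positional = left.copy()
positional.loc[[1, 2, 5], [1, 3]] = right.loc[[1, 2, 3], [1, 2]].to_numpy()

print(aligned)
print(positional)
```

to_numpy() is used here in place of .values; both return the underlying array for this purpose. Because the assignment is purely positional, the row and column lists on both sides must be in corresponding order.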

Source

2017-09-18 11:40:43 John
