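Pandas equality check too slow to use
I have a requirement to check for records that changed from one DataFrame to another. The comparison must match on all columns.
One is an Excel file (new_df) and the other is a SQL query (sql_df). Each has a shape of roughly 20,000 rows × 39 columns. I thought this would be a good job for df.equals(other_df).
Currently I am using the following, which works: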
import pandas as pd
import numpy as np

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
new_df = pd.DataFrame({'ID' : [0 ,1, 2, 3, 4, 5, 6, 7, 8, 9],
                       'B' : [1,0,3,5,0,0,np.nan,9,0,0],
                       'C' : [10,0,30,50,0,0,4,10,1,3],
                       'D' : [1,0,3,4,0,0,7,8,0,1],
                       'E' : ['Universtiy of New York','New Hampshire University','JMU','Oklahoma State','Penn State',
                              'New Mexico Univ','Rutgers','Indiana State','JMU','University of South Carolina']})
sql_df = pd.DataFrame({'ID' : [0 ,1, 2, 3, 4, 5, 6, 7, 8, 9],
                       'B' : [1,0,3,5,0,0,np.nan,9,0,0],
                       'C' : [10,0,30,50,0,0,4,10,1,0],
                       'D' : [5,0,3,4,0,0,7,8,0,1],
                       'E' : ['Universtiy of New York','New Hampshire University','NYU','Oklahoma State','Penn State',
                              'New Mexico Univ','Rutgers','Indiana State','NYU','University of South Carolina']})
# creates an empty list to append to
differences = []
# for all the IDs in the dataframe that should not change check if this record is the same in the database
# must use reset_index() so the equals() will work as I expect it to
# if it is not the same, append to a list which has the Aspn ID that is failing, along with the columns that changed
for unique_id in new_df['ID'].tolist():
    # get the id from the list, and filter both sql and new dfs to this record
    if new_df.loc[new_df['ID'] == unique_id].reset_index(drop=True).equals(sql_df.loc[sql_df['ID'] == unique_id].reset_index(drop=True)) is False:
        bad_columns = []
        for column in new_df.columns.tolist():
            # if not the same above, check which column using the same logic
            if new_df.loc[new_df['ID'] == unique_id][column].reset_index(drop=True).equals(sql_df.loc[sql_df['ID'] == unique_id][column].reset_index(drop=True)) is False:
                bad_columns.append(column)
        differences.append([unique_id, bad_columns])
Later I take differences and bad_columns and perform other tasks with them.
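There are a lot of loops here that I would like to avoid, since they are probably the cause of my performance problem. Currently 20,000 records take over 5 minutes (it varies by hardware), which is terrible performance. I thought about joining/concatenating all the columns into one long string to compare, but that seems like another inefficient approach. What would be a better way to solve this? And how can I avoid this messy append-to-an-empty-list solution?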
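For illustration, the kind of loop-free comparison I am hoping for might look something like the sketch below. It is only a rough sketch under assumptions: `find_differences` is a hypothetical helper name (not a pandas function), IDs are assumed to be unique in both frames, and both frames are assumed to share the same columns. Unlike `equals`, it compares values rather than dtypes, so an integer 1 and a float 1.0 count as the same here.

```python
import pandas as pd


def find_differences(new_df, sql_df, key="ID"):
    """Return [[id, [changed columns]], ...] for rows that differ.

    Hypothetical helper; assumes `key` values are unique in both frames.
    """
    # Align both frames on the key so rows are compared by ID, not by position.
    left = new_df.set_index(key).sort_index()
    right = sql_df.set_index(key).reindex(index=left.index, columns=left.columns)

    # Elementwise inequality. NaN != NaN evaluates to True, so positions where
    # both sides are missing are masked out (equals() treats those as equal).
    both_nan = left.isna() & right.isna()
    changed = left.ne(right) & ~both_nan

    # Keep only the rows that have at least one changed column.
    changed = changed[changed.any(axis=1)]
    return [[idx, changed.columns[row].tolist()]
            for idx, row in zip(changed.index, changed.to_numpy())]
```

Calling `find_differences(new_df, sql_df)` on the sample frames above should produce the same kind of `[unique_id, bad_columns]` pairs as my loop, except that the `ID` column itself is never reported because it is used as the index. An ID that exists only in new_df would show up as changed in every non-missing column.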
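What makes you think `equals` is the culprit? – user2357112
@user2357112 - Valid point. It could *easily* be the number of loops - I updated the title to be less misleading about this – MattR
A sample of `new_df` and `sql_df` (or something that looks similar) would greatly help in providing a working solution. – FabienP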