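Efficiently comparing data across rows in a Pandas DataFrame
I have a CSV file of monthly phone bills, in no particular order, which I read into a Pandas DataFrame. I want to add a column to each bill showing how much it differs from the previous bill for the same account. This CSV is only a subset of my data. My code works, but it is sloppy and very slow on CSV files approaching a million rows.
What can I do to make this more efficient?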
CSV:
Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25
Python:
import numpy as np
import pandas as pd
data = pd.read_csv('data.csv', low_memory=False)
# sort my data and reset the index so I can use index and index - 1 in the loop
data = data.sort_values(by=['Account Number', 'Bill Month'])
data = data.reset_index(drop=True)
# add a blank column for the difference (object dtype, since it holds both numbers and "-")
data['Difference'] = pd.Series(np.nan, index=data.index, dtype=object)
for index, row in data.iterrows():
    # special handling for the first row so I don't get negative indexes
    if index == 0:
        data.loc[index, 'Difference'] = "-"
    else:
        # if the account in the current row and the row before are the same, then compare Bill Amounts
        if data.loc[index, 'Account Number'] == data.loc[index - 1, 'Account Number']:
            data.loc[index, 'Difference'] = data.loc[index, 'Bill Amount'] - data.loc[index - 1, 'Bill Amount']
        else:
            data.loc[index, 'Difference'] = "-"
print(data)
Desired output:
   Account Number Bill Month  Bill Amount Difference
0            2322   1/1/2015           22          -
1            2322   2/1/2015           25          3
2            2322   3/1/2015           38         13
3            4543   1/1/2015          100          -
4            4543   2/1/2015          200        100
5            4543   3/1/2015          300        100
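The comments below mention `df.diff()` and `fillna()`. As a rough illustration of that kind of vectorized approach (a sketch only, not necessarily the exact answer being discussed), the loop can probably be replaced by a grouped `diff()`. The inline `csv_text` string stands in for `data.csv`, and the `%m/%d/%Y` date format is an assumption based on the sample rows.

```python
import io

import pandas as pd

# Inline copy of the sample CSV (stand-in for 'data.csv', for illustration only)
csv_text = """Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25
"""

data = pd.read_csv(io.StringIO(csv_text))

# Parse dates so months sort chronologically rather than as strings
# (assumed format: month/day/year)
data['Bill Month'] = pd.to_datetime(data['Bill Month'], format='%m/%d/%Y')

# Order the bills within each account
data = data.sort_values(['Account Number', 'Bill Month']).reset_index(drop=True)

# diff() within each account subtracts the previous row of the same group;
# the first bill of every account has no predecessor and becomes NaN
data['Difference'] = data.groupby('Account Number')['Bill Amount'].diff()

print(data)
```

Within each group, `diff()` is equivalent to subtracting the group's values shifted down by one row, which is why the sort has to happen first. Leaving the first bill of each account as `NaN` keeps the column numeric; replacing it with `"-"` for display (for example via `fillna('-')`) would turn the column into object dtype.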
Thanks, this is so clean! Could you add a quick explanation of how df.diff() knows which values to subtract, and when 'fillna()' is applied? – user2242044
@user2242044, I've added an explanation to my answer - please check – MaxU
Thanks! – user2242044