2016-05-03 27 views
1

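Efficiently comparing data across rows in a Pandas DataFrame

I have a CSV file of monthly phone bills, in no particular order, which I read into a Pandas DataFrame. For each bill I want to add a column showing how much it differs from the previous bill for the same account. The CSV below is just a subset of my data. My code works, but it is sloppy, and it is very slow on a CSV file of close to a million rows.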

What can I do to make this more efficient?

CSV:

Account Number,Bill Month,Bill Amount 
4543,3/1/2015,300 
4543,1/1/2015,100 
4543,2/1/2015,200 
2322,1/1/2015,22 
2322,3/1/2015,38 
2322,2/1/2015,25 

Python:

import numpy as np
import pandas as pd
data = pd.read_csv('data.csv', low_memory=False)

# sort my data and reset the index so I can use index and index - 1 in the loop
data = data.sort_values(by=['Account Number', 'Bill Month'])
data = data.reset_index(drop=True)

# add a blank column for the difference (object dtype, because it holds both numbers and "-")
data['Difference'] = np.nan
data['Difference'] = data['Difference'].astype(object)

for index, row in data.iterrows():

    # special handling for the first row so I don't get negative indexes
    if index == 0:
        data.loc[index, 'Difference'] = "-"
    else:
        # if the account in the current row and the row before are the same, then compare Bill Amounts
        if data.loc[index, 'Account Number'] == data.loc[index - 1, 'Account Number']:
            data.loc[index, 'Difference'] = data.loc[index, 'Bill Amount'] - data.loc[index - 1, 'Bill Amount']
        else:
            data.loc[index, 'Difference'] = "-"

print(data)

Desired output:

   Account Number Bill Month  Bill Amount Difference
0            2322   1/1/2015           22          -
1            2322   2/1/2015           25          3
2            2322   3/1/2015           38         13
3            4543   1/1/2015          100          -
4            4543   2/1/2015          200        100
5            4543   3/1/2015          300        100

Answers

1

Try this:

In [37]: df = df.sort_values(['Account Number','Bill Month']) 

In [38]: df['Difference'] = (df.groupby(['Account Number'])['Bill Amount'] 
    ....:      .diff() 
    ....:      .fillna('-') 
    ....:     ) 

In [39]: df 
Out[39]: 
   Account Number Bill Month  Bill Amount Difference
3            2322 2015-01-01           22          -
5            2322 2015-02-01           25          3
4            2322 2015-03-01           38         13
1            4543 2015-01-01          100          -
2            4543 2015-02-01          200        100
0            4543 2015-03-01          300        100

Explanation:

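diff() is applied to each group separately. For every row it returns the difference between the current value and the previous value in the same group, so the first row of each group is NaN: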

In [123]: df.groupby(['Account Number'])['Bill Amount'].diff() 
Out[123]: 
3  NaN 
5  3.0 
4  13.0 
1  NaN 
2 100.0 
0 100.0 
dtype: float64 
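
To make the per-group behaviour concrete, here is a minimal sketch that rebuilds the sample data from the question and compares groupby().diff() with an explicit "current minus previous" computation using shift(1). The variable names (amounts, via_diff, via_shift) are made up for this illustration and do not appear in the original answer.

```python
import pandas as pd

# Sample data copied from the CSV in the question.
df = pd.DataFrame({
    'Account Number': [4543, 4543, 4543, 2322, 2322, 2322],
    'Bill Month': pd.to_datetime(['2015-03-01', '2015-01-01', '2015-02-01',
                                  '2015-01-01', '2015-03-01', '2015-02-01']),
    'Bill Amount': [300, 100, 200, 22, 38, 25],
})
df = df.sort_values(['Account Number', 'Bill Month'])

amounts = df.groupby('Account Number')['Bill Amount']

# diff() within each group ...
via_diff = amounts.diff()

# ... computed by hand: current value minus the previous value in the same group.
# shift(1) never crosses a group boundary, so each group's first row becomes NaN.
via_shift = df['Bill Amount'] - amounts.shift(1)

# Side-by-side view of both results (hypothetical column labels).
print(pd.DataFrame({'diff': via_diff, 'shift': via_shift}))
```

Because both diff() and shift() restart at each account, no explicit check for "is the previous row the same account" is needed, which is what replaces the row-by-row loop.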

fillna('-') replaces every NaN with the specified value, '-'. Note that this changes the dtype from float64 to object:

In [124]: df.groupby(['Account Number'])['Bill Amount'].diff().fillna('-') 
Out[124]: 
3  - 
5  3 
4  13 
1  - 
2 100 
0 100 
dtype: object 
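
Since the output above shows the column becoming object dtype once '-' is filled in, here is a hedged sketch of one alternative, assuming the differences are still needed for arithmetic later: keep NaN in the numeric column and put the '-' placeholder in a separate display column. The column name 'Difference (display)' is hypothetical, and this is only one possible arrangement, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the CSV in the question.
df = pd.DataFrame({
    'Account Number': [4543, 4543, 4543, 2322, 2322, 2322],
    'Bill Month': pd.to_datetime(['2015-03-01', '2015-01-01', '2015-02-01',
                                  '2015-01-01', '2015-03-01', '2015-02-01']),
    'Bill Amount': [300, 100, 200, 22, 38, 25],
})
df = df.sort_values(['Account Number', 'Bill Month'])

diff = df.groupby('Account Number')['Bill Amount'].diff()

# Numeric column: NaN marks "no previous bill", and the dtype stays float64.
df['Difference'] = diff

# Display-only column (hypothetical name): cast to object first,
# then swap the missing entries for '-'.
df['Difference (display)'] = diff.astype(object).where(diff.notna(), '-')

print(df.dtypes)
# Aggregations on the numeric column skip NaN by default.
print(df['Difference'].sum())
```

Keeping the numeric column lets later steps such as sum() or mean() work without first converting '-' back to a missing value.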

Thanks, this is so clean! Could you add a quick explanation of how df.diff() knows which values to subtract from which, and when 'fillna()' is applied? – user2242044

@user2242044, I have added an explanation to my answer, please check. – MaxU

Thank you! – user2242044

1
import pandas as pd

# Sample data from the question, with Bill Month already parsed as timestamps
df = pd.DataFrame({
    'Account Number': {0: 4543, 1: 4543, 2: 4543, 3: 2322, 4: 2322, 5: 2322},
    'Bill Amount': {0: 300.0, 1: 100.0, 2: 200.0, 3: 22.0, 4: 38.0, 5: 25.0},
    'Bill Month': {
        0: pd.Timestamp('2015-03-01 00:00:00'),
        1: pd.Timestamp('2015-01-01 00:00:00'),
        2: pd.Timestamp('2015-02-01 00:00:00'),
        3: pd.Timestamp('2015-01-01 00:00:00'),
        4: pd.Timestamp('2015-03-01 00:00:00'),
        5: pd.Timestamp('2015-02-01 00:00:00')}})

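You can group on Account Number and Bill Month (which sorts by default), sum the Bill Amount (or, if you are guaranteed only one bill per month, just take the first), group again on the first level of the index (Account Number), and use diff to get the differences.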

>>> (df.groupby(['Account Number', 'Bill Month'])['Bill Amount'] 
     .sum() 
     .groupby(level=0) 
     .diff()) 
Account Number  Bill Month
2322            2015-01-01    NaN
                2015-02-01      3
                2015-03-01     13
4543            2015-01-01    NaN
                2015-02-01    100
                2015-03-01    100
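
The result above is a Series indexed by both Account Number and Bill Month. As a sketch of how it could be turned back into ordinary columns, assuming a flat table like the one in the question is wanted, the chain below names the Series and calls reset_index(). The column name 'Difference' and the variable name result are choices made for this illustration.

```python
import pandas as pd

# Sample data copied from the CSV in the question.
df = pd.DataFrame({
    'Account Number': [4543, 4543, 4543, 2322, 2322, 2322],
    'Bill Month': pd.to_datetime(['2015-03-01', '2015-01-01', '2015-02-01',
                                  '2015-01-01', '2015-03-01', '2015-02-01']),
    'Bill Amount': [300, 100, 200, 22, 38, 25],
})

result = (df.groupby(['Account Number', 'Bill Month'])['Bill Amount']
            .sum()                  # one total per account and month
            .groupby(level=0)       # regroup by the first index level (Account Number)
            .diff()                 # month-over-month change within each account
            .rename('Difference')   # name the Series so the new column gets a label
            .reset_index())         # move both index levels back into columns

print(result)
```

After reset_index() the result is a DataFrame with Account Number, Bill Month and Difference as regular columns and a fresh integer index.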

Thanks, this works well. In this example, is Account Number now the index column? – user2242044

Both Account Number and Bill Month. – Alexander
