2017-08-25 15 views
3

假設我有一個數據幀,看起來像如何推進基於過去的差異大熊貓恆定的總和填寫NaN的

internal_id time_id  membership 
a   2017-01-01 10 
a   2017-02-01 20 
a   2017-03-01 
a   2017-04-01 
b   2017-01-01 10 
b   2017-02-01 15 
b   2017-03-01 
b   2017-04-01 
c   2017-01-01 10 
c   2017-02-01 12 
c   2017-03-01 
c   2017-04-01 

我想轉發填補採用最新的DIFF每個internal_id。因此,例如,對於internal_id = 'a',我希望每次都向前加註10。對於'b'我希望每次都使用5來轉發填充。對於'c'我想每次都要前進填充2。所以,我最後的DF的樣子:

internal_id time_id  membership 
a   2017-01-01 10 
a   2017-02-01 20 
a   2017-03-01 30 
a   2017-04-01 40 
b   2017-01-01 10 
b   2017-02-01 15 
b   2017-03-01 20 
b   2017-04-01 25 
c   2017-01-01 10 
c   2017-02-01 12 
c   2017-03-01 14 
c   2017-04-01 16 

我想pd.ffill但似乎只能填寫使用的恆定值。有沒有一種方法可以用不變的總和來填充?

+0

添加定時到[我的回答(https://stackoverflow.com/a/45874258/4909087) - 爲所有有關各方。 –

回答

1

這裏是一個辦法做到這一點,我不是這個解決方案而感到自豪,但我會想辦法優化:

d2 = (df.groupby('internal_id') 
     .transform(lambda x: x.fillna(x.diff().max()) 
           .where(x.index != x.index.min()).cumsum())) 
df.combine_first(d2) 

輸出:

internal_id membership  time_id 
0   a  10.0 2017-01-01 
1   a  20.0 2017-02-01 
2   a  30.0 2017-03-01 
3   a  40.0 2017-04-01 
4   b  10.0 2017-01-01 
5   b  15.0 2017-02-01 
6   b  20.0 2017-03-01 
7   b  25.0 2017-04-01 
8   c  10.0 2017-01-01 
9   c  12.0 2017-02-01 
10   c  14.0 2017-03-01 
11   c  16.0 2017-04-01 
2

這裏有一個方法

In [38]: df.assign(membership=df.groupby('internal_id').membership 
      .transform(lambda x: x.ffill() + x.diff().shift().ffill().cumsum().fillna(0))) 
Out[38]: 
    internal_id  time_id membership 
0   a 2017-01-01  10.0 
1   a 2017-02-01  20.0 
2   a 2017-03-01  30.0 
3   a 2017-04-01  40.0 
4   b 2017-01-01  10.0 
5   b 2017-02-01  15.0 
6   b 2017-03-01  20.0 
7   b 2017-04-01  25.0 
8   c 2017-01-01  10.0 
9   c 2017-02-01  12.0 
10   c 2017-03-01  14.0 
11   c 2017-04-01  16.0 
2
from numba import njit 
import pandas as pd, numpy as np 

@njit 
def dfill(g, v, n): 
    d = np.arange(n) * np.nan 
    l = np.arange(n) * np.nan 
    c = np.arange(n) * np.nan 
    r = np.arange(g.size) * np.nan 
    for i in range(g.size): 
     x = g[i] 
     y = v[i] 
     if np.isnan(y): 
      c[x] += d[x] 
      r[i] = c[x] 
     else: 
      d[x] = y - l[x] 
      r[i] = l[x] = c[x] = y 
    return r 

g, u = pd.factorize(df.internal_id.values) 
df.assign(membership=dfill(g, df.membership.values, u.size)) 

    internal_id  time_id membership 
0   a 2017-01-01  10.0 
1   a 2017-02-01  20.0 
2   a 2017-03-01  30.0 
3   a 2017-04-01  40.0 
4   b 2017-01-01  10.0 
5   b 2017-02-01  15.0 
6   b 2017-03-01  20.0 
7   b 2017-04-01  25.0 
8   c 2017-01-01  10.0 
9   c 2017-02-01  12.0 
10   c 2017-03-01  14.0 
11   c 2017-04-01  16.0 

定時
下面的代碼

 pir   john  scott 
10 1.0 61.777804 69.453241 
30 1.0 173.136378 183.454086 
100 1.0 518.193661 589.781918 
300 1.0 1336.147050 1497.396670 
1000 1.0 2502.225163 2621.045714 

enter image description here

@njit 
def dfill(g, v, n): 
    d = np.arange(n) * np.nan 
    l = np.arange(n) * np.nan 
    c = np.arange(n) * np.nan 
    r = np.arange(g.size) * np.nan 
    for i in range(g.size): 
     x = g[i] 
     y = v[i] 
     if np.isnan(y): 
      c[x] += d[x] 
      r[i] = c[x] 
     else: 
      d[x] = y - l[x] 
      r[i] = l[x] = c[x] = y 
    return r 

def pir(d): 
    g, u = pd.factorize(d.internal_id.values) 
    return d.assign(membership=dfill(g, d.membership.values, u.size)) 

def john(d): 
    return d.assign(membership=d.groupby('internal_id').membership.transform(lambda x: x.ffill() + x.diff().shift().ffill().cumsum().fillna(0))) 

def scott(d): 
    d2 = (d.groupby('internal_id') 
      .transform(lambda x: x.fillna(x.diff().max()) 
            .where(x.index != x.index.min()).cumsum())) 
    return d.combine_first(d2) 

results = pd.DataFrame(
    index=[10, 30, 100, 300, 1000], 
    columns='pir john scott'.split(), 
    dtype=float 
) 

k = len(df) 
for i in results.index: 
    r = np.arange(1, i + 1).repeat(k).astype(str) 
    d = pd.concat([df] * i, ignore_index=True) 
    d.internal_id += r 
    for j in results.columns: 
     stmt = '{}(d)'.format(j) 
     setp = 'from __main__ import d, {}'.format(j) 
     results.at[i, j] = timeit(stmt, setp, number=10) 

(lambda r: r.div(r.min(1), 0))(results)