2017-09-19 65 views
2

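pandas - number of consecutive preceding values higher/lower than the current row

I am looking for a way to take a pandas Series and return a new Series that gives, for each row, the number of consecutive preceding values that the row is higher or lower than: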

a = pd.Series([30, 10, 20, 25, 35, 15]) 

...should output:

Value Higher than streak Lower than streak 
30  0     0 
10  0     1 
20  1     0 
25  2     0 
35  4     0 
15  0     3 

This would let someone identify how significant each "local max/min" value in a time series is.

Thanks in advance.

Answers

2

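You will have to interact with the index in some way. This solution first compares every value before the current index with the current value to see whether it is less than or greater than it, then treats as False any value that has a False after it. It also avoids creating an iterator over the DataFrame, which may speed things up on large datasets.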

import pandas as pd 
from operator import gt, lt 

a = pd.Series([30, 10, 20, 25, 35, 15]) 

def consecutive_run(op, ser, i): 
    """ 
    Sum the uninterrupted consecutive runs at index i in the series where the previous data 
    was true according to the operator. 
    """ 
    thresh_all = op(ser[:i], ser[i]) 
    # find any data where the operator was not passing. set the previous data to all falses 
    non_passing = thresh_all[~thresh_all] 
    start_idx = 0 
    if not non_passing.empty:
        # if there was a failure, there was a break in the consecutive truth values,
        # so get the final False position. Starting index will be False, but it
        # will either be at the end of the series selection and will sum to zero
        # or will be followed by all successive True values afterwards
        start_idx = non_passing.index[-1]
    # count the consecutive runs by summing from the start index onwards 
    return thresh_all[start_idx:].sum() 


# "Higher than streak" counts preceding values lower than the current one (lt);
# "Lower than streak" counts preceding values higher than the current one (gt)
res = pd.concat([a, a.index.to_series().map(lambda i: consecutive_run(lt, a, i)),
                 a.index.to_series().map(lambda i: consecutive_run(gt, a, i))],
                axis=1)
res.columns = ['Value', 'Higher than streak', 'Lower than streak']
print(res)

Result:

Value Higher than streak Lower than streak
0  30     0     0
1  10     0     1
2  20     1     0
3  25     2     0
4  35     4     0
5  15     0     3
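The step described above, "treat as False any value that has a False after it", can also be written as a reversed cumulative product. The following is only a minimal sketch of that idea, not the answer's own code; the helper name `trailing_true_run` is hypothetical, and it assumes a default integer index as in the example.

```python
import numpy as np
import pandas as pd

def trailing_true_run(mask):
    """Length of the unbroken run of True values at the end of a boolean array."""
    mask = np.asarray(mask, dtype=bool)
    # Walking from the end, the cumulative product stays 1 until the first
    # False is met and is 0 from there on, so its sum is the run length.
    return int(np.cumprod(mask[::-1].astype(int)).sum())

a = pd.Series([30, 10, 20, 25, 35, 15])
vals = a.values

# "higher": the current value is higher than the preceding values in the run
higher = [trailing_true_run(vals[:i] < vals[i]) for i in range(len(vals))]
# "lower": the current value is lower than the preceding values in the run
lower = [trailing_true_run(vals[:i] > vals[i]) for i in range(len(vals))]

print(pd.DataFrame({'Value': a, 'Higher than streak': higher, 'Lower than streak': lower}))
```

For the example series this gives 0, 0, 1, 2, 4, 0 for the higher streak and 0, 1, 0, 0, 0, 3 for the lower streak. It still evaluates one slice per row, so the overall cost remains quadratic in the worst case.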
+1

Thanks, I didn't think we would find a solution that avoids loops.

+0

Updated to use a slightly more efficient summing algorithm: it just grabs the trailing values and sums them. – benjwadams

0
import pandas as pd 
import numpy as np 

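value = pd.Series([30, 10, 20, 25, 35, 15])

# NOTE: these count every earlier value above/below the current one,
# not only the unbroken run directly before it
Lower = [(value[x] < value[:x]).sum() for x in range(len(value))]
Higher = [(value[x] > value[:x]).sum() for x in range(len(value))]

df = pd.DataFrame({"value": value, "Higher": Higher, "Lower": Lower})

print(df)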





     Lower Higher value 
0  0  0  30 
1  1  0  10 
2  1  1  20 
3  1  2  25 
4  0  4  35 
5  4  1  15 
+0

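Thanks for the answer. Unfortunately, this solution does not give the result I expected, because each row should only be evaluated against the rows before it. For example, for the second observation, 10 is lower than 30, so Lower column = 1 and Upper column = 0.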

+0

I have edited my answer. – 2Obe

+0

You may have to rename Higher and Lower to match the logic you have in mind. – 2Obe

0

Edit: updated to actually count consecutive values. I could not come up with a workable pandas solution, so we are back to loops.

import numpy as np
import pandas as pd

df = pd.Series(np.random.rand(10000))

def count_bigger_consecutives(values):
    length = len(values)
    result = np.zeros(length)
    for i in range(length):
        # j runs forward from the start of the array
        for j in range(i):
            if values[i] > values[j]:
                result[i] += 1
            else:
                break
    return result

%timeit count_bigger_consecutives(df.values) 
1 loop, best of 3: 365 ms per loop 

If performance is your concern, you can get a speed-up with numba, a just-in-time compiler for Python code. In this example you can really see numba shine:

from numba import jit

@jit(nopython=True)
def numba_count_bigger_consecutives(values):
    length = len(values)
    result = np.zeros(length)
    for i in range(length):
        # j runs forward from the start of the array
        for j in range(i):
            if values[i] > values[j]:
                result[i] += 1
            else:
                break
    return result

%timeit numba_count_bigger_consecutives(df.values) 
The slowest run took 543.09 times longer than the fastest. This could mean that an intermediate result is being cached. 
10000 loops, best of 3: 161 µs per loop 
+0

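Thanks. Very interesting, I was not familiar with expand(). However, this is not quite the expected behavior. I need to know the maximum number of consecutive past observations in my time series over which the current row would still be the max() or min().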

+0

@BrunoVieira I have updated my solution.

+0

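Wow. That is much faster. Thanks for sharing this solution. Unfortunately, the result comes out as array([0., 0., 0., 0., 4., 0.]), whereas I expected 0, 0, 1, 2, 4, 0. Since it looks like a solution still needs a loop, your suggestion to use numba remains very useful.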
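The mismatch reported in this comment appears to come from the inner loop scanning forward from index 0, so it stops at the first early value that is not smaller instead of measuring the run directly before the current row. Below is a minimal sketch that scans backward from the nearest preceding value; the function name `count_smaller_run_backward` is hypothetical and not part of the answer.

```python
import numpy as np

def count_smaller_run_backward(values):
    """For each position, count the immediately preceding values that are
    smaller than the current one, stopping at the first that is not."""
    length = len(values)
    result = np.zeros(length, dtype=np.int64)
    for i in range(length):
        # scan from the nearest previous value back toward the start
        for j in range(i - 1, -1, -1):
            if values[i] > values[j]:
                result[i] += 1
            else:
                break
    return result

print(count_smaller_run_backward(np.array([30, 10, 20, 25, 35, 15])))
# prints [0 0 1 2 4 0]
```

Flipping the comparison to `values[i] < values[j]` would give the lower streak. The loop has the same shape as the numba version in the answer, so the same decorator could presumably be applied, though that is not tested here.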

0

Here is a solution a colleague came up with (probably not the most efficient, but it does the trick):

Input data

a = pd.Series([30, 10, 20, 25, 35, 15]) 

Create the 'Higher' column

b = [] 

for idx, value in enumerate(a):
    count = 0
    # walk backward from the current row until a larger value is found
    for i in range(idx, 0, -1):
        if value < a.loc[i-1]:
            break
        count += 1
    b.append([value, count])

higher = pd.DataFrame(b, columns=['Value', 'Higher']) 

Create the 'Lower' column

c = [] 

for idx, value in enumerate(a):
    count = 0
    # walk backward from the current row until a smaller value is found
    for i in range(idx, 0, -1):
        if value > a.loc[i-1]:
            break
        count += 1
    c.append([value, count])

lower = pd.DataFrame(c, columns=['Value', 'Lower']) 

Merge the two new frames

print(pd.merge(higher, lower, on='Value')) 

    Value Higher Lower 
0  30  0  0 
1  10  0  1 
2  20  1  0 
3  25  2  0 
4  35  4  0 
5  15  0  3 
1

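Here is my solution. It has a loop, but the number of iterations is only the maximum streak length. It stores, for each row, whether its streak has been fully computed, and stops once every row is done. It uses shift to test whether the previous row is higher/lower, and keeps increasing the shift until all streaks are found.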

import numpy as np
import pandas as pd

a = pd.Series([30, 10, 20, 25, 35, 15, 15])

a_not_done_greater = pd.Series(np.ones(len(a))).astype(bool) 
a_not_done_less = pd.Series(np.ones(len(a))).astype(bool) 

a_streak_greater = pd.Series(np.zeros(len(a))).astype(int) 
a_streak_less = pd.Series(np.zeros(len(a))).astype(int) 

s = 1 
not_done_greater = True 
not_done_less = True 

while not_done_greater or not_done_less: 
    if not_done_greater:
        a_greater_than_shift = (a > a.shift(s))
        a_streak_greater = a_streak_greater + (a_not_done_greater.astype(int) * a_greater_than_shift)
        a_not_done_greater = a_not_done_greater & a_greater_than_shift
        not_done_greater = a_not_done_greater.any()

    if not_done_less:
        a_less_than_shift = (a < a.shift(s))
        a_streak_less = a_streak_less + (a_not_done_less.astype(int) * a_less_than_shift)
        a_not_done_less = a_not_done_less & a_less_than_shift
        not_done_less = a_not_done_less.any()

    s = s + 1 


res = pd.concat([a, a_streak_greater, a_streak_less], axis=1) 
res.columns = ['value', 'greater_than_streak', 'less_than_streak'] 
print(res) 

Because it looks backward through the previous values to check for consecutive runs, this gives the following DataFrame:

value greater_than_streak less_than_streak 
0  30     0     0 
1  10     0     1 
2  20     1     0 
3  25     2     0 
4  35     4     0 
5  15     0     3 
6  15     0     0 
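The shift-based loop above can also be folded into a small reusable function that handles one comparison direction at a time. This is only a sketch of the same idea under that assumption; the name `shift_streak` and its `op` parameter are hypothetical and do not appear in the answer.

```python
import operator

import pandas as pd

def shift_streak(ser, op):
    """For each row, count the unbroken run of immediately preceding values v
    for which op(current, v) is True."""
    streak = pd.Series(0, index=ser.index)
    active = pd.Series(True, index=ser.index)  # rows whose streak is still growing
    s = 1
    while active.any():
        # Comparisons against the NaN values introduced by shift are False,
        # so every row becomes inactive once the shift reaches past the start.
        active &= op(ser, ser.shift(s))
        streak += active.astype(int)
        s += 1
    return streak

a = pd.Series([30, 10, 20, 25, 35, 15, 15])
res = pd.concat([a, shift_streak(a, operator.gt), shift_streak(a, operator.lt)], axis=1)
res.columns = ['value', 'greater_than_streak', 'less_than_streak']
print(res)
```

The loop body runs once per shift distance, so the number of passes is the longest streak plus one final pass that finds no active rows. Equal neighbours, such as the two trailing 15s, end a streak in both directions because the comparisons are strict.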