2017-04-26 43 views

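How to access multiple columns in the rolling operator?

I want to do some rolling-window calculations in pandas that need to process two columns at the same time. Here is a simple example that expresses the problem clearly: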

import pandas as pd 

df = pd.DataFrame({ 
    'x': [1, 2, 3, 2, 1, 5, 4, 6, 7, 9], 
    'y': [4, 3, 4, 6, 5, 9, 1, 3, 1, 2] 
}) 

windowSize = 4 
result = [] 

for i in range(1, len(df)+1): 
    if i < windowSize:
        result.append(None)
    else:
        x = df.x.iloc[i-windowSize:i]
        y = df.y.iloc[i-windowSize:i]
        m = y.mean()
        r = sum(x[y > m])/sum(x[y <= m])
        result.append(r)

print(result) 

Is there any way to solve this in pandas without a loop? Any help is appreciated.

Answers

1

Here is one vectorized approach using NumPy tools -

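import numpy as np

windowSize = 4
a = df.values
# Sliding windows over each column, one row per window
X = strided_app(a[:,0],windowSize,1)
Y = strided_app(a[:,1],windowSize,1)
M = Y.mean(1)                        # per-window mean of y
mask = Y>M[:,None]                   # True where y exceeds its window mean
sums = np.einsum('ij,ij->i',X,mask)  # sum of x where y > mean
rest_sums = X.sum(1) - sums          # sum of x where y <= mean
out = sums/rest_sums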

strided_app is taken from here.
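For readers without the linked post at hand, here is a minimal sketch of what a helper like strided_app could look like. It assumes a 1D NumPy array and builds a read-only 2D view with one window per row. The body is a reconstruction under that assumption, not necessarily the exact code from the linked post -

```python
import numpy as np

def strided_app(a, L, S):
    """Sketch: 2D sliding-window view of 1D array `a` (window length L, step S)."""
    nrows = (a.size - L) // S + 1   # number of complete windows
    n = a.strides[0]                # byte step between consecutive elements
    # No data is copied; each row is a window into the original buffer.
    return np.lib.stride_tricks.as_strided(
        a, shape=(nrows, L), strides=(S * n, n), writeable=False
    )

a = np.array([1, 2, 3, 2, 1, 5])
windows = strided_app(a, 4, 1)
print(windows)
```

Because the rows overlap in memory, the view is kept read-only here; take a copy before modifying it.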

Runtime test -

Approaches -

# @kazemakase's solution 
def rolling_window_sum(df, windowSize=4): 
    rw = rolling_window(df.values.T, windowSize) 
    m = np.mean(rw[1], axis=-1, keepdims=True) 
    a = np.sum(rw[0] * (rw[1] > m), axis=-1) 
    b = np.sum(rw[0] * (rw[1] <= m), axis=-1) 
    result = a/b 
    return result  

# Proposed in this post  
def strided_einsum(df, windowSize=4): 
    a = df.values 
    X = strided_app(a[:,0],windowSize,1) 
    Y = strided_app(a[:,1],windowSize,1) 
    M = Y.mean(1) 
    mask = Y>M[:,None] 
    sums = np.einsum('ij,ij->i',X,mask) 
    rest_sums = X.sum(1) - sums 
    out = sums/rest_sums 
    return out 

Timings -

In [46]: df = pd.DataFrame(np.random.randint(0,9,(1000000,2))) 

In [47]: %timeit rolling_window_sum(df) 
10 loops, best of 3: 90.4 ms per loop 

In [48]: %timeit strided_einsum(df) 
10 loops, best of 3: 62.2 ms per loop 

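To squeeze out more performance, we can compute the Y.mean(1) part, which is basically a windowed mean (a windowed sum divided by the window size), with SciPy's 1D uniform filter. Thus, for windowSize=4, M could alternatively be computed as -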

from scipy.ndimage import uniform_filter1d as unif1d

M = unif1d(a[:,1].astype(float),windowSize)[2:-1] 

The performance boost is noticeable -

In [65]: %timeit strided_einsum(df) 
10 loops, best of 3: 61.5 ms per loop 

In [66]: %timeit strided_einsum_unif_filter(df) 
10 loops, best of 3: 49.4 ms per loop 
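
strided_einsum_unif_filter is not listed above. A plausible sketch, assuming it is simply strided_einsum with M computed by the uniform filter, might look like this. The strided_app helper body and the general trimming slice are assumptions added for illustration; for windowSize=4 the slice reduces to [2:-1] -

```python
import numpy as np
import pandas as pd
from scipy.ndimage import uniform_filter1d

def strided_app(a, L, S):
    # Assumed helper: 2D sliding-window view of a 1D array
    nrows = (a.size - L) // S + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

def strided_einsum_unif_filter(df, windowSize=4):
    a = df.values
    X = strided_app(a[:, 0], windowSize, 1)
    Y = strided_app(a[:, 1], windowSize, 1)
    # The filter centres each window at offset windowSize // 2, so the
    # windows that lie fully inside the data start at that index.
    left = windowSize // 2
    nwin = a.shape[0] - windowSize + 1
    M = uniform_filter1d(a[:, 1].astype(float), windowSize)[left:left + nwin]
    mask = Y > M[:, None]
    sums = np.einsum('ij,ij->i', X, mask)   # sum of x where y > mean
    rest_sums = X.sum(1) - sums             # sum of x where y <= mean
    return sums / rest_sums

df = pd.DataFrame({
    'x': [1, 2, 3, 2, 1, 5, 4, 6, 7, 9],
    'y': [4, 3, 4, 6, 5, 9, 1, 3, 1, 2]
})
out = strided_einsum_unif_filter(df)
print(out)
```

The filter keeps a running mean, so with window sizes that are not a power of two its values may differ from Y.mean(1) by floating-point rounding, which could flip the comparison for values that sit exactly at the mean.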
2

You can use the rolling window trick for numpy arrays and apply it to the array underlying the DataFrame.

import pandas as pd 
import numpy as np 

def rolling_window(a, window): 
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window) 
    strides = a.strides + (a.strides[-1],) 
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides) 

df = pd.DataFrame({ 
    'x': [1, 2, 3, 2, 1, 5, 4, 6, 7, 9], 
    'y': [4, 3, 4, 6, 5, 9, 1, 3, 1, 2] 
}) 

windowSize = 4  

rw = rolling_window(df.values.T, windowSize) 
m = np.mean(rw[1], axis=-1, keepdims=True) 
a = np.sum(rw[0] * (rw[1] > m), axis=-1) 
b = np.sum(rw[0] * (rw[1] <= m), axis=-1) 
result = a/b 

The result lacks the leading None values, but they should be easy to prepend (either as np.nan or after converting the result to a list).
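
One way to add the leading missing values is sketched below, using np.nan padding so the result lines up with the rows of the DataFrame. The column name ratio is a hypothetical choice for illustration -

```python
import numpy as np
import pandas as pd

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

df = pd.DataFrame({
    'x': [1, 2, 3, 2, 1, 5, 4, 6, 7, 9],
    'y': [4, 3, 4, 6, 5, 9, 1, 3, 1, 2]
})
windowSize = 4

rw = rolling_window(df.values.T, windowSize)
m = np.mean(rw[1], axis=-1, keepdims=True)
a = np.sum(rw[0] * (rw[1] > m), axis=-1)
b = np.sum(rw[0] * (rw[1] <= m), axis=-1)
result = a / b

# The first windowSize - 1 rows have no complete window, so pad with NaN
padded = np.concatenate([np.full(windowSize - 1, np.nan), result])
df['ratio'] = padded   # 'ratio' is an assumed column name
print(df)
```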

This may not be what you were hoping for, since you are working with pandas, but it gets the job done without loops.