Vectorizing a forward-looking function on a pandas DataFrame

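I would like to do a "strange" computation on a DataFrame in pandas (it can be thought of as just a Series). The DataFrame must be treated as a time series or similar (the order of the elements matters).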

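Given a value at index [i] (value[i])

Given a step (e.g. 1) [integer or real number]

Given a multiplier rr (e.g. 2) [integer or real number]

Look forward at the elements [i:] and assign to value[i] a "class":

1 if the subsequent values reach the level value[i] + step * rr BEFORE reaching the level value[i] - step

-1 if the subsequent values reach the level value[i] - step * rr BEFORE reaching value[i] + step

0 in every other case (i.e. when the subsequent values touch value[i] - step and then value[i] + step, or vice versa).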

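I know it sounds crazy. Imagine a random walk with +1/-1 steps. A sequence like:

0, 1, 2 would be assigned to class 1 (it could also be 0, 1, 0, 0, 1, 1, 0, 1, 1, 2)

0, -1, -2 would be assigned to class -1 (it could also be 0, -1, 0, 0, 0, -1, -1, -1, -2)

0, +1, 0, -1 or 0, -1, 0, 0, -1, 0, 1 etc. would be class 0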

I have solved it the "classic" (perhaps not so Pythonic) way, by defining a function:

    import numpy as np
    import pandas as pd

    def FindClass(inarr, i=0, step=0.001, rr=2):
        j = 0
        direction = None
        foundClass = None
        # Phase 1: find which side of the band value[i] +/- step is touched first.
        while i+j < len(inarr) - 1:
            j += 1
            if inarr[i+j] >= inarr[i] + step:
                direction = 1
                break
            if inarr[i+j] <= inarr[i] - step:
                direction = -1
                break
        # Phase 2: keep walking until either the target level (step * rr in the
        # same direction) or the opposite side of the band is reached.
        while i+j < len(inarr)-1:
            j += 1
            if direction == 1 and inarr[i+j] >= inarr[i] + (step * rr):
                foundClass = 1
                break
            elif direction == 1 and inarr[i+j] <= inarr[i] - step:
                foundClass = 0
                break
            elif direction == -1 and inarr[i+j] <= inarr[i] - (step * rr):
                foundClass = -1
                break
            elif direction == -1 and inarr[i+j] >= inarr[i] + step:
                foundClass = 0
                break
        # Neither level was reached before the end of the series.
        if foundClass is None:
            foundClass = np.nan
        return foundClass

and then iterating over it:

    if __name__ == "__main__":
        # Random walk with steps drawn from {-1, 0, 1}.
        steps = np.random.randint(-1, 2, size=10000)
        randomwalk = steps.cumsum(0)
        rc = pd.DataFrame({'rw': randomwalk, 'result': np.nan})
        for c in range(0, len(rc)-1):
            # .loc avoids chained assignment (rc.result[c] = ...).
            rc.loc[c, 'result'] = FindClass(rc.rw, i=c, step=1)
        print(rc)

On my laptop (running Python 2.7), the profile I get is not "too" bad for a 10000-element series:

    python -m cProfile -s cumulative fbmk.py
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 10000 entries, 0 to 9999
    Data columns (total 2 columns):
    result    9996 non-null values
    rw        10000 non-null values
    dtypes: float64(1), int32(1)
             932265 function calls (929764 primitive calls) in 2.643 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.106    0.106    2.646    2.646 fbmk.py:1(<module>)
         9999    0.549    0.000    1.226    0.000 fbmk.py:4(FindClass)
       158062    0.222    0.000    0.665    0.000 series.py:616(__getitem__)
            2    0.029    0.014    0.561    0.281 __init__.py:3(<module>)
       158062    0.226    0.000    0.443    0.000 index.py:718(get_value)
        19998    0.070    0.000    0.442    0.000 frame.py:2082(__getattr__)
        19998    0.111    0.000    0.331    0.000 frame.py:1986(__getitem__)

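The question is:

Does anyone see a possibility of vectorizing this function in pandas/numpy in a way that improves performance?

If this is feasible with less effort in R, that would be good too!

Thanks in advance!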

Source

2014-04-25 user3562348

This isn't vectorization, but maybe you could write the function `FindClass` in Cython? – joris

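Yes, of course that is a possibility. The issue here is mainly that this is a task repeated row by row, and people usually say that with pandas and similar tools you have to "think vectorized" and avoid loops... I tried, but didn't manage! – user3562348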

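Apart from the idea in my answer, I think you would gain substantial speed by rewriting the function to make better use of the conditions. You put long conditions inside a while loop, but your logic lets you rule out many of the options most of the time. That would mean less code is executed, and might gain a factor of 2-4 in execution time. –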
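A minimal sketch of what such a restructuring might look like, assuming the input contains no NaNs and step > 0. The name `find_class_branching` is hypothetical, and converting the input to a NumPy array up front is my own assumption (the profile in the question shows many calls to `Series.__getitem__`). The direction is decided once, so the second phase only needs two comparisons per element. No timing was measured for this sketch.

```python
import numpy as np


def find_class_branching(values, i=0, step=0.001, rr=2):
    """Classify values[i] as 1, -1, 0 or NaN, deciding the direction only once."""
    values = np.asarray(values)
    base = values[i]
    upper, lower = base + step, base - step
    n = len(values)

    # Phase 1: skip ahead to the first element that leaves the open band (lower, upper).
    j = i + 1
    while j < n and lower < values[j] < upper:
        j += 1
    if j >= n:
        return np.nan  # the series never left the band

    # Phase 2: the direction is known, so each branch checks only its own two levels.
    if values[j] >= upper:
        target = base + step * rr
        for v in values[j + 1:]:
            if v >= target:
                return 1
            if v <= lower:
                return 0
    else:
        target = base - step * rr
        for v in values[j + 1:]:
            if v <= target:
                return -1
            if v >= upper:
                return 0
    return np.nan  # the series ended before either level was reached
```

It can be called per index in the same way as the original function, for example `find_class_branching(rc.rw.values, i=c, step=1)`.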

Depending on the properties of the problem, you can use np.where to find where the levels are crossed and classify the time series.

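A big drawback here is that np.where gives you all the indices where the time series is above value[i] + step, etc., which may turn a linear-time algorithm into a quadratic-time one. Depending on the size of the problems you will be dealing with, I expect you will gain a lot in the prefactor; you may even come out ahead with the quadratic-time numpy solution.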
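One way the np.where idea might be sketched for a single index is shown here. The name `find_class_where` and the two-stage layout are assumptions of mine, not code from the answer, and the sketch assumes step > 0 and no NaNs. Each call scans the whole remaining series, so applying it to every index is quadratic overall, as noted above.

```python
import numpy as np


def find_class_where(values, i=0, step=0.001, rr=2):
    """Classify values[i] as 1, -1, 0 or NaN using np.where on the future values."""
    values = np.asarray(values)
    base = values[i]
    future = values[i + 1:]

    # All positions (relative to i + 1) where the band around base is left.
    exits = np.where((future >= base + step) | (future <= base - step))[0]
    if exits.size == 0:
        return np.nan
    first = exits[0]
    rest = future[first + 1:]

    if future[first] >= base + step:
        # Went up first: look for the upper target or a fall back to base - step.
        hits = np.where((rest >= base + step * rr) | (rest <= base - step))[0]
        if hits.size == 0:
            return np.nan
        return 1 if rest[hits[0]] >= base + step * rr else 0

    # Went down first: look for the lower target or a rise back to base + step.
    hits = np.where((rest <= base - step * rr) | (rest >= base + step))[0]
    if hits.size == 0:
        return np.nan
    return -1 if rest[hits[0]] <= base - step * rr else 0
```

For instance, `find_class_where([0, 1, 0, -1], step=1)` follows the 0, +1, 0, -1 pattern from the question and falls into class 0.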

From looking around, a "find first index" equivalent of np.where is a requested feature.
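In the absence of such a function, a common workaround is np.argmax on a boolean mask, since argmax returns the first occurrence of the maximum. The helper below is a hypothetical sketch (`first_index` is not a NumPy function). Note that the mask itself is still computed over the whole slice, so this does not remove the quadratic cost described above.

```python
import numpy as np


def first_index(mask):
    """Return the index of the first True in a boolean array, or -1 if there is none."""
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:
        return -1  # np.argmax raises on empty input, so handle it explicitly
    idx = int(np.argmax(mask))  # first maximum, i.e. the first True if any exists
    # argmax returns 0 for an all-False mask, so confirm the hit is real.
    return idx if mask[idx] else -1
```

A possible use, with `walk`, `i` and `step` as placeholder names: `first_index(walk[i + 1:] >= walk[i] + step)` gives the offset of the first element at or above the upper level, or -1 if the level is never reached.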

Source

2014-05-05 21:26:08
