意想不到的慢的cython卷積代碼

我需要快速計算一個矩陣，其條目是通過用每個行的向量卷積濾波器獲得的，對結果向量的條目進行二次取樣，然後將結果的點積與另一個向量。具體而言，我想計算意想不到的慢的cython卷積代碼

M = [CONV（e_j，F）* P_I * V_I] _ {I，J}，

其中i從1變化到n和j從1到m而變化。這裏e_j是大小爲n的指標（行）向量，其中只有列j中的一個，f是長度爲s的濾波器，P_i是（n + s-1）個由k個矩陣，卷積，並且v_i是長度爲k的列向量。對於n = 6，m = 7，s = 3，則需要O（n * s）個操作來計算M的每個條目，因此O（n * s * n * m）我的電腦核心（8GLOPs）應該能夠在0.94微秒內計算出M值。然而，我的非常簡單的cython實現，在example given in the Cython documentation之後，需要超過2毫秒才能用這些參數計算一個例子。這差不多有4個數量級！

這是一個使用Cython實現和測試代碼的shar文件。將其複製並粘貼到一個文件中，然後在乾淨的目錄中運行'bash <fname>'以獲取代碼，然後運行'bash ./test.sh'以查看糟糕的性能。

cat > fastcalcM.pyx <<'EOF' 

import numpy as np 
cimport numpy as np 
cimport cython 
from scipy.signal import convolve 

DTYPE=np.float32 
ctypedef np.float32_t DTYPE_t 

@cython.boundscheck(False) 
def calcM(np.ndarray[DTYPE_t, ndim=1, negative_indices=False] filtertaps, int 
     n, int m, np.ndarray[np.int_t, ndim=2, negative_indices=False] 
     keep_indices, np.ndarray[DTYPE_t, ndim=2, negative_indices=False] V): 
    """ Computes a numrows-by-k matrix M whose entries satisfy 
     M_{i,k} = [conv(e_j, f)^T * P_i * v_i], 
     where v_i^T is the i-th row of V, and P_i samples the entries from 
     conv(e_j, f)^T indicated by the ith row of the keep_indices matrix """ 

    cdef int k = keep_indices.shape[1] 

    cdef np.ndarray M = np.zeros((n, m), dtype=DTYPE) 
    cdef np.ndarray ej = np.zeros((m,), dtype=DTYPE) 
    cdef np.ndarray convolution 
    cdef int rowidx, colidx, kidx 

    for rowidx in range(n): 
     for colidx in range(m): 
      ej[colidx] = 1 
      convolution = convolve(ej, filtertaps, mode='full') 
      for kidx in range(k): 
       M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx] 
      ej[colidx] = 0 

    return M 

EOF 
#----------------------------------------------------------------------------- 
cat > test_calcM.py << 'EOF' 

import numpy as np 
from fastcalcM import calcM 

filtertaps = np.array([-1, 2, -1]).astype(np.float32) 
n, m = 6, 7 
keep_indices = np.array([[1, 3], 
         [4, 5], 
         [2, 2], 
         [5, 5], 
         [3, 4], 
         [4, 5]]).astype(np.int) 
V = np.random.random_integers(-5, 5, size=(6, 2)).astype(np.float32) 

print calcM(filtertaps, n, m, keep_indices, V) 

EOF 
#----------------------------------------------------------------------------- 
cat > test.sh << 'EOF' 

python setup.py build_ext --inplace 
echo -e "%run test_calcM\n%timeit calcM(filtertaps, n, m, keep_indices, V)" > script.ipy 
ipython script.ipy 

EOF 
#----------------------------------------------------------------------------- 
cat > setup.py << 'EOF' 

from distutils.core import setup 
from Cython.Build import cythonize 
import numpy 

setup(
    name="Fast convolutions", 
    include_dirs = [numpy.get_include()], 
    ext_modules = cythonize("fastcalcM.pyx") 
) 

EOF

我想也許調用SciPy的的卷積可能是罪魁禍首（我不能肯定，用Cython和SciPy的發揮得很好），所以我實現我自己的卷積碼ALA用Cython文件中相同的例子，但這導致整體代碼慢10倍左右。

關於如何接近理論上可能的速度的任何想法，或差異如此之大的原因？

來源

2014-05-14 AatG

首先，打字M,eg和convolution不允許快速索引。實際上，你所做的打字並不是特別有用。

但沒關係，因爲你有兩個開銷。首先是在Cython和Python類型之間進行轉換。如果你想將它們傳遞給Python，你應該保留無類型的數組，以防止轉換。僅僅因爲這個原因（1ms→0.65μs）將它移到Python上就會加快速度。

然後我異形它：

Line #  Hits   Time Per Hit % Time Line Contents 
============================================================== 
    15           def calcM(filtertaps, n, m, keep_indices, V): 
    16  4111   3615  0.9  0.1  k = keep_indices.shape[1] 
    17  4111   8024  2.0  0.1  M = np.zeros((n, m), dtype=np.float32) 
    18  4111   6090  1.5  0.1  ej = np.zeros((m,), dtype=np.float32) 
    19           
    20  28777  18690  0.6  0.3  for rowidx in range(n): 
    21 197328  123284  0.6  2.2   for colidx in range(m): 
    22 172662  112348  0.7  2.0    ej[colidx] = 1 
    23 172662  4076225  23.6  73.6    convolution = convolve(ej, filtertaps, mode='full') 
    24 517986  395513  0.8  7.1    for kidx in range(k): 
    25 345324  668309  1.9  12.1     M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx] 
    26 172662  120271  0.7  2.2    ej[colidx] = 0 
    27           
    28  4111   2374  0.6  0.0  return M

在您考慮任何其他，處理convolve。

爲什麼convolve變慢？那麼，它有很多開銷。 numpy/scipy通常會;對於大型數據集最好。如果你知道陣列的大小會保持很小，只需在Cython中重新實現convolve即可。

哦，嘗試使用緩衝區語法。對於2D陣列使用DTYPE[:, :]，對於1D陣列使用DTYPE[:]，等等。它是memoryview協議，它更好。有些情況下，它有更多的開銷，但這些通常可以解決，而且在大多數其他方面都更好。

編輯：

你可以嘗試（遞歸）內聯函數scipy：

import numpy as np 
from scipy.signal.sigtools import _correlateND 

def calcM(filtertaps, n, m, keep_indices, V): 
    k = keep_indices.shape[1] 
    M = np.zeros((n, m), dtype=np.float32) 
    ej = np.zeros((m,), dtype=np.float32) 

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape) 
    sliced_filtertaps_view = filtertaps[slice_obj] 

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1 
    in1zpadded = np.zeros(ps, ej.dtype) 
    out = np.empty(ps, ej.dtype) 

    for rowidx in range(n): 
     for colidx in range(m): 
      in1zpadded[colidx] = 1 

      convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2) 

      for kidx in range(k): 
       M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx] 

      in1zpadded[colidx] = 0 

    return M

注意，這裏使用私有的實現細節。

這是針對特定尺寸量身定製的，所以我不知道它是否適用於您的實際數據。但它消除了絕大部分的開銷。然後，您可以通過再次鍵入事情改善這一點：

import numpy as np 
cimport numpy as np 
from scipy.signal.sigtools import _correlateND 

DTYPE=np.float32 
ctypedef np.float32_t DTYPE_t 

def calcM(filtertaps, int n, int m, np.int_t[:, :] t_keep_indices, DTYPE_t[:, :] t_V): 
    cdef int rowidx, colidx, kidx, k 
    cdef DTYPE_t[:, :] t_M 
    cdef DTYPE_t[:] t_in1zpadded, t_convolution 

    k = t_keep_indices.shape[1] 
    t_M = M = np.zeros((n, m), dtype=np.float32) 
    ej = np.zeros((m,), dtype=np.float32) 

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape) 
    sliced_filtertaps_view = filtertaps[slice_obj] 

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1 
    t_in1zpadded = in1zpadded = np.zeros(ps, ej.dtype) 
    out = np.empty(ps, ej.dtype) 

    for rowidx in range(n): 
     for colidx in range(m): 
      t_in1zpadded[colidx] = 1 

      t_convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2) 

      for kidx in range(k): 
       t_M[rowidx, colidx] += t_convolution[<int>t_keep_indices[rowidx, kidx]] * t_V[rowidx, kidx] 

      t_in1zpadded[colidx] = 0 

    return M

這是10倍以上的速度，但並不如你烏托邦式的夢想的天空估計。然後，再次，計算是有點虛假開始;）。

來源

2014-05-14 02:06:16 Veedrac

感謝您的細節！我不明白關於轉換類型或內存視圖語法的一點，但是一般來說，這聽起來很重要且有用;我會查找它們，實施你的建議，並可能最終接受這個答案。順便說一句，你會如何改善時間估計？任何指導方針都會有很大的幫助：我試圖養成一個習慣，弄清楚我的代碼和最佳可能之間的差距。我能想到的唯一被省略的因素是內存訪問成本（這可能會僞造我的估計）。 – AatG

你通常不能只是猜測那樣的時間。首先，「O（nm）」與「nm」不同。另外，操作速度也有很大的不同，而且大量的開銷是不可避免的。我的猜測是'_correlateND'交易速度的準確性。你總是可以嘗試自己實現它（或從頭開始）。 – Veedrac

意想不到的慢的cython卷積代碼

回答

相關問題