numba guvectorize target='parallel' slower than target='cpu'

2016-02-11

I've been trying to optimize a piece of Python code that involves calculations on large multidimensional arrays, and I'm getting counterintuitive results with Numba. I'm running on a mid-2015 MBP, 2.5 GHz quad-core i7, OS X 10.10.5, Python 2.7.11. Consider the following:

import numpy as np
from numba import jit, vectorize, guvectorize
import numexpr as ne
import timeit

# Pure-Python baseline: explicit element-by-element loop.
def add_two_2ds_naive(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

# Same loop, JIT-compiled by Numba.
@jit
def add_two_2ds_jit(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

# gufunc with a 2D core signature, single-threaded target.
@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
             '(n,m),(n,m)->(n,m)', target='cpu')
def add_two_2ds_cpu(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

# Identical gufunc, multithreaded target.
@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
             '(n,m),(n,m)->(n,m)', target='parallel')
def add_two_2ds_parallel(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

# numexpr's multithreaded VM; write into res via out= so the result
# is actually stored (plain assignment would only rebind the local name).
def add_two_2ds_numexpr(A, B, res):
    ne.evaluate('A+B', out=res)

if __name__ == "__main__":
    np.random.seed(69)
    A = np.random.rand(10000, 100)
    B = np.random.rand(10000, 100)
    res = np.zeros((10000, 100))

I can now run timeit on the various functions:

%timeit add_two_2ds_jit(A,B,res) 
1000 loops, best of 3: 1.16 ms per loop 

%timeit add_two_2ds_cpu(A,B,res) 
1000 loops, best of 3: 1.19 ms per loop 

%timeit add_two_2ds_parallel(A,B,res) 
100 loops, best of 3: 6.9 ms per loop 

%timeit add_two_2ds_numexpr(A,B,res) 
1000 loops, best of 3: 1.62 ms per loop 

It seems that 'parallel' is not even using most of a single core: watching top, Python hits ~40% CPU for 'parallel', ~100% for 'cpu', and ~300% for numexpr.

But the point of 'guvectorize' is that the operation you define is applied over any _extra_ dimensions (and that is what gets done in parallel). The code you've written doesn't parallelize itself. So if you changed 'A', 'B', and 'res' to have shape '(10000,100,100)', the 100 different iterations over the third dimension would run in parallel. – DavidW

Thanks, I see that I misunderstood the usage. –
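
To make the comment above concrete, here is a minimal sketch (my own illustration, not code from the thread; the name add_rows_parallel is made up): shrinking the core signature to 1D turns the leading axis into a broadcast dimension that the 'parallel' target can split across threads.

import numpy as np
from numba import guvectorize

# Core signature is 1D, so for 2D inputs the leading 10000-row axis
# becomes a broadcast dimension: one kernel call per row, and the
# 'parallel' target distributes those calls across threads.
@guvectorize(['float64[:],float64[:],float64[:]'],
             '(n),(n)->(n)', target='parallel')
def add_rows_parallel(a, b, res):
    for j in range(a.shape[0]):
        res[j] = a[j] + b[j]

A = np.random.rand(10000, 100)
B = np.random.rand(10000, 100)
out = add_rows_parallel(A, B)  # out has shape (10000, 100)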

Answer

There are two problems with your @guvectorize implementation. The first is that you are doing all the looping inside the @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize over the broadcast dimensions of a ufunc/gufunc. Since the signature of your gufunc is 2D and your inputs are 2D, there is only a single call to the inner function, which explains the 100% CPU usage you saw.

The best way to write the function you have above is to use a regular ufunc:

# One scalar operation; Numba broadcasts and parallelizes over elements.
@vectorize(['float64(float64, float64)'], target='parallel')
def add_ufunc(a, b):
    return a + b
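
(Numba-compiled ufuncs are true NumPy ufuncs, so, as with np.add, a preallocated output array can be passed as an extra positional argument; that is how add_ufunc(A, B, res) is invoked in the timings below.)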

Then, on my system, I see these speeds:

%timeit add_two_2ds_jit(A,B,res) 
1000 loops, best of 3: 1.87 ms per loop 

%timeit add_two_2ds_cpu(A,B,res) 
1000 loops, best of 3: 1.81 ms per loop 

%timeit add_two_2ds_parallel(A,B,res) 
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.43 ms per loop 

%timeit add_two_2ds_numexpr(A,B,res) 
100 loops, best of 3: 2.79 ms per loop 

%timeit add_ufunc(A, B, res) 
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 2.03 ms per loop 

(This is on a very similar OS X system to yours, but with OS X 10.11.)

Although the parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect the bottleneck is memory (or cache) bandwidth, but I haven't done the measurements to check that.

In general, you will see more benefit from the parallel ufunc target if you are doing more math operations per memory element (say, a cosine).
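
To illustrate that last point, here is a rough sketch (my own example, not from the answer; heavy_op is a made-up name): a kernel that does several transcendental operations per element, where threading overhead is easier to amortize and the parallel target has real work to distribute.

import math
from numba import vectorize

# Several math ops per element, so computation (not memory bandwidth)
# dominates and extra cores can actually help.
@vectorize(['float64(float64, float64)'], target='parallel')
def heavy_op(a, b):
    return math.cos(a) * math.sin(b) + math.exp(-a * b)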