numba guvectorize target='parallel' slower than target='cpu'

2016-02-11

I've been trying to optimize a piece of Python code that involves calculations on large multidimensional arrays, and I'm getting counterintuitive results with Numba. I'm running on a mid-2015 MBP, 2.5 GHz quad-core i7, OS X 10.10.5, Python 2.7.11. Consider the following:

import numpy as np
from numba import jit, vectorize, guvectorize
import numexpr as ne
import timeit

# Pure-Python baseline: explicit element-by-element loop.
def add_two_2ds_naive(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

# Same loop, JIT-compiled by Numba.
@jit
def add_two_2ds_jit(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

# gufunc with a 2D core signature, single-threaded target.
@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
             '(n,m),(n,m)->(n,m)', target='cpu')
def add_two_2ds_cpu(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

# Identical gufunc, multithreaded target.
@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
             '(n,m),(n,m)->(n,m)', target='parallel')
def add_two_2ds_parallel(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

# numexpr's multithreaded VM; write into res via out= so the result
# is actually stored (plain assignment would only rebind the local name).
def add_two_2ds_numexpr(A, B, res):
    ne.evaluate('A+B', out=res)

if __name__ == "__main__":
    np.random.seed(69)
    A = np.random.rand(10000, 100)
    B = np.random.rand(10000, 100)
    res = np.zeros((10000, 100))

I can now run timeit on the various functions:

%timeit add_two_2ds_jit(A,B,res) 
1000 loops, best of 3: 1.16 ms per loop 

%timeit add_two_2ds_cpu(A,B,res) 
1000 loops, best of 3: 1.19 ms per loop 

%timeit add_two_2ds_parallel(A,B,res) 
100 loops, best of 3: 6.9 ms per loop 

%timeit add_two_2ds_numexpr(A,B,res) 
1000 loops, best of 3: 1.62 ms per loop 

It seems that 'parallel' is not even using most of a single core: watching top, Python hits ~40% CPU for 'parallel', ~100% for 'cpu', and ~300% for numexpr.

But the point of 'guvectorize' is that the operation you define is applied over any _extra_ dimensions (and that is what gets done in parallel). The code you've written doesn't parallelize itself. So if you changed 'A', 'B', and 'res' to have shape '(10000,100,100)', the 100 different iterations over the third dimension would run in parallel. – DavidW

Thanks, I see that I misunderstood the usage. –
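
To make the comment above concrete, here is a minimal sketch (my own illustration, not code from the thread; the name add_rows_parallel is made up): shrinking the core signature to 1D turns the leading axis into a broadcast dimension that the 'parallel' target can split across threads.

import numpy as np
from numba import guvectorize

# Core signature is 1D, so for 2D inputs the leading 10000-row axis
# becomes a broadcast dimension: one kernel call per row, and the
# 'parallel' target distributes those calls across threads.
@guvectorize(['float64[:],float64[:],float64[:]'],
             '(n),(n)->(n)', target='parallel')
def add_rows_parallel(a, b, res):
    for j in range(a.shape[0]):
        res[j] = a[j] + b[j]

A = np.random.rand(10000, 100)
B = np.random.rand(10000, 100)
out = add_rows_parallel(A, B)  # out has shape (10000, 100)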

Answer

There are two problems with your @guvectorize implementation. The first is that you are doing all the looping inside the @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize over the broadcast dimensions of a ufunc/gufunc. Since the signature of your gufunc is 2D and your inputs are 2D, there is only a single call to the inner function, which explains the 100% CPU usage you saw.

The best way to write the function you have above is to use a regular ufunc:

# One scalar operation; Numba broadcasts and parallelizes over elements.
@vectorize(['float64(float64, float64)'], target='parallel')
def add_ufunc(a, b):
    return a + b
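
(Numba-compiled ufuncs are true NumPy ufuncs, so, as with np.add, a preallocated output array can be passed as an extra positional argument; that is how add_ufunc(A, B, res) is invoked in the timings below.)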

Then, on my system, I see these speeds:

%timeit add_two_2ds_jit(A,B,res) 
1000 loops, best of 3: 1.87 ms per loop 

%timeit add_two_2ds_cpu(A,B,res) 
1000 loops, best of 3: 1.81 ms per loop 

%timeit add_two_2ds_parallel(A,B,res) 
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.43 ms per loop 

%timeit add_two_2ds_numexpr(A,B,res) 
100 loops, best of 3: 2.79 ms per loop 

%timeit add_ufunc(A, B, res) 
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 2.03 ms per loop 

(This is on a very similar OS X system to yours, but with OS X 10.11.)

Although the parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect the bottleneck is memory (or cache) bandwidth, but I haven't done the measurements to check that.

In general, you will see more benefit from the parallel ufunc target if you are doing more math operations per memory element (say, a cosine).
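
To illustrate that last point, here is a rough sketch (my own example, not from the answer; heavy_op is a made-up name): a kernel that does several transcendental operations per element, where threading overhead is easier to amortize and the parallel target has real work to distribute.

import math
from numba import vectorize

# Several math ops per element, so computation (not memory bandwidth)
# dominates and extra cores can actually help.
@vectorize(['float64(float64, float64)'], target='parallel')
def heavy_op(a, b):
    return math.cos(a) * math.sin(b) + math.exp(-a * b)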