Numba function is slower than C++, and reordering the loops slows it down a further 10x

The following code simulates extracting binary words from different locations within a set of images.

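The Numba-wrapped function in the code below, wordcalc, has two problems:

1. It is 3 times slower than a similar implementation in C++.
2. Most strangely, if you switch the order of the "ibase" and "ibit" for loops, the speed drops by a factor of 10 (!). This does not happen in the C++ implementation, which is unaffected.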

I'm using Numba 0.18.2 from WinPython 2.7.

What could be causing this?

import numpy as np
import numba as nb

imDim = 80
numInsts = 10**4
numInstsSub = 10**4 // 4  # integer division; not used below
bitsNum = 13

Xs = np.random.rand(numInsts, imDim**2)  # one flattened image per row
iInstInds = np.array(range(numInsts)[::4])  # every 4th instance
baseInds = np.arange(imDim**2 - imDim*20 + 1)
ofst1 = np.random.randint(0, imDim*20, bitsNum)
ofst2 = np.random.randint(0, imDim*20, bitsNum)

@nb.jit(nopython=True)
def wordcalc(Xs, iInstInds, baseInds, ofst, bitsNum, newXz):
    count = 0
    for i in iInstInds:
        Xi = Xs[i]
        for ibit in range(bitsNum):
            for ibase in range(baseInds.shape[0]):
                # Compare two offset samples and store the result in bit `ibit`.
                u = Xi[baseInds[ibase] + ofst[0, ibit]] > Xi[baseInds[ibase] + ofst[1, ibit]]
                newXz[count, ibase] = newXz[count, ibase] | np.uint16(u * (2**ibit))
        count += 1
    return newXz

ret = wordcalc(Xs, iInstInds, baseInds, np.array([ofst1, ofst2]), bitsNum, np.zeros((iInstInds.size, baseInds.size), dtype=np.uint16))
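
As a cross-check on the output (not on speed), the same words can be computed with vectorized NumPy. The sketch below is an assumption about the intended result, not code from the question: `wordcalc_reference` is a hypothetical helper name, and the sizes are shrunk so that it runs instantly.

```python
import numpy as np

def wordcalc_reference(Xs, iInstInds, baseInds, ofst, bitsNum):
    # Hypothetical NumPy reference: one uint16 word per (instance, base index).
    out = np.zeros((len(iInstInds), len(baseInds)), dtype=np.uint16)
    rows = Xs[iInstInds]  # selected instances, one flattened image per row
    for ibit in range(bitsNum):
        a = rows[:, baseInds + ofst[0, ibit]]
        b = rows[:, baseInds + ofst[1, ibit]]
        # Set bit `ibit` wherever the first sample exceeds the second.
        out |= (a > b).astype(np.uint16) << np.uint16(ibit)
    return out

# Small, seeded problem (sizes are illustrative, not the ones from the question).
rng = np.random.RandomState(0)
imDim, numInsts, bitsNum, span = 8, 16, 13, 16
Xs = rng.rand(numInsts, imDim**2)
iInstInds = np.arange(0, numInsts, 4)
baseInds = np.arange(imDim**2 - span + 1)
ofst = np.array([rng.randint(0, span, bitsNum), rng.randint(0, span, bitsNum)])
words = wordcalc_reference(Xs, iInstInds, baseInds, ofst, bitsNum)
print(words.shape, words.dtype)
```

Comparing `words` against the array returned by the jitted function on the same inputs would show whether a change to the inner expression altered the result.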

Source

2015-06-19 Leo

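I think the performance difference when switching the loop order is related to cache memory.

@LakshayGarg I thought the same, but the C++ implementation is not sensitive to this at all. – Leo

Highly unlikely, but perhaps the compiler is smart enough to optimize this for you. Which compiler are you using?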

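I got a 4x speedup by changing np.uint16(u * (2**ibit)) to np.uint16(u << ibit); that is, replacing the power of 2 with a bit shift, which should be equivalent (for integers).

It seems fairly likely that your C++ compiler makes this substitution itself.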
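
A quick way to check the equivalence, in plain Python rather than inside the jitted function: for a 0/1 comparison result and the bit positions used here, multiplying by a power of two and shifting left give the same integer. The loop bound below simply mirrors `bitsNum = 13` from the question.

```python
bitsNum = 13  # same number of bits as in the question

for ibit in range(bitsNum):
    for u in (0, 1):  # a comparison result viewed as an integer
        assert u * (2 ** ibit) == u << ibit

# The shift is only defined for integers; a float operand is rejected.
try:
    1.0 << 3
    shift_on_float_ok = True
except TypeError:
    shift_on_float_ok = False
print(shift_on_float_ok)
```

This says nothing about speed; it only shows that the two expressions agree on the values involved.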

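Swapping the order of the two loops makes only a small difference for me, both for your original version (5%) and for my optimized version (15%), so I don't think I can comment usefully on that.

If you really want to compare Numba and C++, you can look at the compiled Numba function by setting os.environ['NUMBA_DUMP_ASSEMBLY']='1' before importing Numba. (That is clearly quite involved, though.)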
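
A minimal sketch of that, assuming Numba is installed. `NUMBA_DUMP_ASSEMBLY` is the environment variable named above; `shift_word` is a made-up toy function standing in for `wordcalc`, and the `try`/`except` is only there so the snippet still runs where Numba is missing. The assembly is printed when the function is first compiled, that is, on its first call.

```python
import os

# Must be set before Numba is imported for the setting to take effect.
os.environ['NUMBA_DUMP_ASSEMBLY'] = '1'

try:
    import numba as nb
except ImportError:  # lets the sketch run without Numba installed
    nb = None

def shift_word(u, ibit):
    # Toy stand-in for the inner expression of wordcalc (hypothetical).
    return u << ibit

if nb is not None:
    shift_word = nb.jit(nopython=True)(shift_word)

result = shift_word(1, 3)  # first call compiles; Numba dumps the assembly to stdout
```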

For reference, I'm using Numba 0.19.1.

Source

2015-06-21 09:00:25 DavidW

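Thanks, this narrowed the gap significantly. Now C++ is only 30% faster. Note: np.uint16(u << ibit) actually has a bug. Numba apparently treats "u" as int8, so, for example, u << 10 always results in 0, which makes the function behave differently from the pure Python version. I had to fix it with: np.uint16(u + 0) << ibit – Leo

@DavidW, @Leo: Is this bug a known issue? If you could post an issue on GitHub, that would be great. In any case, I think the answer should include 'np.uint16(u + 0) << ibit' – cd98

Sorry, I found this in [issue 1241](https://github.com/numba/numba/issues/1241) – cd98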
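
The truncation described in these comments can be imitated outside Numba. The sketch below uses plain NumPy with explicit dtypes to stand in for an 8-bit `u`; it illustrates why widening before the shift matters and is not a reproduction of Numba's type inference (the int8 typing is Leo's observation, taken here as an assumption).

```python
import numpy as np

u8 = np.array([1], dtype=np.int8)  # stand-in for a comparison result typed as int8

# Shift performed in 8 bits: the set bit falls off the top of the type.
narrow = u8 << np.int8(10)

# Widen to 16 bits first, then shift: the bit survives.
wide = u8.astype(np.uint16) << np.uint16(10)

print(narrow.dtype, narrow[0])
print(wide.dtype, wide[0])
```

The second form corresponds to the `np.uint16(u + 0) << ibit` workaround: the value is converted to the wider type before any bits are moved.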
