
Convert function to NumbaPro CUDA

I have compared several Python modules/extensions and approaches for implementing the following function:

import numpy as np 

def fdtd(input_grid, steps): 
    grid = input_grid.copy() 
    old_grid = np.zeros_like(input_grid) 
    previous_grid = np.zeros_like(input_grid) 

    l_x = grid.shape[0] 
    l_y = grid.shape[1] 

    for i in range(steps): 
        # rotate the time levels: grid -> old_grid -> previous_grid 
        np.copyto(previous_grid, old_grid) 
        np.copyto(old_grid, grid) 

        for x in range(l_x): 
            for y in range(l_y): 
                # sum the contributions of the neighbouring cells from the previous step 
                grid[x,y] = 0.0 
                if 0 < x+1 < l_x: 
                    grid[x,y] += old_grid[x+1,y] 
                if 0 < x-1 < l_x: 
                    grid[x,y] += old_grid[x-1,y] 
                if 0 < y+1 < l_y: 
                    grid[x,y] += old_grid[x,y+1] 
                if 0 < y-1 < l_y: 
                    grid[x,y] += old_grid[x,y-1] 

                grid[x,y] /= 2.0 
                grid[x,y] -= previous_grid[x,y] 

    return grid 

This function is a very basic implementation of the finite-difference time-domain (FDTD) method. I have implemented this function in several ways:

  • more idiomatic, vectorized NumPy (sketched below)
  • Numba's (auto)JIT
  • a Cython routine
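
For reference, the vectorized NumPy variant looks roughly like this (a sketch using the hypothetical name fdtd_numpy; it reproduces the same neighbour conditions as the loops above, but is not necessarily identical to the variant I actually benchmarked):

import numpy as np 

def fdtd_numpy(input_grid, steps): 
    grid = input_grid.copy() 
    old_grid = np.zeros_like(input_grid) 
    previous_grid = np.zeros_like(input_grid) 

    for i in range(steps): 
        # rotate the time levels (arrays are rebound; grid is rebuilt below) 
        previous_grid = old_grid 
        old_grid = grid 

        # accumulate the neighbour contributions with slicing 
        acc = np.zeros_like(old_grid) 
        acc[:-1, :] += old_grid[1:, :]    # old_grid[x+1, y] wherever x+1 < l_x 
        acc[2:, :]  += old_grid[1:-1, :]  # old_grid[x-1, y] wherever x-1 > 0 
        acc[:, :-1] += old_grid[:, 1:]    # old_grid[x, y+1] wherever y+1 < l_y 
        acc[:, 2:]  += old_grid[:, 1:-1]  # old_grid[x, y-1] wherever y-1 > 0 

        grid = acc / 2.0 - previous_grid 

    return grid 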

Now I would like to compare the performance with NumbaPro CUDA.

This is the first time I have written code for CUDA, and I came up with the code below.

from numbapro import cuda, float32, int16 
import numpy as np 

@cuda.jit(argtypes=(float32[:,:], float32[:,:], float32[:,:], int16, int16, int16)) 
def kernel(grid, old_grid, previous_grid, steps, l_x, l_y): 

    x, y = cuda.grid(2) 

    for i in range(steps): 
        previous_grid[x,y] = old_grid[x,y] 
        old_grid[x,y] = grid[x,y] 

    for i in range(steps): 

        grid[x,y] = 0.0 

        if 0 < x+1 and x+1 < l_x: 
            grid[x,y] += old_grid[x+1,y] 
        if 0 < x-1 and x-1 < l_x: 
            grid[x,y] += old_grid[x-1,y] 
        if 0 < y+1 and y+1 < l_x: 
            grid[x,y] += old_grid[x,y+1] 
        if 0 < y-1 and y-1 < l_x: 
            grid[x,y] += old_grid[x,y-1] 

        grid[x,y] /= 2.0 
        grid[x,y] -= previous_grid[x,y] 


def fdtd(input_grid, steps): 

    grid = cuda.to_device(input_grid) 
    old_grid = cuda.to_device(np.zeros_like(input_grid)) 
    previous_grid = cuda.to_device(np.zeros_like(input_grid)) 

    l_x = input_grid.shape[0] 
    l_y = input_grid.shape[1] 

    kernel[(16,16),(32,8)](grid, old_grid, previous_grid, steps, l_x, l_y) 

    return grid.copy_to_host() 

Unfortunately, I get the following error:

File ".../fdtd_numbapro.py", line 98, in fdtd 
    return grid.copy_to_host() 
    File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/devicearray.py", line 142, in copy_to_host 
    File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 1702, in device_to_host 
    File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 772, in check_error 
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED 
Failed to copy memory D->H 

I have tried grid.to_host() as well, and that did not work either. CUDA is definitely working with NumbaPro on this system.

Answer


I made some small modifications to your original code to get it running with Parakeet:

1) Split compound comparisons such as "0 < x-1 < l_x" into "0 < x-1 and x-1 < l_x".

2) Replaced np.copyto with explicit slice assignment (previous_grid[:, :] = old_grid). Both changes are sketched right after this list.
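
Concretely, a version with both changes applied looks roughly like this (a sketch, not necessarily identical to the exact file I timed; the "from parakeet import jit" line assumes Parakeet's usual decorator import):

from parakeet import jit   # Parakeet's JIT decorator (assumed import path) 
import numpy as np 

@jit 
def fdtd(input_grid, steps): 
    grid = input_grid.copy() 
    old_grid = np.zeros_like(input_grid) 
    previous_grid = np.zeros_like(input_grid) 

    l_x = grid.shape[0] 
    l_y = grid.shape[1] 

    for i in range(steps): 
        # change 2: slice assignment instead of np.copyto 
        previous_grid[:, :] = old_grid 
        old_grid[:, :] = grid 

        for x in range(l_x): 
            for y in range(l_y): 
                grid[x, y] = 0.0 
                # change 1: compound comparisons split into two conditions 
                if 0 < x+1 and x+1 < l_x: 
                    grid[x, y] += old_grid[x+1, y] 
                if 0 < x-1 and x-1 < l_x: 
                    grid[x, y] += old_grid[x-1, y] 
                if 0 < y+1 and y+1 < l_y: 
                    grid[x, y] += old_grid[x, y+1] 
                if 0 < y-1 and y-1 < l_y: 
                    grid[x, y] += old_grid[x, y-1] 

                grid[x, y] /= 2.0 
                grid[x, y] -= previous_grid[x, y] 

    return grid 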

After that, I compared the Parakeet runtimes for the C, OpenMP and CUDA backends against the original Python time and Numba's autojit, on a 1000×1000 grid with steps = 20:

Parakeet (backend = c) cold: fdtd : 0.5590s 
Parakeet (backend = c) warm: fdtd : 0.1260s 

Parakeet (backend = openmp) cold: fdtd : 0.4317s 
Parakeet (backend = openmp) warm: fdtd : 0.1693s 

Parakeet (backend = cuda) cold: fdtd : 2.6357s 
Parakeet (backend = cuda) warm: fdtd : 0.2455s 

Numba (autojit) cold: 672.3666s 
Numba (autojit) warm: 657.8858s 

Python: 203.3907s 
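
Roughly, the "cold" number is the first call (which includes compilation) and the "warm" number is a repeat call with the same arguments. A minimal timing harness along those lines looks like this (a sketch; fdtd is the modified function above, and the commented-out backend switch is an assumption about Parakeet's configuration API rather than verbatim from my benchmark script):

import time 
import numpy as np 

# Backend selection (assumption about Parakeet's configuration API): 
# import parakeet 
# parakeet.config.backend = 'c'   # or 'openmp', 'cuda' 

input_grid = np.random.rand(1000, 1000)   # 1000x1000 grid, as in the comparison above 
steps = 20 

start = time.time() 
fdtd(input_grid, steps)                   # cold: first call, includes compilation 
print("cold: fdtd : %.4fs" % (time.time() - start)) 

start = time.time() 
fdtd(input_grid, steps)                   # warm: compiled code is reused 
print("warm: fdtd : %.4fs" % (time.time() - start)) 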

Since there is very little readily available parallelism in your code, the parallel backends actually do worse than the sequential one. This is mostly due to differences in the loop optimizations Parakeet runs for each backend, along with some extra overhead associated with CUDA memory transfers and with launching OpenMP thread groups. I'm not sure why Numba's autojit is so much slower here; I'm sure it would be faster with type annotations or by using NumbaPro instead.


Thanks for testing this! I wasn't aware of your Parakeet project; I will take a closer look at it. It is surprising that Numba is so slow here. I don't recall such poor performance when using the autojit feature. – FRidh