Getting started with shared memory in PyCUDA

I'm trying to understand shared memory by experimenting with the following code:

import pycuda.driver as drv 
import pycuda.tools 
import pycuda.autoinit 
import numpy 
from pycuda.compiler import SourceModule 

src='''
__global__ void reduce0(float *g_idata, float *g_odata) {
    extern __shared__ float sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid] = g_idata[i];
    __syncthreads();
    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
'''

mod = SourceModule(src) 
reduce0=mod.get_function('reduce0') 

a = numpy.random.randn(400).astype(numpy.float32) 

dest = numpy.zeros_like(a) 
reduce0(drv.In(a),drv.Out(dest),block=(400,1,1))

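I can't see anything obviously wrong, but I keep getting synchronization errors and the kernel won't run.

Any help is greatly appreciated.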


2015-06-07 reckoner

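You haven't specified a shared memory size for the kernel launch. – talonmies

It turns out that extern __shared__ float sdata[]; is the form the nvcc compiler requires. – reckoner

Yes, but when a kernel uses dynamically allocated shared memory, you must pass the shared memory allocation size, in bytes, as a kernel launch parameter. The code you posted doesn't do that. – talonmies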

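When you declare

extern __shared__ float sdata[];

you are telling the compiler that the caller will provide the shared memory. In PyCUDA, you do that by passing shared=nnnn (a size in bytes) in the call that launches the CUDA function. In your case, that would be something like: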

reduce0(drv.In(a),drv.Out(dest),block=(400,1,1),shared=4*400)
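
The 4*400 above is 400 threads times 4 bytes per float. As a minimal sketch, assuming one float per thread as in reduce0, the size can be computed rather than hard-coded. The helper name dynamic_shared_bytes is hypothetical and not part of PyCUDA:

```python
import ctypes

def dynamic_shared_bytes(n_threads, element_type=ctypes.c_float):
    """Bytes of dynamic shared memory needed for one element per thread."""
    return n_threads * ctypes.sizeof(element_type)

block_size = 400
shared_size = dynamic_shared_bytes(block_size)

# With the kernel from the question, the launch would then read:
# reduce0(drv.In(a), drv.Out(dest), block=(block_size, 1, 1), shared=shared_size)
```

Deriving the size from the block size keeps the two from drifting apart if the block size is changed later.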

Alternatively, you can remove the extern keyword and declare the shared memory with a fixed size directly:

__shared__ float sdata[400];
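
A separate point, not raised in the thread: the loop in reduce0 reads sdata[tid + s] without a bounds check, so with a block of 400 threads (not a power of two) some reads appear to fall past the end of a 400-element array. The host-side Python sketch below replays only the kernel's index arithmetic to list those reads. The function name is hypothetical, and nothing here runs on the GPU:

```python
def out_of_bounds_reads(block_dim):
    """Replay reduce0's loop indices; return (s, tid, index) for reads past sdata."""
    bad = []
    s = 1
    while s < block_dim:
        for tid in range(block_dim):
            # Same condition as the kernel: only these threads do the add.
            if tid % (2 * s) == 0 and tid + s >= block_dim:
                bad.append((s, tid, tid + s))
        s *= 2
    return bad

print(out_of_bounds_reads(400))  # [(16, 384, 400), (32, 384, 416), (64, 384, 448)]
print(out_of_bounds_reads(512))  # []
```

If this reading is right, either a power-of-two block size such as 512 (with the input padded with zeros) or an added tid + s < blockDim.x check in the kernel would avoid the stray reads.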


2015-06-09 22:42:51 ime

Actually, I had to drop the extern keyword and use the 'shared=' parameter to get it to work. – reckoner
