2016-11-12

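CUDA illegal memory access, possibly due to 'insufficient' shared memory

I have a simple CUDA kernel that accumulates a vector using a basic reduction. I am extending it to handle larger data by splitting the work across multiple blocks. However, my assumption about how much shared memory the kernel needs turned out to be wrong, and the launch fails with an illegal memory access. The error goes away when I increase the allocation, but I want to understand why. Here is the code I am talking about: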

The kernel:

__global__ static
void vec_add(int *buffer,
             int numElem,          // The actual number of elements
             int numIntermediates) // The next power of two of numElem
{
    extern __shared__ unsigned int interim[];

    int index = blockDim.x * blockIdx.x + threadIdx.x;

    // Copy global intermediate values into shared memory.
    interim[threadIdx.x] =
        (index < numElem) ? buffer[index] : 0;

    __syncthreads();

    // numIntermediates *must* be a power of two!
    for (unsigned int s = numIntermediates / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            interim[threadIdx.x] += interim[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        buffer[blockIdx.x] = interim[0];
    }
}

This is the caller:

void accumulate(int *buffer, int numElem)
{
    unsigned int numReductionThreads =
        nextPowerOfTwo(numElem); // A routine to return the next higher power of 2.

    const unsigned int maxThreadsPerBlock = 1024; // deviceProp.maxThreadsPerBlock

    unsigned int numThreadsPerBlock, numReductionBlocks, reductionBlockSharedDataSize;

    while (numReductionThreads > 1) {

        numThreadsPerBlock = numReductionThreads < maxThreadsPerBlock ?
            numReductionThreads : maxThreadsPerBlock;

        numReductionBlocks = (numReductionThreads + numThreadsPerBlock - 1) / numThreadsPerBlock;

        reductionBlockSharedDataSize = numThreadsPerBlock * sizeof(unsigned int);

        vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
            (buffer, numElem, numReductionThreads);

        numReductionThreads = nextPowerOfTwo(numReductionBlocks);
    }
}

I tried this code on a sample set of 1152 elements. My GPU has the following configuration:

Type: Quadro 600
MaxThreadsPerBlock: 1024
MaxSharedMemory: 48KB

OUTPUT:

Loop 1: numElem = 1152, numReductionThreads = 2048, numReductionBlocks = 2, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 4096 
Loop 2: numElem = 1152, numReductionThreads = 2, numReductionBlocks = 1, numThreadsPerBlock = 2, reductionBlockSharedDataSize = 8 
CUDA Error 77: an illegal memory access was encountered 

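Suspecting that my 'interim' shared memory was causing the illegal memory access, I arbitrarily doubled the shared memory size by changing the following line: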

reductionBlockSharedDataSize = 2 * numThreadsPerBlock * sizeof(unsigned int); 

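And my kernel launch started working fine!

What I don't understand is why I had to provide this extra shared memory to make my problem go away (temporarily).

As a further experiment to check this magic number, I ran my code with a larger data set of 6912 points. This time, even 2X or 4X did not help.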

Loop 1: numElem = 6912, numReductionThreads = 8192, numReductionBlocks = 8, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 16384 

Loop 2: numElem = 6912, numReductionThreads = 8, numReductionBlocks = 1, numThreadsPerBlock = 8, reductionBlockSharedDataSize = 128 
CUDA Error 77: an illegal memory access was encountered 

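However, when I increased the shared memory size by a factor of 8, the problem disappeared again.

Of course, I cannot keep picking this scaling factor arbitrarily for larger and larger data sets, because I would soon hit the 48KB shared memory limit. So I would like to know the legitimate way to fix my problem.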
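The scaling factors reported above are consistent with the kernel's loop bound: the loop starts at `s = numIntermediates / 2`, while only `numThreadsPerBlock` elements of `interim` are allocated. The host-side C++ sketch below (no GPU required) replays that index arithmetic. `highestInterimIndex` and `minScaleFactor` are hypothetical helper names introduced here for illustration; they are not part of the original code.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): returns the largest index
// of interim[] that one block of vec_add touches, given the block size and
// the numIntermediates argument passed to the kernel.
unsigned int highestInterimIndex(unsigned int threadsPerBlock,
                                 unsigned int numIntermediates)
{
    assert(threadsPerBlock > 0);
    // Every thread writes interim[threadIdx.x] during the initial copy.
    unsigned int highest = threadsPerBlock - 1;
    for (unsigned int s = numIntermediates / 2; s > 0; s >>= 1) {
        // Threads with threadIdx.x < s read interim[threadIdx.x + s].
        unsigned int activeThreads = (threadsPerBlock < s) ? threadsPerBlock : s;
        unsigned int touched = (activeThreads - 1) + s;
        if (touched > highest) highest = touched;
    }
    return highest;
}

// Hypothetical helper: smallest power-of-two multiplier k such that
// k * threadsPerBlock shared-memory elements cover every touched index.
unsigned int minScaleFactor(unsigned int threadsPerBlock,
                            unsigned int numIntermediates)
{
    unsigned int needed =
        highestInterimIndex(threadsPerBlock, numIntermediates) + 1;
    unsigned int k = 1;
    while (k * threadsPerBlock < needed) k *= 2;
    return k;
}
```

With the values from the logs, `highestInterimIndex(1024, 2048)` is 2047, so a 2X allocation is just enough for 1152 elements, while `highestInterimIndex(1024, 8192)` is 5119, which exceeds a 4X allocation (4096 elements) and fits only at 8X. Note that an out-of-bounds shared memory access is not guaranteed to raise an error on every launch, so the absence of error 77 does not by itself prove the indexing is correct.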

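Did you work out the largest index your loop accesses? Shouldn't numThreadsPerBlock be passed as the parameter instead? From a quick look I think the math is as follows: 'numIntermediates = 2048' (the next power of two after 1152), so 's = 1024' and 'threadIdx.x … – havogt

Thanks @havogt: you pointed out the root cause of the error! – gdilip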

Answer

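Thanks to @havogt for pointing out the out-of-bounds access. The problem was that I passed the wrong argument as numIntermediates to the vec_add kernel. The intent was for each block to reduce exactly as many data points as it has threads, which is numThreadsPerBlock (1024 in the first pass), not the total numReductionThreads. The corrected launch: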

vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>> 
     (buffer, numElem, numThreadsPerBlock);
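To sanity-check the corrected argument without a GPU, the per-block tree reduction can be emulated on the host. This is a minimal C++ sketch, assuming the loop bound equals the block size (the role `numThreadsPerBlock` now plays inside the kernel). `reduceBlockOnHost` is a hypothetical name, and the sequential inner loop only stands in for threads that would run in parallel on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for one block of vec_add.
// blockSize plays the role of numThreadsPerBlock and must be a power of two.
int reduceBlockOnHost(const std::vector<int>& buffer,
                      std::size_t blockStart,
                      std::size_t blockSize)
{
    assert(blockSize > 0 && (blockSize & (blockSize - 1)) == 0);

    // Mirrors the shared-memory copy: out-of-range elements are padded with 0.
    std::vector<int> interim(blockSize, 0);
    for (std::size_t t = 0; t < blockSize; ++t) {
        std::size_t index = blockStart + t;
        if (index < buffer.size()) interim[t] = buffer[index];
    }

    // Tree reduction bounded by the block size, so t + s < blockSize holds.
    for (std::size_t s = blockSize / 2; s > 0; s >>= 1) {
        for (std::size_t t = 0; t < s; ++t) {
            // at() throws std::out_of_range if the index leaves the allocation.
            interim[t] += interim.at(t + s);
        }
    }
    return interim[0];
}
```

For 1152 elements that are all 1, the two first-pass blocks (`blockStart` 0 and 1024, `blockSize` 1024) return 1024 and 128, and a second pass over those two partial sums with `blockSize` 2 returns 1152, with every access staying inside the `blockSize` elements that were allocated.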