CUDA illegal memory access, possibly due to 'insufficient' shared memory


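I have a simple CUDA kernel that accumulates a vector using a basic reduction. I extended it to handle larger data by splitting the work across multiple blocks. However, my assumption about the right amount of shared memory to allocate for the kernel is failing, resulting in an illegal memory access. The error goes away when I increase the allocation, but I want to know why. Here is the code I am talking about: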

The kernel:

__global__ static
void vec_add(int *buffer,
             int numElem,          // The actual number of elements
             int numIntermediates) // The next power of two of numElem
{
    extern __shared__ unsigned int interim[];

    int index = blockDim.x * blockIdx.x + threadIdx.x;

    // Copy global intermediate values into shared memory.
    interim[threadIdx.x] =
        (index < numElem) ? buffer[index] : 0;

    __syncthreads();

    // numIntermediates *must* be a power of two!
    for (unsigned int s = numIntermediates / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            interim[threadIdx.x] += interim[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        buffer[blockIdx.x] = interim[0];
    }
}

Here is the caller:

void accumulate(int *buffer, int numElem)
{
    unsigned int numReductionThreads =
        nextPowerOfTwo(numElem); // A routine to return the next higher power of 2.

    const unsigned int maxThreadsPerBlock = 1024; // deviceProp.maxThreadsPerBlock

    unsigned int numThreadsPerBlock, numReductionBlocks, reductionBlockSharedDataSize;

    while (numReductionThreads > 1) {

        numThreadsPerBlock = numReductionThreads < maxThreadsPerBlock ?
            numReductionThreads : maxThreadsPerBlock;

        numReductionBlocks = (numReductionThreads + numThreadsPerBlock - 1) / numThreadsPerBlock;

        reductionBlockSharedDataSize = numThreadsPerBlock * sizeof(unsigned int);

        vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
            (buffer, numElem, numReductionThreads);

        numReductionThreads = nextPowerOfTwo(numReductionBlocks);
    }
}
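The launch configuration this loop produces can be traced on the host without a GPU. The following is a minimal sketch in plain C++ that mirrors the loop above. The `Pass` struct and `planPasses` are hypothetical names introduced for illustration, and `nextPowerOfTwo` is an assumed implementation (smallest power of two greater than or equal to `n`), since the original routine is not shown.

```cpp
#include <vector>

// One kernel launch of the multi-pass reduction (hypothetical helper type).
struct Pass {
    unsigned threads;          // numReductionThreads for this pass
    unsigned blocks;           // numReductionBlocks
    unsigned threadsPerBlock;  // numThreadsPerBlock
    unsigned sharedBytes;      // reductionBlockSharedDataSize
};

// ASSUMPTION: returns the smallest power of two >= n (valid for n <= 2^31).
unsigned nextPowerOfTwo(unsigned n) {
    unsigned p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Mirrors the while loop in accumulate(), recording each launch instead of
// performing it.
std::vector<Pass> planPasses(unsigned numElem, unsigned maxThreadsPerBlock = 1024) {
    std::vector<Pass> passes;
    unsigned numReductionThreads = nextPowerOfTwo(numElem);
    while (numReductionThreads > 1) {
        unsigned tpb = numReductionThreads < maxThreadsPerBlock
                           ? numReductionThreads : maxThreadsPerBlock;
        unsigned blocks = (numReductionThreads + tpb - 1) / tpb;
        unsigned shared = static_cast<unsigned>(tpb * sizeof(unsigned int));
        passes.push_back(Pass{numReductionThreads, blocks, tpb, shared});
        numReductionThreads = nextPowerOfTwo(blocks);
    }
    return passes;
}
```

With one `unsigned int` per thread, the per-block shared memory request never exceeds `maxThreadsPerBlock * sizeof(unsigned int)`, regardless of how large `numElem` is.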

I tried this code with a sample set of 1152 elements on my GPU, which has the following configuration: Type: Quadro 600, MaxThreadsPerBlock: 1024, MaxSharedMemory: 48KB

Output:

Loop 1: numElem = 1152, numReductionThreads = 2048, numReductionBlocks = 2, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 4096 
Loop 2: numElem = 1152, numReductionThreads = 2, numReductionBlocks = 1, numThreadsPerBlock = 2, reductionBlockSharedDataSize = 8 
CUDA Error 77: an illegal memory access was encountered

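Suspecting that my 'interim' shared memory was causing the illegal memory access, I arbitrarily doubled the shared memory in the following line: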

reductionBlockSharedDataSize = 2 * numThreadsPerBlock * sizeof(unsigned int);

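And my kernel started working fine!

What I don't understand is why I had to provide this extra shared memory to make my problem go away (temporarily).

As a further experiment to check this magic number, I ran my code with a larger data set of 6912 points. This time, even 2X or 4X did not help.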

Loop 1: numElem = 6912, numReductionThreads = 8192, numReductionBlocks = 8, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 16384 

Loop 2: numElem = 6912, numReductionThreads = 8, numReductionBlocks = 1, numThreadsPerBlock = 8, reductionBlockSharedDataSize = 128 
CUDA Error 77: an illegal memory access was encountered

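However, when I increased the shared memory size by a factor of 8, the problem disappeared again.

Of course, I can't keep picking this scaling factor arbitrarily for larger and larger data sets, because I will soon hit the 48KB shared memory limit. So I would like to know the legitimate way to fix my problem.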


2016-11-12 gdilip

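You compute indices that are too large in your loop: what if you pass numThreadsPerBlock as the parameter instead? From a quick look I think the math is as follows: 'numIntermediates = 2048' (the next power of 2 of 1152), then 's = 1024' and 'threadIdx.x – havogt

Thanks @havogt: you pointed out the source of the error! – gdilip

Thanks to @havogt for pointing out the out-of-bounds access. The problem was that I was passing the wrong argument as numIntermediates to the vec_add kernel. The intent was for each block to operate on exactly as many data points as it has threads, so the value passed should have been numThreadsPerBlock (1024 in the first pass) all along.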

vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>> 
     (buffer, numElem, numThreadsPerBlock);
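To see why scaling factors of 2 and 8 happened to hide the error, the index arithmetic of the reduction loop can be traced on the host. The sketch below is plain C++ and needs no GPU. `maxInterimIndex` is a hypothetical helper name, not part of the original code, and it assumes the loop is exactly the one in vec_add.

```cpp
#include <algorithm>
#include <cassert>

// Highest element of interim[] that a block of blockDim threads touches when
// the reduction loop starts at s = numIntermediates / 2.
unsigned maxInterimIndex(unsigned blockDim, unsigned numIntermediates) {
    assert(blockDim > 0);
    // The initial copy writes interim[threadIdx.x] for every thread.
    unsigned maxIndex = blockDim - 1;
    for (unsigned s = numIntermediates / 2; s > 0; s >>= 1) {
        // Only threads with threadIdx.x < s are active; the block has no
        // thread beyond blockDim - 1.
        unsigned lastActive = std::min(blockDim, s) - 1;
        // An active thread reads interim[threadIdx.x + s].
        maxIndex = std::max(maxIndex, lastActive + s);
    }
    return maxIndex;
}
```

With the original argument, `maxInterimIndex(1024, 2048)` is 2047, so 2048 elements are needed (2x the 1024 allocated), and `maxInterimIndex(1024, 8192)` is 5119, which overruns a 4x allocation (4096 elements) but fits in an 8x one (8192 elements). With the corrected argument, `maxInterimIndex(1024, 1024)` is 1023, so `numThreadsPerBlock * sizeof(unsigned int)` bytes are sufficient and no scaling factor is required.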


2016-11-13 22:57:49 gdilip
