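CUDA illegal memory access, possibly due to 'insufficient' shared memory
I have a simple CUDA kernel that does vector accumulation through a basic reduction. I am scaling it up to handle larger data by splitting the work across multiple blocks. However, my assumption about the appropriate amount of shared memory to allocate for the kernel is failing, resulting in an illegal memory access. The error goes away when I increase the allocation, but I want to know why. Here is the code I am talking about.
The kernel: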
__global__ static
void vec_add(int *buffer,
             int numElem,            // The actual number of elements
             int numIntermediates)   // The next power of two of numElem
{
    extern __shared__ unsigned int interim[];
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    // Copy global intermediate values into shared memory.
    interim[threadIdx.x] =
        (index < numElem) ? buffer[index] : 0;
    __syncthreads();
    // numIntermediates *must* be a power of two!
    for (unsigned int s = numIntermediates/2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            interim[threadIdx.x] += interim[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        buffer[blockIdx.x] = interim[0];
    }
}
Here is the caller:
void accumulate (int* buffer, int numElem)
{
    unsigned int numReductionThreads =
        nextPowerOfTwo(numElem); // A routine to return the next higher power of 2.
    const unsigned int maxThreadsPerBlock = 1024; // deviceProp.maxThreadsPerBlock
    unsigned int numThreadsPerBlock, numReductionBlocks, reductionBlockSharedDataSize;
    while (numReductionThreads > 1) {
        numThreadsPerBlock = numReductionThreads < maxThreadsPerBlock ?
            numReductionThreads : maxThreadsPerBlock;
        numReductionBlocks = (numReductionThreads + numThreadsPerBlock - 1)/numThreadsPerBlock;
        reductionBlockSharedDataSize = numThreadsPerBlock * sizeof(unsigned int);
        vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
            (buffer, numElem, numReductionThreads);
        numReductionThreads = nextPowerOfTwo(numReductionBlocks);
    }
}
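The caller relies on a nextPowerOfTwo routine that is not shown. Judging from the logged values (2 maps to 2, 8 maps to 8, 1152 maps to 2048), it appears to return the smallest power of two that is greater than or equal to its argument. A minimal sketch under that assumption follows; the body is hypothetical and is not the original routine.

```cpp
// Hypothetical stand-in for the nextPowerOfTwo routine used by the caller.
// Assumption: it returns the smallest power of two that is >= n, which
// matches the values in the logged output. Returns 1 for n == 0.
// Only valid for n <= 2^31; larger values would overflow the shift.
static unsigned int nextPowerOfTwo(unsigned int n)
{
    unsigned int p = 1;
    while (p < n) {
        p <<= 1;  // double until p reaches or passes n
    }
    return p;
}
```

With this definition the while loop in accumulate terminates once a single block remains, because nextPowerOfTwo(1) is 1.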
I tried this code with a sample set of 1152 elements on my GPU, which has the following configuration: Type: Quadro 600, MaxThreadsPerBlock: 1024, MaxSharedMemory: 48KB
OUTPUT:
Loop 1: numElem = 1152, numReductionThreads = 2048, numReductionBlocks = 2, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 4096
Loop 2: numElem = 1152, numReductionThreads = 2, numReductionBlocks = 1, numThreadsPerBlock = 2, reductionBlockSharedDataSize = 8
CUDA Error 77: an illegal memory access was encountered
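Suspecting that my 'interim' shared memory was causing the illegal memory access, I arbitrarily doubled the shared memory size in the following line: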
reductionBlockSharedDataSize = 2 * numThreadsPerBlock * sizeof(unsigned int);
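And my kernel launch worked fine!
What I don't understand is why I had to provide this extra shared memory to make my problem go away (temporarily).
As a further experiment to check this magic number, I ran my code with a larger data set of 6912 points. This time, even 2X or 4X did not help.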
Loop 1: numElem = 6912, numReductionThreads = 8192, numReductionBlocks = 8, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 16384
Loop 2: numElem = 6912, numReductionThreads = 8, numReductionBlocks = 1, numThreadsPerBlock = 8, reductionBlockSharedDataSize = 128
CUDA Error 77: an illegal memory access was encountered
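However, when I increased the shared memory size by 8X, the problem went away again.
Of course, I can't keep picking this scaling factor arbitrarily for larger and larger data sets, because I will soon hit the 48KB shared memory limit. So I would like to know the legitimate way to fix my problem.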
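The scaling factors observed above are consistent with the kernel's loop starting at numIntermediates/2 rather than at half the block size, so threads read interim[threadIdx.x + s] well past the blockDim.x elements that were allocated. A small host-side sketch of that arithmetic follows; the helper names maxInterimIndex and requiredSharedBytes are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>

// Highest interim[] index touched by one block of the kernel as posted.
// Assumes threadsPerBlock >= 1.
unsigned int maxInterimIndex(unsigned int numIntermediates,
                             unsigned int threadsPerBlock)
{
    // The initial copy writes interim[0 .. threadsPerBlock - 1].
    unsigned int highest = threadsPerBlock - 1;
    // The first loop iteration uses the largest stride.
    unsigned int s = numIntermediates / 2;
    if (s > 0) {
        // Only threads with threadIdx.x < s take part, and a block has
        // threadsPerBlock threads, so the last active thread is:
        unsigned int lastActive = std::min(s, threadsPerBlock) - 1;
        highest = std::max(highest, lastActive + s);
    }
    return highest;
}

// Bytes of dynamic shared memory needed to keep every access in bounds.
std::size_t requiredSharedBytes(unsigned int numIntermediates,
                                unsigned int threadsPerBlock)
{
    std::size_t elems =
        static_cast<std::size_t>(maxInterimIndex(numIntermediates, threadsPerBlock)) + 1;
    return elems * sizeof(unsigned int);
}
```

For the first launch with 1152 elements (numIntermediates = 2048, 1024 threads) the highest index is 2047, which needs 8192 bytes, exactly twice the 4096 that were allocated. For 6912 elements (numIntermediates = 8192) the highest index is 5119, which needs 20480 bytes: more than the 4X allocation of 16384 and less than the 8X allocation of 32768. Note that the elements beyond blockDim.x are never written by the kernel, so a larger allocation makes the reads legal without making the sum trustworthy.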
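You compute an index that is too large in your loop. Is it fixed if you use numThreadsPerBlock as the parameter? From a quick look I think the math is as follows: 'numIntermediates = 2048' (the next power of two of 1152), then 's = 1024' and 'threadIdx.x ... – havogt
Thanks @havogt: you pointed out the root cause of the error! – gdilip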
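Following havogt's comment, one way to keep the reduction inside the allocated shared memory is to bound the loop by the block size instead of by numIntermediates. The sketch below is one possible arrangement, not the poster's final code. It makes several assumptions beyond the original: a separate output buffer, so that no block overwrites an element another block may not have read yet; passing the number of remaining values to each pass instead of the original numElem; a hypothetical nextPowerOfTwo stand-in; and a small main that sums 1152 ones.

```cuda
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Block-level tree reduction. The loop starts at blockDim.x / 2, so the
// highest index read is blockDim.x - 1, which fits an allocation of
// blockDim.x ints. Assumes blockDim.x is a power of two.
__global__ void vec_add(const int *in, int *out, unsigned int numElem)
{
    extern __shared__ int interim[];
    unsigned int index = blockDim.x * blockIdx.x + threadIdx.x;

    // Pad with zeros past the end of the input.
    interim[threadIdx.x] = (index < numElem) ? in[index] : 0;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            interim[threadIdx.x] += interim[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        out[blockIdx.x] = interim[0];  // one partial sum per block
    }
}

// Hypothetical stand-in for the question's nextPowerOfTwo routine.
static unsigned int nextPowerOfTwo(unsigned int n)
{
    unsigned int p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Sums the first numElem (>= 1) ints of d_in. Both device buffers must hold
// at least numElem ints; both are overwritten. Returns false on a CUDA error.
static bool accumulate(int *d_in, int *d_tmp, unsigned int numElem, int *result)
{
    const unsigned int maxThreadsPerBlock = 1024;  // as in the question
    unsigned int remaining = numElem;
    while (remaining > 1) {
        unsigned int threads = nextPowerOfTwo(remaining);
        if (threads > maxThreadsPerBlock) threads = maxThreadsPerBlock;
        unsigned int blocks = (remaining + threads - 1) / threads;
        size_t sharedBytes = threads * sizeof(int);

        vec_add<<<blocks, threads, sharedBytes>>>(d_in, d_tmp, remaining);
        if (cudaGetLastError() != cudaSuccess) return false;

        std::swap(d_in, d_tmp);  // partial sums become the next input
        remaining = blocks;
    }
    // cudaMemcpy waits for the kernels and reports any execution error.
    return cudaMemcpy(result, d_in, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
}

int main()
{
    const unsigned int n = 1152;  // sample size from the question
    std::vector<int> host(n, 1);
    int *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_a, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&d_b, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemcpy(d_a, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int sum = 0;
    bool ok = accumulate(d_a, d_b, n, &sum);
    std::printf("ok = %d, sum = %d, expected = %u\n", ok, sum, n);

    cudaFree(d_a);
    cudaFree(d_b);
    return ok ? 0 : 1;
}
```

With the loop bounded by blockDim.x, the shared memory request stays at numThreadsPerBlock * sizeof(int) for any input size, so no scaling factor is needed and the 48KB limit is not a concern for 1024-thread blocks.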