Problem accessing an array by offset in CUDA

This problem most likely has a simple solution.

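Each thread I spawn is to be initialized with a starting value. For example, if I have a character set, char charSet[27] = "abcdefghijklmnopqrstuvwxyz", I spawn 26 threads, so thread 0 (threadIdx.x == 0) corresponds to charSet[0] = 'a'. Simple enough.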

I thought I had figured out a way to do this, until I checked what my threads were actually doing...

Here is a sample program I wrote:

#include "cuda_runtime.h" 
#include "device_launch_parameters.h" 
#include <stdio.h> 
#include <math.h> 
#include <stdlib.h> 

__global__ void example(int offset, int reqThreads){ 
//Declarations 
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x; 

    if(idx < reqThreads){ 
     unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x; //Used to initialize array <-----Problem is here 
     printf("%d ", tid); 
    }  
} 

int main(void){ 
    //Declarations 
    int minLength = 1; 
    int maxLength = 2; 
    int offset; 
    int totalThreads; 
    int reqThreads; 
    int base = 26; 
    int maxThreads = 512; 
    int blocks; 
    int i,j; 

    for(i = minLength; i<=maxLength; i++){ 
     offset = i; 

     //Calculate parameters 
     reqThreads = (int) pow((double) base, (double) offset); //Casting I would never do, but works here 
     totalThreads = reqThreads; 

     for(j = 1;(totalThreads % maxThreads) != 0; j++) totalThreads += 1; //Create a multiple of 512 

     blocks = totalThreads/maxThreads; 

     //Call the kernel 

     example<<<blocks, totalThreads>>>(offset, reqThreads); 
     cudaThreadSynchronize(); 
     printf("\n\n"); 
    } 

    system("pause"); 
    return 0; 
}

My reasoning was that this statement

unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x;

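would let me introduce an offset. If offset is 2, then for threadIdx.x = 0 the product threadIdx.x * offset is 0, for threadIdx.x = 1 it is 2, for threadIdx.x = 2 it is 4, and so on. That is definitely not what happens. The output of the program above is fine when offset == 1:

0 1 2 3 4 5...26.

But when offset == 2, I get:

1344 1346 1348 1350...

In fact, these values are way outside the bounds of my array. So I'm not sure what is going wrong.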

Any constructive input is appreciated.

Source

2013-12-09 Mlagma

I don't think you understand the CUDA thread and block concepts correctly. Please go through this [link](http://docs.nvidia.com/cuda/cuda-c-programming-guide/). –

@SagarMasuti Could you elaborate on where my understanding is off? – Mlagma

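My apologies if I misunderstood you. According to your explanation you only need 26 threads, but in the first iteration you launch the kernel with blocks = 1 and threads = 512 (512 threads in total), and in the second iteration with blocks = 2 and threads = 1024 (2048 threads in total). –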

I believe your kernel call should look like this:

example<<<blocks, maxThreads>>>(offset, reqThreads);

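Your intent is to launch threadblocks of 512 threads each, so that number (maxThreads) should be your second kernel configuration parameter, which is the number of threads per block.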
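The launch arithmetic can be checked on the host before touching the kernel. The following is a minimal sketch in plain C; the helper names `blocks_needed` and `threads_launched` are illustrative and do not appear in the original program. It assumes, as described above, that the second launch parameter is the number of threads per block.

```c
#include <assert.h>

/* Number of blocks needed to cover reqThreads work items when each
   block holds threadsPerBlock threads (ceiling division).
   Hypothetical helper, not part of the original program. */
int blocks_needed(int reqThreads, int threadsPerBlock)
{
    assert(reqThreads >= 0 && threadsPerBlock > 0);
    return (reqThreads + threadsPerBlock - 1) / threadsPerBlock;
}

/* Total number of threads created by a launch <<<blocks, threadsPerBlock>>>. */
int threads_launched(int blocks, int threadsPerBlock)
{
    return blocks * threadsPerBlock;
}
```

For reqThreads = 676 and maxThreads = 512 this gives 2 blocks. Passing maxThreads as the second parameter launches 2 * 512 = 1024 threads, whereas passing the rounded-up total (1024) launches 2 * 1024 = 2048 threads, which is the over-launch pointed out in the comments.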

Also, this call is deprecated:

cudaThreadSynchronize();

Use this instead:

cudaDeviceSynchronize();

If you produce a large amount of output from the kernel with printf, you can lose some of the output if you exceed the buffer.

Finally, I'm not sure your reasoning about the range of indices to be printed is correct.

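When offset = 2 (the second pass through the loop), 26^2 = 676, so you end up with 1024 threads (in 2 threadblocks, if you make the fix above). The second threadblock will have

tid = (2*threadIdx.x) + blockDim.x*blockIdx.x; 
     (0..163)  (512)   (1)

so the second threadblock should print indices from 512 (minimum) up to (2 * 163) + 512 = 838

(163 = 675 - 512, because only threads with threadIdx.x + 512 < 676 pass the check)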

The first threadblock should print the indices:

tid = (2*threadIdx.x) + blockDim.x * blockIdx.x 
      (0..511)  (512)   (0)

that is, the even values from 0 to 1022
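The ranges can be double-checked by replaying the kernel's arithmetic on the host. This is only a sketch in plain C: `tid_range` is a hypothetical helper, and its parameters stand in for the CUDA built-ins `threadIdx.x`, `blockIdx.x` and `blockDim.x`.

```c
#include <assert.h>

/* Emulates the kernel's index arithmetic on the host for one block.
   Writes the smallest and largest tid that would be printed and returns
   the number of threads in the block that pass the idx < reqThreads check.
   If no thread passes, *minTid and *maxTid are left untouched. */
int tid_range(int offset, int reqThreads, int blockDimX, int blockIdxX,
              int *minTid, int *maxTid)
{
    int active = 0;
    assert(blockDimX > 0 && minTid != 0 && maxTid != 0);
    for (int threadIdxX = 0; threadIdxX < blockDimX; threadIdxX++) {
        int idx = threadIdxX + blockIdxX * blockDimX;  /* global thread index */
        if (idx >= reqThreads)
            continue;                                  /* fails the thread check */
        /* Same formula as in the question's kernel. */
        int tid = offset * threadIdxX + blockIdxX * blockDimX;
        if (active == 0 || tid < *minTid) *minTid = tid;
        if (active == 0 || tid > *maxTid) *maxTid = tid;
        active++;
    }
    return active;
}
```

Calling it with blockDimX = 512 reproduces the corrected launch, one block at a time. Calling it with blockDimX = 1024 reproduces the original launch: block 0 alone then covers global indices 0 to 675, so tid runs up to 2 * 675 = 1350, which matches the tail of the output quoted in the question (1344 1346 1348 1350).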

Source

2013-12-09 03:26:22

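Yes, my logic went wrong here. The intent was to have only 676 threads, but I rounded that up to 1024 for the launch. Specifically, I will have 676 combinations of length 2, for a total of 1352 characters. Each thread is offset by 2 characters. Each thread reads its 2 characters in a loop (the threads are offset from one another by 2), for 676 threads in the end. – Mlagma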

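Actually, I forgot that you have a thread check, so I have revised my answer. Your thread check (which is good) prevents any thread numbered above 676 from executing. This does not affect any thread in the first block, but it does stop some of the threads in the second block from executing. However, I think it will still produce odd indices for you. –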

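Something is definitely off. In my full program, the initialization of the threads only works for thread 0. It has to be 'tid'. When I run this example, the numbers look completely random whenever the offset is not 1. (I don't use printf in my actual program; it is only here to see what is going on.) – Mlagma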
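Going by the layout described in the comments (676 combinations of length 2 stored back to back in 1352 characters), one plausible fix is to scale the whole global thread index by the offset, rather than scaling only threadIdx.x. This is an assumption about the intended layout, not something confirmed in the thread, and the helper name `start_index` is illustrative. The arithmetic is shown as plain C so it can be checked on the host; inside the kernel the same expression would use threadIdx.x, blockIdx.x and blockDim.x.

```c
#include <assert.h>

/* Start position of a thread's slice in a packed buffer in which every
   thread owns `offset` consecutive characters (assumed layout).
   Hypothetical helper, not part of the original program. */
unsigned int start_index(unsigned int offset, unsigned int threadIdxX,
                         unsigned int blockIdxX, unsigned int blockDimX)
{
    assert(blockDimX > 0);
    unsigned int idx = threadIdxX + blockIdxX * blockDimX; /* global thread index */
    return offset * idx;                                   /* scale the whole index */
}
```

With offset 2 and 512 threads per block, the last active thread (global index 675) starts at 1350, so its two characters end at position 1351, the last element of a 1352-character buffer.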
