CUDA經同步問題

在概括內核這就是移動2D陣列一個空間到右（纏繞行邊界）的值，我所遇到的一個經同步問題。完整的代碼附在下面。CUDA經同步問題

的代碼是指對任意陣列寬度，高度陣列，線程塊的數目，並且每個塊的線程數工作。當選擇33的線程大小（即比完整的經線多一個線程）時，第33線程不與__syncthreads()同步被調用。這會導致輸出數據出現問題。這個問題只在存在多個warp時才存在，並且數組的寬度大於線程數（例如，width = 35和34個線程）。

以下是發生了什麼（現實中的陣列就需要有更多的元素內核產生錯誤）的，尺寸減小的例子。

初始陣列：

0 1 2 3 4 
5 6 7 8 9

預期結果：

4 0 1 2 3 
9 5 6 7 8

內核產地：

4 0 1 2 3 
8 5 6 7 8

第一行正確地完成（對於每個塊，如果有多於一個的），所有後續行都重複第二個最後一個值。我測試了這兩個不同的卡（8600GT和GTX280）並獲得相同的結果。我想知道這是否只是我的內核錯誤，或者是通過調整我的代碼無法解決的問題？

完整的源文件包含在下面。

謝謝。

#include <cstdio> 
#include <cstdlib> 

// A method to ensure all reads use the same logical layout. 
inline __device__ __host__ int loc(int x, int y, int width) 
{ 
    return y*width + x; 
} 

//kernel to shift all items in a 2D array one position to the right (wrapping around rows) 
__global__ void shiftRight (int* globalArray, int width, int height) 
{ 
    int temp1=0;   //temporary swap variables 
    int temp2=0; 

    int blockRange=0;  //the number of rows that a single block will shift 

    if (height%gridDim.x==0) //logic to account for awkward array sizes 
    blockRange = height/gridDim.x; 
    else 
    blockRange = (1+height/gridDim.x); 

    int yStart = blockIdx.x*blockRange; 
    int yEnd = yStart+blockRange; //the end condition for the y-loop 
    yEnd = min(height,yEnd);    //make sure that the array doesn't go out of bounds 

    for (int y = yStart; y < yEnd ; ++y) 
    { 
    //do the first read so the swap variables are loaded for the x-loop 
    temp1 = globalArray[loc(threadIdx.x,y,width)]; 
    //Each block shifts an entire row by itself, even if there are more columns than threads 
    for (int threadXOffset = threadIdx.x ; threadXOffset < width ; threadXOffset+=blockDim.x) 
    { 
     //blockDim.x is added so that we store the next round of values 
     //this has to be done now, because the next operation will 
     //overwrite one of these values 
     temp2 = globalArray[loc((threadXOffset + blockDim.x)%width,y,width)]; 
     __syncthreads(); //sync before the write to ensure all the values have been read 
     globalArray[loc((threadXOffset +1)%width,y,width)] = temp1; 
     __syncthreads(); //sync after the write so ensure all the values have been written 
     temp1 = temp2;  //swap the storage variables. 
    } 
    if (threadIdx.x == 0 && y == 0) 
     globalArray[loc(12,2,width)]=globalArray[67]; 
    } 
} 


int main (int argc, char* argv[]) 
{ 
    //set the parameters to be used 
    int width = 34; 
    int height = 3; 
    int threadsPerBlock=33; 
    int numBlocks = 1; 

    int memSizeInBytes = width*height*sizeof(int); 

    //create the host data and assign each element of the array to equal its index 
    int* hostData = (int*) malloc (memSizeInBytes); 
    for (int y = 0 ; y < height ; ++y) 
    for (int x = 0 ; x < width ; ++x) 
     hostData [loc(x,y,width)] = loc(x,y,width); 

    //create an allocate the device pointers 
    int* deviceData; 
    cudaMalloc (&deviceData ,memSizeInBytes); 
    cudaMemset ( deviceData,0,memSizeInBytes); 
    cudaMemcpy ( deviceData, hostData, memSizeInBytes, cudaMemcpyHostToDevice); 
    cudaThreadSynchronize(); 

    //launch the kernel 
    shiftRight<<<numBlocks,threadsPerBlock>>> (deviceData, width, height); 
    cudaThreadSynchronize(); 

    //copy the device data to a host array 
    int* hostDeviceOutput = (int*) malloc (memSizeInBytes); 
    cudaMemcpy (hostDeviceOutput, deviceData, memSizeInBytes, cudaMemcpyDeviceToHost); 
    cudaFree (deviceData); 

    //Print out the expected/desired device output 
    printf("---- Expected Device Output ----\n"); 
    printf(" | "); 
    for (int x = 0 ; x < width ; ++x) 
    printf("%4d ",x); 
    printf("\n---|-"); 
    for (int x = 0 ; x < width ; ++x) 
    printf("-----"); 
    for (int y = 0 ; y < height ; ++y) 
    { 
    printf("\n%2d | ",y); 
    for (int x = 0 ; x < width ; ++x) 
     printf("%4d ",hostData[loc((x-1+width)%width,y,width)]); 
    } 
    printf("\n\n"); 

    printf("---- Actual Device Output ----\n"); 
    printf(" | "); 
    for (int x = 0 ; x < width ; ++x) 
    printf("%4d ",x); 
    printf("\n---|-"); 
    for (int x = 0 ; x < width ; ++x) 
    printf("-----"); 
    for (int y = 0 ; y < height ; ++y) 
    { 
    printf("\n%2d | ",y); 
    for (int x = 0 ; x < width ; ++x) 
     printf("%4d ",hostDeviceOutput[loc(x,y,width)]); 
    } 
    printf("\n\n"); 
}

來源

2011-02-23 dmc

我遇到下面的代碼太麻煩，但你可以嘗試，以便有沒有把結果寫入到新的全局存儲器讀/寫同步問題。如果它仍然不起作用，那麼你知道syncthreads不是你的問題。 – jmilloy 2011-02-24 01:18:39

因爲不是所有的線程都執行相同數量的循環迭代，同步是一個問題！所有的線程應該總是和__syncthreads（）一樣。

我會建議將您的for循環最裏面弄成這個樣子：

for(int blockXOffset=0; blockXOffset < width; blockXOffset+=blockDim.x) { 
    int threadXOffset=blockXOffset+threadIdx.x; 
    bool isActive=(threadXOffset < width); 
    if (isActive) temp2 = globalArray[loc((threadXOffset + blockDim.x)%width,y,width)]; 
    __syncthreads(); 
    if (isActive) globalArray[loc((threadXOffset +1)%width,y,width)] = temp1; 
    __syncthreads(); 
    temp1 = temp2; 
}

來源

2011-02-26 18:18:03 CygnusX1

我有一個不同的方法。我在最後一次迭代之前結束了內部循環（所有線程都在內部循環內的最後一次迭代），並執行了循環外的所有線程都在同一點上的內部循環的最後一步。從邏輯上講，這應該是有效的（並且嚴格的「調試輸出」證實了這一點），但令人生氣的是，它並沒有導致非常奇怪的結果 - 在內核開始時顯式寫入到單獨的測試數組中並沒有發生。發現你的解決方案不會導致不合邏輯的行爲，並且看起來工作正常。謝謝。 – dmc 2011-02-27 21:42:55

從編程指南：

__syncthreads()被允許在有條件的代碼，但只有當條件估值在整個線程塊相同，否則代碼執行將有可能掛或產生意想不到的一面影響。

在我的例子，並不是所有線程都執行相同數量的循環迭代，所以同步不會發生。

來源

2011-02-24 07:58:33 dmc

CUDA經同步問題

回答

相關問題