Non-square matrix transpose with shared mem in CUDA

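I am trying to get a variant of the SDK matrix transpose sample working for arbitrary sizes. In short, I have to write an input array (double *a) into two different parts of a larger matrix (double *tab); you will notice the different offsets. I store the data in row-major format, so I use this macro for indexing: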

#define IDX2L(i,j,ld) (((i)*(ld))+(j)) // 0-based index + row-major format

Here is the simple code I use.

__global__ void cuda_a_Coalesced(double *tab, int tab_rows, int a_rows, double *a) 
{ 
    __shared__ double tile[16*(16+1)]; 
    int col = threadIdx.x + blockIdx.x * blockDim.x; 
    int row = threadIdx.y + blockIdx.y * blockDim.y; 

    int col_2, row_2; 
    int a_cols=tab_rows-a_rows; // tab_rows-a_rows is the number of columns of a 
    int tab_cols=2*tab_rows+2; // 2*tab_rows+2 is the number of columns of tab 

    if((col<a_cols) && (row<a_rows)) 
    { 
     // Load the data into shared mem 
     tile[threadIdx.x+threadIdx.y*(16+1)]=a[IDX2L(row,col,a_cols)]; 

     // Normal copy (+ offsets) 
     tab[IDX2L(row,col+tab_rows+a_rows,tab_cols)]= tile[threadIdx.x+threadIdx.y*(16+1)]; 

     // New idx 
     col_2 = blockIdx.y * blockDim.y + threadIdx.x; 
     row_2 = blockIdx.x * blockDim.x + threadIdx.y; 
    } 
    __syncthreads(); 

    if((row_2<a_cols) && (col_2<a_rows)) 
     // Transpose (+ other offsets) 
     tab[IDX2L(row_2+a_rows,col_2+tab_rows,tab_cols)]= -tile[threadIdx.y+threadIdx.x*(16+1)]; 

}

The launch parameters are as follows:

b1=(int)ceil((float)a_cols/16); 
b2=(int)ceil((float)a_rows/16); 
dim3 bck(b1,b2); dim3 th(16,16); 

cuda_a_Coalesced<<<bck,th>>>(tab,tab_rows,a_rows,a);
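As a side note on the launch parameters, the block counts can be computed with integer arithmetic only, which avoids the round trip through float (a float cannot represent every int exactly once the value exceeds 2^24). This is a minimal sketch; the helper name blocks_for is hypothetical and does not appear in the original code.

```cpp
#include <cassert>

// Hypothetical helper (not in the original post): number of blocks of
// width `block` needed to cover `n` elements, using integer arithmetic only.
inline int blocks_for(int n, int block)
{
    assert(n >= 0 && block > 0);
    // Rounds up: a partially filled last block still counts as one block.
    return (n + block - 1) / block;
}
```

With it, the grid above could be declared as dim3 bck(blocks_for(a_cols, 16), blocks_for(a_rows, 16)); the kernel call itself stays unchanged.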

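The plain copy always works, regardless of size. The transposed copy, however, only works for sizes that are integer multiples of the block size (as in the SDK sample). When the transposed copy fails, some parts of the result are correct and others are not, in a way I cannot predict or trace. Note that the idea is to swap the indices in shared memory so that the transposed data can be written to the output matrix in coalesced form (because of the row-major format).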

Can anyone tell me why the code only works for those sizes?

Is there a trick to handle this situation?

Source

2012-11-03 engineer H

Are you sure the code is correct? At first glance it looks like col_2 and row_2 may be undefined for some threads – 2012-11-10 08:56:18

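Before this version, I tried putting the operations involving col_2 and row_2 inside the first if statement, obviously after the __syncthreads(), but I ran into the same problem. Is that what you are referring to? –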

No, I mean for the condition '(col 2012-11-13 14:26:31

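The problem comes from threads in which col_2 and row_2 are left uninitialized, because their values are assigned inside an if() statement that not every thread enters.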
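To see why only sizes that are multiples of the block size are safe, the guards can be replayed on the host. The sketch below is not part of the original post: the function name count_unassigned_readers is hypothetical, and it assumes the same 16x16 blocks and the same rounded-up grid as the question. It counts the threads that fail the first guard but would pass the second one if col_2 and row_2 held their intended values.

```cpp
#include <cassert>

// Host-side sketch (not part of the original post): walk over every
// (block, thread) pair the launch would create and count the threads that
// skip the first guard although their intended col_2/row_2 satisfy the
// second one. In the original kernel these threads reach the second if()
// with col_2 and row_2 never assigned.
int count_unassigned_readers(int a_rows, int a_cols, int block = 16)
{
    assert(a_rows > 0 && a_cols > 0 && block > 0);
    const int grid_x = (a_cols + block - 1) / block;  // blocks along columns
    const int grid_y = (a_rows + block - 1) / block;  // blocks along rows
    int count = 0;
    for (int by = 0; by < grid_y; ++by)
        for (int bx = 0; bx < grid_x; ++bx)
            for (int ty = 0; ty < block; ++ty)
                for (int tx = 0; tx < block; ++tx) {
                    const int col = tx + bx * block;
                    const int row = ty + by * block;
                    const bool first_guard = col < a_cols && row < a_rows;
                    // Intended indices for the transposed write.
                    const int col_2 = by * block + tx;
                    const int row_2 = bx * block + ty;
                    const bool second_guard = row_2 < a_cols && col_2 < a_rows;
                    if (!first_guard && second_guard)
                        ++count;
                }
    return count;
}
```

When both dimensions are multiples of 16, every thread passes the first guard and the count is 0, which matches the sizes for which the original kernel works. For a 20x20 input the count is 96: those threads should perform a transposed write but have no defined indices to do it with.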

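To fix this, we can assign col_2 and row_2 their values when the variables are declared, and remove the identically named computation from inside the if() mentioned above: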

__shared__ double tile[16*(16+1)]; 

int col = threadIdx.x + blockIdx.x * blockDim.x; 
int row = threadIdx.y + blockIdx.y * blockDim.y; 

int col_2 = blockIdx.y * blockDim.y + threadIdx.x; 
int row_2 = blockIdx.x * blockDim.x + threadIdx.y; 

int a_cols=tab_rows-a_rows; 
int tab_cols=2*tab_rows+2;

The rest of the code then looks like this:

if((col<a_cols) && (row<a_rows)) 
{ 
    // Load the data into shared mem 
    tile[threadIdx.x+threadIdx.y*(16+1)]=a[IDX2L(row,col,a_cols)]; 
    // Normal copy (+ offsets) 
    tab[IDX2L(row,col+tab_rows+a_rows,tab_cols)]= tile[threadIdx.x+threadIdx.y*(16+1)]; 
} 
__syncthreads(); 

if((row_2<a_cols) && (col_2<a_rows)) 
    // Transpose (+ other offsets) 
    tab[IDX2L(row_2+a_rows,col_2+tab_rows,tab_cols)]= -tile[threadIdx.y+threadIdx.x*(16+1)];
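To check the corrected kernel on sizes that are not multiples of 16, its output can be compared against a plain host loop. The following is a sketch under stated assumptions, not code from the original post: the function name reference_tab is hypothetical, tab is assumed to hold tab_rows x (2*tab_rows+2) elements in row-major order, and entries the kernel does not write are left at zero here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host reference (hypothetical, not in the original post) for what the
// kernel is meant to write: a plain copy of `a` at one offset and the
// negated transpose of `a` at another, both in row-major `tab`.
std::vector<double> reference_tab(const std::vector<double>& a,
                                  int tab_rows, int a_rows)
{
    const int a_cols = tab_rows - a_rows;      // same definition as the kernel
    const int tab_cols = 2 * tab_rows + 2;     // same definition as the kernel
    assert(a_rows > 0 && a_cols > 0);
    assert(a.size() == static_cast<std::size_t>(a_rows) * a_cols);

    // Assumption: tab has tab_rows rows; untouched entries stay 0.
    std::vector<double> tab(static_cast<std::size_t>(tab_rows) * tab_cols, 0.0);
    for (int i = 0; i < a_rows; ++i) {
        for (int j = 0; j < a_cols; ++j) {
            const double v = a[i * a_cols + j];
            // Normal copy (+ offsets): tab(i, j + tab_rows + a_rows) = a(i, j)
            tab[i * tab_cols + (j + tab_rows + a_rows)] = v;
            // Transpose (+ other offsets): tab(j + a_rows, i + tab_rows) = -a(i, j)
            tab[(j + a_rows) * tab_cols + (i + tab_rows)] = -v;
        }
    }
    return tab;
}
```

Copying the device result back with cudaMemcpy and comparing it entry by entry against this vector, for the positions the kernel writes, shows directly which transposed entries are wrong for a given size.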

Source

2012-11-15 13:27:10
