CUDA: variable declarations in threads - is there overlap?

First off, my question may be poorly worded; I think it will be clearer if I use an example from NVIDIA's CUDA C Programming Guide.

Section 3.2.3 (Shared Memory) gives the following code for matrix multiplication using shared memory. I hope it is OK to copy it here.

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}

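On line 7, Matrix Csub = GetSubMatrix(C, blockRow, blockCol): will every thread execute that statement? Wouldn't that defeat the whole point of using shared memory to reduce the number of global memory accesses? I have the impression that I am missing something fundamental here.

Also, surely there is a better way to word this question. I just don't know how!

Thanks,

Zakiir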

Source

2014-03-06 zedjay72

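Threads in a warp execute the same instruction at the same time (or sit idle), so yes, every thread definitely enters GetSubMatrix. When a block has to copy data, each thread takes a few items: if there are N threads and 3N items to copy, each thread copies 3.

For example, if I were copying a vector, I might do the following: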

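// Wrapped in a device function (the name BlockCopy is illustrative) so
// that the ??? placeholders become parameters.
__device__ void BlockCopy(float* to, const float* from, int num)
{
    // Linear index of this thread within its block
    int thread = threadIdx.x + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    int num_threads = blockDim.x * blockDim.y * blockDim.z;
    // Start at the linear index, not threadIdx.x, so that 2D and 3D
    // blocks do not copy the same elements several times
    for (int i = thread; i < num; i += num_threads) {
        to[i] = from[i];
    }
}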

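Each thread takes part in the copy a little at a time. By the way: if you can arrange for the threads to copy a run of consecutive elements together, so that neighbouring threads touch neighbouring addresses (coalesced access), you get extra speed in the copy.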
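To make the remark about consecutive elements concrete, here is a minimal sketch contrasting two ways of splitting a copy across one block. The kernel names, the Check and RunCopy helpers, and the block size of 256 are all assumptions made for illustration. Both kernels produce the same result; they differ only in which addresses neighbouring threads touch at each step, which is what determines whether the hardware can coalesce the accesses.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Interleaved: at step k, thread t touches element t + k * blockDim.x,
// so neighbouring threads access neighbouring addresses (coalesced).
__global__ void CopyInterleaved(float* to, const float* from, int num)
{
    for (int i = threadIdx.x; i < num; i += blockDim.x)
        to[i] = from[i];
}

// Chunked: each thread walks its own contiguous range, so at every step
// neighbouring threads are `chunk` elements apart (poorly coalesced).
__global__ void CopyChunked(float* to, const float* from, int num)
{
    int chunk = (num + blockDim.x - 1) / blockDim.x;  // ceil(num / threads)
    int begin = threadIdx.x * chunk;
    int end = min(begin + chunk, num);
    for (int i = begin; i < end; ++i)
        to[i] = from[i];
}

// Hypothetical helper: abort with a message on any CUDA runtime error.
static void Check(cudaError_t err)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical host helper: copies src through the device with one block
// of 256 threads, using the selected kernel, and returns the result.
std::vector<float> RunCopy(bool interleaved, const std::vector<float>& src)
{
    assert(!src.empty());  // the sketch assumes a non-empty input
    int num = static_cast<int>(src.size());
    size_t bytes = src.size() * sizeof(float);
    std::vector<float> dst(src.size(), 0.0f);

    float* d_from = nullptr;
    float* d_to = nullptr;
    Check(cudaMalloc(&d_from, bytes));
    Check(cudaMalloc(&d_to, bytes));
    Check(cudaMemcpy(d_from, src.data(), bytes, cudaMemcpyHostToDevice));

    if (interleaved)
        CopyInterleaved<<<1, 256>>>(d_to, d_from, num);
    else
        CopyChunked<<<1, 256>>>(d_to, d_from, num);
    Check(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    Check(cudaMemcpy(dst.data(), d_to, bytes, cudaMemcpyDeviceToHost));
    Check(cudaFree(d_from));
    Check(cudaFree(d_to));
    return dst;
}
```

Calling RunCopy(true, data) and RunCopy(false, data) should return identical vectors; timing the two kernels on a large input is one way to see the effect of the access pattern on a given GPU.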

Source

2014-03-06 02:52:21 Cramer

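I understand that each thread loads one element of each of the sub-matrices of A and B, and that once that is done, all threads in the block can read each other's shared memory to do the partial matrix multiplication. I am still confused about why each thread needs to create its own sub-matrix of C, since each thread only writes one element. – zedjay72

Do you have access to GetSubMatrix? It probably just copies an address rather than the elements themselves. If so, each thread gets its own copy of the Matrix struct/class. Having every thread do this saves a '__syncthreads()' and costs no more than having a single thread do it (the threads have to run the same instruction at the same time anyway). – Cramer

GetSubMatrix returns a Matrix, Csub, which is a part of the final product matrix; Matrix is a struct defined in the code. – zedjay72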
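The thread does not show the helper definitions, so the following is only a sketch. It assumes Matrix and GetSubMatrix look like the versions in the same section of the CUDA C Programming Guide (reproduced from memory, so the exact fields are an assumption), and it adds a hypothetical FillKernel, FillTiles and Check for demonstration. Under that assumption, Csub is a small descriptor (three ints and a pointer) computed with index arithmetic; building it reads no matrix element from global memory, and each thread still writes exactly one element through it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Row-major matrix descriptor: M(row, col) = elements[row * stride + col]
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Builds a descriptor for the BLOCK_SIZE x BLOCK_SIZE tile at tile
// coordinates (row, col). Only a pointer offset is computed; no element
// of A is read or copied.
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}

// Writes one element through a descriptor.
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Hypothetical demo kernel: every thread builds its own Csub descriptor
// and writes a single element (the tile number) through it.
__global__ void FillKernel(Matrix C)
{
    Matrix Csub = GetSubMatrix(C, blockIdx.y, blockIdx.x);
    float value = static_cast<float>(blockIdx.y * gridDim.x + blockIdx.x);
    SetElement(Csub, threadIdx.y, threadIdx.x, value);
}

// Hypothetical helper: abort with a message on any CUDA runtime error.
static void Check(cudaError_t err)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical host helper: fills a matrix of tilesX x tilesY tiles on the
// device and returns it in row-major order.
std::vector<float> FillTiles(int tilesX, int tilesY)
{
    Matrix C;
    C.width = C.stride = tilesX * BLOCK_SIZE;
    C.height = tilesY * BLOCK_SIZE;
    size_t count = static_cast<size_t>(C.width) * C.height;
    Check(cudaMalloc(&C.elements, count * sizeof(float)));

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(tilesX, tilesY);
    FillKernel<<<grid, block>>>(C);  // Matrix is passed by value
    Check(cudaGetLastError());

    std::vector<float> host(count);
    Check(cudaMemcpy(host.data(), C.elements, count * sizeof(float),
                     cudaMemcpyDeviceToHost));
    Check(cudaFree(C.elements));
    return host;
}
```

For example, FillTiles(2, 3) returns a 48 x 32 matrix in which every element holds the number of the tile it belongs to, which shows that the per-thread descriptors all point into the same underlying storage.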
