Contiguous memory allocation on the GPU

Does cudaMalloc allocate contiguous chunks of memory (i.e. physical bytes that sit next to each other)?

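I have a piece of CUDA code in which 32 threads copy 128 bytes from global device memory to shared memory. I am trying to find a way to guarantee that this transfer completes in a single 128-byte memory transaction. If cudaMalloc allocates contiguous memory blocks, this is easy to achieve.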

Here is the code:

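#include <iostream>
#include <cuda_runtime.h>

#define SIZE 32        // number of 4-byte elements staged in shared memory (128 bytes)
#define NUMTHREADS 32  // one warp; each thread copies one element

// Each thread copies one 4-byte word from global memory into shared memory.
// Nothing reads bits[] afterwards here, so the compiler may drop the copy;
// in the full program the staged data is processed further.
__global__ void copy(const unsigned int* memPointer) {
    extern __shared__ unsigned int bits[];
    int tid = threadIdx.x;

    bits[tid] = memPointer[tid];
}

int main() {
    unsigned int inputData[SIZE];
    unsigned int* storedData = NULL;
    for (int i = 0; i < SIZE; i++) {
        inputData[i] = i;
    }

    cudaError_t err = cudaMalloc((void**)&storedData, sizeof(unsigned int) * SIZE);
    if (err != cudaSuccess) {
        std::cout << "Failed to allocate memory: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }

    err = cudaMemcpy(storedData, inputData, sizeof(unsigned int) * SIZE, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        std::cout << "Failed to copy: " << cudaGetErrorString(err) << std::endl;
        cudaFree(storedData);
        return 1;
    }

    // Third launch parameter: bytes of dynamic shared memory backing bits[].
    copy<<<1, NUMTHREADS, SIZE * sizeof(unsigned int)>>>(storedData);
    err = cudaGetLastError();            // reports launch failures
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();   // reports failures during execution
    }
    if (err != cudaSuccess) {
        std::cout << "Kernel failed: " << cudaGetErrorString(err) << std::endl;
    }

    err = cudaFree(storedData);
    if (err != cudaSuccess) {
        std::cout << "Error freeing memory storedData: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }
    return 0;
}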

2012-07-02 gmemon

What purpose is this kernel supposed to serve? – talonmies

It is part of a larger piece of code in which I perform some operations on the data. I am trying to optimize individual parts of that code. – gmemon

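If the 128-byte block is 128-byte aligned, this will be done in one transaction. NVIDIA GPUs have an MMU that is separate from the CPU's MMU. All GPU memory operations go through the GPU virtual address space. There is no guarantee that blocks larger than a cache line are physically contiguous. –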
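The alignment condition in the comment above can be checked directly on the pointer that cudaMalloc returns. The following is a minimal sketch, not part of the original thread: `isAligned` is a hypothetical helper introduced here for illustration. The CUDA C Programming Guide states that addresses returned by the runtime's allocation routines are aligned to at least 256 bytes, so the base pointer is expected to pass, while a pointer offset by a few elements need not.

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): true if p is a multiple of `alignment` bytes.
static bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

int main() {
    unsigned int* storedData = nullptr;
    cudaError_t err = cudaMalloc((void**)&storedData, 32 * sizeof(unsigned int));
    if (err != cudaSuccess) {
        std::printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Base address of the allocation: a warp reading 32 consecutive 4-byte
    // words from here touches a single 128-byte segment if this is aligned.
    std::printf("base pointer 128-byte aligned: %s\n",
                isAligned(storedData, 128) ? "yes" : "no");

    // Starting 4 elements (16 bytes) into the buffer shifts the address, so a
    // 128-byte read from this pointer can straddle two 128-byte segments.
    std::printf("base + 4 elements 128-byte aligned: %s\n",
                isAligned(storedData + 4, 128) ? "yes" : "no");

    cudaFree(storedData);
    return 0;
}
```

The check only concerns virtual addresses, which is what the comment says matters for the transaction; it says nothing about physical contiguity.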

Yes, cudaMalloc allocates contiguous chunks of memory. The "Matrix Transpose" example in the SDK (http://developer.nvidia.com/cuda-cc-sdk-code-samples) has a kernel called "copySharedMem" that does almost exactly what you describe.
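To make the idea concrete, here is a small sketch of a copy that stages data in shared memory and writes it back out so the result can be verified on the host. It is written for this thread and is not the SDK's copySharedMem kernel; the names `copyViaShared`, `TILE`, and `runCopy` are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 32  // 32 threads x 4 bytes = one 128-byte tile per block

// Illustrative kernel (not the SDK's copySharedMem): each block stages one
// 128-byte tile in shared memory and then writes it to the output buffer.
__global__ void copyViaShared(const unsigned int* in, unsigned int* out, int n) {
    __shared__ unsigned int tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    if (i < n) tile[threadIdx.x] = in[i];   // consecutive threads read consecutive words
    __syncthreads();                        // reached by every thread in the block
    if (i < n) out[i] = tile[threadIdx.x];
}

// Hypothetical driver: returns the number of mismatched elements, or -1 on a CUDA error.
int runCopy() {
    const int n = 4 * TILE;
    unsigned int hostIn[n], hostOut[n];
    for (int i = 0; i < n; i++) { hostIn[i] = i; hostOut[i] = 0; }

    unsigned int* devIn = nullptr;
    unsigned int* devOut = nullptr;
    if (cudaMalloc((void**)&devIn, sizeof(hostIn)) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&devOut, sizeof(hostOut)) != cudaSuccess) {
        cudaFree(devIn);
        return -1;
    }

    int mismatches = -1;
    if (cudaMemcpy(devIn, hostIn, sizeof(hostIn), cudaMemcpyHostToDevice) == cudaSuccess) {
        copyViaShared<<<n / TILE, TILE>>>(devIn, devOut, n);
        // The blocking copy back waits for the kernel and reports its errors.
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(hostOut, devOut, sizeof(hostOut), cudaMemcpyDeviceToHost) == cudaSuccess) {
            mismatches = 0;
            for (int i = 0; i < n; i++) {
                if (hostOut[i] != hostIn[i]) mismatches++;
            }
        }
    }

    cudaFree(devIn);
    cudaFree(devOut);
    return mismatches;
}

int main() {
    int mismatches = runCopy();
    if (mismatches < 0) {
        std::printf("CUDA error during copy\n");
        return 1;
    }
    std::printf("mismatched elements: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Because each block starts at a multiple of `TILE` elements from the cudaMalloc base pointer, every tile begins on a 128-byte boundary in the virtual address space, which is the case the question is trying to guarantee.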

2012-07-02 17:11:19 azaghal
