CUDA: memory allocation inside a kernel

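I want to generate some decision trees in CUDA. Below is pseudocode for generating the decision tree (the code is very primitive; it is only meant to show what I have written):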

class Node 
{ 
public : 
    Node* father; 
    Node** sons; 
    int countSons; 

    __device__ __host__ Node(Node* father) 
    { 
     this->father = father; 
     sons = NULL; 
     countSons = 0; //no sons until GenerateSons fills them in 
    } 
}; 

__global__ void GenerateSons(Node** fathers, int* countFathers, Node** sons, int* countSons) 
{ 
    int Thread_Index = (blockDim.x * blockIdx.x) + threadIdx.x; 

    if(Thread_Index < *(countFathers)) 
    { 
     Node* Thread_Father = fathers[Thread_Index]; 
     Node** Thread_Sons; 
     int Thread_countSons; 
     //Now we are creating new sons for our Thread_Father 
     /* 
     * Generating Thread_Sons for Thread_Father; 
     */ 
     Thread_Father->sons = Thread_Sons; 
     Thread_Father->countSons = Thread_countSons; 

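     //Wait for others 
      /*I added __syncthreads here because I want to count all sons generated 
      by the threads. __syncthreads only synchronizes one block, so the shared 
      counter is updated atomically to avoid a data race 
      */ 
      atomicAdd(countSons, Thread_countSons); 
     __syncthreads(); 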

     //Get all generated sons from whole Block and copy to sons 

     if(threadIdx.x == 0) 
     { 
      sons = new Node*[*(countSons)]; 
     } 
     /*I added __syncthreads here because the array for sons must be allocated 
      before any thread writes to it 
      */ 
     __syncthreads(); 

     int Thread_Offset; 
     /* 
     * Get correct offset for actual thread 
     */ 
     for(int i = 0; i < Thread_countSons; i++) 
      sons[Thread_Offset + i] = Thread_Sons[i]; 
    } 
} 

int main() 
{ 
    Node* root = new Node(NULL); //the root has no father 
    //transfer root to kernel by cudaMalloc and cudaMemcpy 
    Node* root_d = root->transfer(); 

    Node** fathers_d; 
    /* 
    * prepare an array holding the root as the only father and copy it to the device 
    */ 

    int *countFathers, *countSons; 
    /* 
    * prepare device int pointers for the kernel and set countFathers to 1 
    */ 

    for(int i = 0; i < LevelTree; i++) 
    { 
     Node** sons = NULL; 
     int threadsPerBlock = 256; 
     int blocksPerGrid = (*(countFathers)/*get count of fathers*/ + threadsPerBlock - 1)/threadsPerBlock; 
     GenerateSons<<<blocksPerGrid , threadsPerBlock >>>(fathers_d, countFathers, sons, countSons); 
     //Wait for end of kernel call 
     cudaDeviceSynchronize(); 

     //replace 
     fathers_d = sons; 
     countFathers = countSons; 
    } 
}
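The "get correct offset" step in the pseudocode above can be sketched with `atomicAdd`, which returns the counter's old value and so hands each thread a private, non-overlapping range. This is only a minimal sketch under my own assumptions, not the code from the question: `reserveSlots` and `reserveOnDevice` are hypothetical names, the per-thread counts are plain ints, and the input is assumed to be non-empty.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread reserves counts[i] consecutive slots in a shared output range.
// atomicAdd returns the value before the addition, which is the start offset.
__global__ void reserveSlots(const int* counts, int n, int* total, int* offsets)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        offsets[i] = atomicAdd(total, counts[i]);
}

// Hypothetical host wrapper: returns the total number of reserved slots,
// or -1 if a CUDA call failed. Assumes counts is not empty.
int reserveOnDevice(const std::vector<int>& counts, std::vector<int>& offsets)
{
    int n = (int)counts.size();
    int *counts_d = NULL, *offsets_d = NULL, *total_d = NULL;
    int total = -1;
    offsets.assign(n, 0);

    cudaMalloc(&counts_d, n * sizeof(int));
    cudaMalloc(&offsets_d, n * sizeof(int));
    cudaMalloc(&total_d, sizeof(int));
    cudaMemcpy(counts_d, counts.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(total_d, 0, sizeof(int));

    int threads = 256;
    reserveSlots<<<(n + threads - 1) / threads, threads>>>(counts_d, n, total_d, offsets_d);

    // These blocking copies also wait for the kernel to finish.
    cudaMemcpy(offsets.data(), offsets_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(&total, total_d, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(counts_d);
    cudaFree(offsets_d);
    cudaFree(total_d);
    return err == cudaSuccess ? total : -1;
}
```

The order in which threads receive their ranges is not deterministic, but the ranges never overlap. Once the total is known on the host, the output array can be allocated with `cudaMalloc` and filled by a second kernel, which avoids allocating it from inside the counting kernel.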

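So this works for 5 levels (I am generating a decision tree for checkers), but at level 6 I get an error. Somewhere in the kernel code malloc returns NULL, which tells me that some threads in the block cannot allocate any more memory. I am fairly sure that I clean up every object I no longer need at the end of each kernel call. I think there is something about memory usage in CUDA that I do not understand. If I create objects in a thread's local memory and the kernel finishes, then at the start of the second kernel launch I can still see the nodes from the first launch. So my question is: where are the Node objects from the first kernel call stored? Are they stored in the thread's local memory? And if so, does every call of my kernel function reduce the local memory available to that thread?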

Sorry for my poor English if anything is unclear.

I am using a GT 555M with compute capability 2.1, CUDA SDK 5.0, and Visual Studio 2010 Premium with NSight 3.0.


2013-02-23 waskithebest

You are calling new inside the kernel and never calling delete. Since you are using __global__ void GenerateSons, I would bet that you are running out of memory on the device. – AlexLordThorsen 2013-02-23 10:41:45

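Well, my device has 2 GB of memory, and sizeof(Node) = 28. The first call generates 7 sons, the second 49, the next 379, and the last successful call 2769. So my device generated 3204 sons, which takes about 87 KB? – waskithebest 2013-02-23 11:19:44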

Hmm, I wonder whether new pulls its memory from shared memory. I will have to check the documentation. – AlexLordThorsen 2013-02-23 14:09:36
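On the point raised in these comments: as far as I know, memory obtained with new or malloc inside a kernel comes from a device heap, stays allocated after the kernel returns, and has to be released from device code rather than with cudaFree. The sketch below illustrates that lifetime with two launches. `allocLevel`, `freeLevel` and `allocThenFree` are hypothetical names, and a plain int array stands in for the Node arrays from the question.

```cuda
#include <cuda_runtime.h>

// First launch: one thread allocates from the device heap and publishes the pointer.
__global__ void allocLevel(int** slot, int count)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *slot = new int[count];   // still allocated after this kernel ends
}

// Later launch: the same pointer is still valid and is released in device code.
__global__ void freeLevel(int** slot)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        delete[] *slot;           // pairs with the new[] from the earlier launch
        *slot = NULL;
    }
}

// Hypothetical host driver: returns 0 on success, -1 on any CUDA error.
int allocThenFree(int count)
{
    int** slot = NULL;
    if (cudaMalloc(&slot, sizeof(int*)) != cudaSuccess) return -1;
    cudaMemset(slot, 0, sizeof(int*));

    allocLevel<<<1, 1>>>(slot, count);
    freeLevel<<<1, 1>>>(slot);

    cudaError_t err = cudaDeviceSynchronize();
    cudaFree(slot);               // frees only the slot, which came from cudaMalloc
    return err == cudaSuccess ? 0 : -1;
}
```

If a level's array of sons is never passed to a kernel like `freeLevel`, it keeps occupying the device heap for the rest of the program, which would match the behaviour described in the question.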

OK,

I found out that new and malloc called inside a kernel allocate global memory on the device. I also found this:

By default, CUDA creates an 8 MB heap.

CUDA Application Design and Development, page 128
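For a sense of scale, the node counts quoted in this thread can be compared with that 8 MB figure. The helper below is a hypothetical illustration that counts payload bytes only; it does not model per-allocation overhead, alignment, or the arrays of Node pointers.

```cuda
#include <cstddef>

// Default device heap size quoted above: 8 MB.
const size_t kDefaultHeapBytes = 8u * 1024u * 1024u;

// Payload bytes for `count` nodes of `nodeSize` bytes each.
__host__ __device__ inline size_t payloadBytes(size_t count, size_t nodeSize)
{
    return count * nodeSize;
}
```

With sizeof(Node) = 28, `payloadBytes(3204, 28)` is 89,712 bytes (about 87.6 KB) and `payloadBytes(22110, 28)` is 619,080 bytes, both far below `kDefaultHeapBytes`. That suggests the node payload alone does not exhaust the default heap and that other allocations, possibly ones that are never released, make up the rest. This is an inference from the arithmetic, not a measurement.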

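So I used cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024); to increase the device heap to 128 MB, and the program correctly generated level 6 of the tree (22110 sons), but I actually still have some memory leaks ... which I need to find.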
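A minimal sketch of that fix, with a check on the in-kernel allocation result, might look like the following. `allocProbe` and `probeHeap` are hypothetical names that are not part of my program, and the probe frees its memory right away so that it does not leak.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Every thread tries one device-heap allocation and reports a failure
// instead of dereferencing a NULL pointer.
__global__ void allocProbe(size_t bytesPerThread, int* failures)
{
    void* p = malloc(bytesPerThread);
    if (p == NULL)
        atomicAdd(failures, 1);
    else
        free(p);
}

// Sets the heap limit, runs the probe, and returns the number of failed
// allocations, or -1 if a CUDA call failed.
int probeHeap(size_t heapBytes, size_t bytesPerThread, int blocks, int threads)
{
    // The limit has to be set before the first kernel that uses malloc/new runs.
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes) != cudaSuccess)
        return -1;

    int* failures_d = NULL;
    if (cudaMalloc(&failures_d, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(failures_d, 0, sizeof(int));

    allocProbe<<<blocks, threads>>>(bytesPerThread, failures_d);

    int failures = -1;
    cudaError_t err = cudaMemcpy(&failures, failures_d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(failures_d);
    return err == cudaSuccess ? failures : -1;
}
```

A call such as `probeHeap(128u * 1024 * 1024, 64, 4, 256)` should report zero failures on a device with enough memory. A larger heap only postpones the problem if allocations are never freed, so the leak still has to be fixed.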


2013-02-24 17:38:40 waskithebest
