2012-06-03 65 views

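CUDA programming: passing nested structures to a kernel

I am new to CUDA C, and I am trying to parallelize the following piece of code from the slave_sort function, which, as you will see, is already parallelized with POSIX threads. I have the following structures: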

typedef struct prefix_node{ 
    long densities[MAX_RADIX]; 
    long ranks[MAX_RADIX]; 
    char pad[PAGE_SIZE]; 
}prefix_node; 

struct global_memory { 
    long Index;        /* process ID */ 
    struct prefix_node prefix_tree[2 * MAX_PROCESSORS]; 
} *global; 

void slave_sort(){ 
     . 
     . 
     . 
long *rank_me_mynum; 
struct prefix_node* n; 
struct prefix_node* r; 
struct prefix_node* l; 
     . 
     . 
MyNum = global->Index; 
global->Index++; 
n = &(global->prefix_tree[MyNum]); 
    for (i = 0; i < radix; i++) { 
     n->densities[i] = key_density[i]; 
     n->ranks[i] = rank_me_mynum[i]; 
    } 
    offset = MyNum; 
    level = number_of_processors >> 1; 
    base = number_of_processors; 
    while ((offset & 0x1) != 0) { 
     offset >>= 1; 
     r = n; 
     l = n - 1; 
     index = base + offset; 
     n = &(global->prefix_tree[index]); 
     if (offset != (level - 1)) { 
     for (i = 0; i < radix; i++) { 
      n->densities[i] = r->densities[i] + l->densities[i]; 
      n->ranks[i] = r->ranks[i] + l->ranks[i]; 
     } 
     } else { 
     for (i = 0; i < radix; i++) { 
      n->densities[i] = r->densities[i] + l->densities[i]; 
     } 
     } 
     base += level; 
     level >>= 1; 
} 

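MyNum is the index of the processor. Once the code is moved into a kernel, I want MyNum to be represented by blockIdx.x. The problem is that I am confused by the structs. I do not know how to pass them to the kernel. Can anyone help me?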

Is the following code correct?

__global__ void testkernel(prefix_node *prefix_tree, long *dev_rank_me_mynum, long *key_density,long radix) 

int i = threadIdx.x + blockIdx.x*blockDimx.x; 
prefix_node *n; 
prefix_node *l; 
prefix_node *r; 
long offset; 
    . 
    . 
    . 
n = &prefix_tree[blockIdx.x]; 
if((i%numthreads) == 0){ 
    for(int j=0; j<radix; j++){ 
     n->densities[j] = key_density[j + radix*blockIdx.x]; 
     n->ranks[i] = dev_rank_me_mynum[j + radix*blockIdx.x]; 
    } 
    . 
    . 
    . 
} 


int main(...){ 

    long *dev_rank_me_mynum; 
    long *key_density; 
    prefix_node *prefix_tree; 
    long radix = 1024; 

    cudaMalloc((void**)&dev_rank_me_mynum, radix*numblocks*sizeof(long)); 
    cudaMalloc((void**)&key_density, radix*numblocks*sizeof(long)); 
    cudaMalloc((void**)&prefix_tree, numblocks*sizeof(prefix_node)); 

    testkernel<<<numblocks,numthreads>>>(prefix_tree,dev_rank_me_mynum,key_density,radix); 
} 

Could you edit your question to show your CUDA code and explain exactly which part of it isn't working? As it stands, this isn't really a question..... – talonmies

Answer


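The host API code you posted in your edit looks fine. The prefix_node structure contains only statically declared arrays, so all that is needed is a single cudaMalloc call to allocate memory for the kernel to work on. Your method of passing prefix_tree to the kernel is also fine.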
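To illustrate why the statically sized arrays make this easy, here is a minimal sketch of a host-to-device round trip for an array of prefix_node. The values of MAX_RADIX and PAGE_SIZE, the touch_nodes kernel, and the roundtrip_nodes helper are all assumptions made for this example and do not come from the original post.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// ASSUMED sizes: the original post does not state these values.
#define MAX_RADIX 1024
#define PAGE_SIZE 4096

// No pointers inside, so the struct is one flat block of bytes.
typedef struct prefix_node {
    long densities[MAX_RADIX];
    long ranks[MAX_RADIX];
    char pad[PAGE_SIZE];
} prefix_node;

// Hypothetical kernel: thread 0 of each block adds the block index
// to the first density of the node that belongs to that block.
__global__ void touch_nodes(prefix_node *tree)
{
    if (threadIdx.x == 0) {
        tree[blockIdx.x].densities[0] += blockIdx.x;
    }
}

// Hypothetical helper: copies `count` nodes to the device, runs the
// kernel with one block per node, and copies the nodes back.
// Returns 0 on success and 1 on any CUDA error.
int roundtrip_nodes(prefix_node *host_tree, int count)
{
    prefix_node *dev_tree = NULL;
    size_t bytes = (size_t)count * sizeof(prefix_node);

    // A single allocation covers every node and every array inside it.
    if (cudaMalloc((void **)&dev_tree, bytes) != cudaSuccess) return 1;

    cudaError_t err = cudaMemcpy(dev_tree, host_tree, bytes,
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        touch_nodes<<<count, 32>>>(dev_tree);
        err = cudaGetLastError();   // catches launch configuration errors
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(host_tree, dev_tree, bytes,
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev_tree);
    return err == cudaSuccess ? 0 : 1;
}
```

If the structure held pointers to separately allocated arrays instead, each of those arrays would need its own device allocation and the pointers would have to be patched, which is the situation this layout avoids.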

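The kernel code, although incomplete and containing several obvious typos, is another matter. It looks like your intention is to have only a single thread per block operate on one "node" of the prefix_tree. That will be very inefficient and will use only a small fraction of the GPU's total capacity. For example, why do this: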

prefix_node *n = &prefix_tree[blockIdx.x]; 
if((i%numthreads) == 0){ 
    for(int j=0; j<radix; j++){ 
     n->densities[j] = key_density[j + radix*blockIdx.x]; 
     n->ranks[j] = dev_rank_me_mynum[j + radix*blockIdx.x]; 
    } 
    . 
    . 
    . 
} 

when you could do this:

prefix_node *n = &prefix_tree[blockIdx.x]; 
for(int j=threadIdx.x; j<radix; j+=blockDim.x){ 
    n->densities[j] = key_density[j + radix*blockIdx.x]; 
    n->ranks[j] = dev_rank_me_mynum[j + radix*blockIdx.x]; 
} 

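which coalesces the memory reads and uses as many threads as you choose to run in the block, rather than just one, and should be many times faster as a result. So perhaps you should rethink your strategy of trying to directly translate the serial C code you posted into a kernel...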
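For completeness, the block-stride loop can be placed in a full kernel with a host wrapper, roughly as sketched below. This is only an illustration under assumptions: the MAX_RADIX and PAGE_SIZE values, the kernel name init_nodes, and the wrapper init_nodes_on_device are hypothetical, and only the initialization step is covered, not the tree-combining loop of slave_sort.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// ASSUMED sizes: the original post does not state these values.
#define MAX_RADIX 1024
#define PAGE_SIZE 4096

typedef struct prefix_node {
    long densities[MAX_RADIX];
    long ranks[MAX_RADIX];
    char pad[PAGE_SIZE];
} prefix_node;

// One block per node. Thread t handles j = t, t + blockDim.x, ...,
// so neighbouring threads read neighbouring elements (coalesced).
__global__ void init_nodes(prefix_node *prefix_tree, const long *rank_me,
                           const long *key_density, long radix)
{
    prefix_node *n = &prefix_tree[blockIdx.x];
    for (long j = threadIdx.x; j < radix; j += blockDim.x) {
        n->densities[j] = key_density[j + radix * blockIdx.x];
        n->ranks[j]     = rank_me[j + radix * blockIdx.x];
    }
}

// Hypothetical host wrapper. The input arrays hold numblocks * radix
// values each. Returns 0 on success and 1 on any error.
int init_nodes_on_device(prefix_node *host_tree, const long *host_rank,
                         const long *host_density, int numblocks,
                         int numthreads, long radix)
{
    if (radix > MAX_RADIX) return 1;   // would overflow the arrays

    prefix_node *dev_tree = NULL;
    long *dev_rank = NULL, *dev_density = NULL;
    size_t in_bytes   = (size_t)numblocks * radix * sizeof(long);
    size_t tree_bytes = (size_t)numblocks * sizeof(prefix_node);

    cudaError_t err = cudaMalloc((void **)&dev_tree, tree_bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dev_rank, in_bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dev_density, in_bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(dev_rank, host_rank, in_bytes,
                         cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dev_density, host_density, in_bytes,
                         cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        init_nodes<<<numblocks, numthreads>>>(dev_tree, dev_rank,
                                              dev_density, radix);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)   // blocking copy, waits for the kernel
        err = cudaMemcpy(host_tree, dev_tree, tree_bytes,
                         cudaMemcpyDeviceToHost);

    cudaFree(dev_tree);       // cudaFree(NULL) is a harmless no-op
    cudaFree(dev_rank);
    cudaFree(dev_density);
    return err == cudaSuccess ? 0 : 1;
}
```

Because the loop strides by blockDim.x, the same kernel works for any block size, including block sizes that do not divide radix evenly.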