cuda內核中用於維護優先級隊列的流壓縮

我正在尋找我的cuda程序的優化策略。在我內核的for循環內的每次迭代中，每個線程都會產生一個分數。我保持分數的共享優先級隊列以保持每塊的top-k。請參閱下面的僞代碼：cuda內核中用於維護優先級隊列的流壓縮

__global__ gpuCompute(... arguments) 
{ 
    __shared__ myPriorityQueue[k]; //To maintain top k scores (k < #threads in this block) 
    __shared__ scoresToInsertInQueue[#threadsInBlock]; 
    __shared__ counter; 
    for(...)  //About 1-100 million iterations 
    { 
     int score = calculate_score(...); 
     if(score > Minimum element of P. Queue && ...) 
     { 
      ATOMIC Operation : localCounter = counter++; 
      scoresToInsertInQueue[localCounter] = score; 
     } 
     __syncthreads(); 
     //Merge scores from scoresToInsertInQueue to myPriorityQueue 
     while(counter>0) 
     { 
      //Parallel insertion of scoresToInsertInQueue[counter] to myPriorityQueue using the participation of k threads in this block 
      counter--; 
      __syncthreads(); 
     } 
     __syncthreads(); 
    } 
}

希望上面的代碼對你們有意義。現在，我正在尋找一種方法來刪除原子操作開銷s.t.每個線程根據值保存'1'或'0'應該進入優先隊列。我想知道在內核中是否有任何流式壓縮的實現，以便我可以將'1000000000100000000'減少到'11000000000000000000'緩衝區（或知道'1'的索引），並最後在隊列中插入對應'1'的分數。
請注意，在這種情況下，1將非常稀疏。

來源

2014-02-10 Chirag Jain

[thrust]（https://github.com/thrust/thrust/wiki/Quick-Start-Guide）具有[流式壓縮功能]（http://thrust.github.io/doc/group__stream__compaction.html ），但它似乎需要你把你的內核分解成幾塊。 –

@Robert：是的，我讀到了這個，這個解決方案適用於這個[問題]（https://devtalk.nvidia.com/default/topic/453910/?comment=3228624）。將其分解爲多個內核將導致性能非常糟糕。 –

如果這些非常稀疏，那麼atomic方法可能是最快的。然而，我在這裏描述的方法將會具有更可預測和有限的最差情況表現。

爲1和0的在你的決定陣列混的好，它可能是更快地使用並行掃描或prefix-sum建立一個插入點指數排列出決定數組：

假設我有一個閾值決定選擇> 30分進入隊列。我的數據可能是這樣的：

scores:  30 32 28 77 55 12 19 
score > 30: 0 1 0 1 1 0 0 
insert_pt: 0 0 1 1 2 3 3 (an "exclusive prefix sum")

然後，每個線程進行存儲的選擇如下：

if (score[threadIdx.x] > 30) temp[threadIdx.x] = 1; 
else temp[threadIdx.x] = 0; 
__syncthreads(); 
// perform exclusive scan on temp array into insert_pt array 
__syncthreads(); 
if (temp[threadIdx.x] == 1) 
    myPriorityQueue[insert_pt[threadIdx.x]] = score[threadIdx.x];

CUB有一個快速的並行前綴掃描。

來源

2014-02-10 19:51:39

cuda內核中用於維護優先級隊列的流壓縮

回答

相關問題