OpenCL/CUDA: two-stage reduction algorithm

A large array can be reduced by calling the `reduce()` kernel multiple times, but the code below does it in only two stages, as documented here:

However, I cannot understand this two-stage reduction algorithm. Can someone give a simpler explanation?
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of the input vector, striding by the
  // total number of work-items; this gives each work-item a private
  // minimum (INFINITY is the identity element for min).
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }

  // Perform parallel (tree) reduction in local memory: each iteration
  // halves the number of active work-items in the group.
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  // One partial result per work-group; a second launch of this same
  // kernel over these partials produces the final value.
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}
The same approach can also be implemented straightforwardly in CUDA.
This parallel reduction code is a straightforward OpenCL port of the canonical CUDA reduction (minus a couple of optimizations that cannot be done in OpenCL). It was written by [Mark Harris](http://stackoverflow.com/users/749748/harrism) of NVIDIA. You can find a very instructive white paper in the CUDA SDK [reduction sample](http://developer.nvidia.com/cuda-cc-sdk-code-samples#reduction). After reading the white paper, edit your question to explain what you do not understand, and someone may be able to help further. – talonmies
Thanks for pointing me in the right direction. What I find hard to understand is the difference between the two-stage and the multi-stage reduction kernels, as described [here](http://developer.amd.com/Membership/Print.aspx?ArticleID=221&web=http://developer.amd.com/文檔/文章) – gpuguy