OpenCL memory optimization - nearest neighbor

I am writing a program in OpenCL that takes two arrays of points and computes the nearest neighbor for each point.

I have two versions of this program: one computes the distance in 4 dimensions, the other in 6. They are below:

4 dimensions:

kernel void BruteForce(
    global read_only float4* m, 
    global float4* y, 
    global write_only ushort* i, 
    read_only uint mx) 
{ 
    int index = get_global_id(0); 
    float4 curY = y[index]; 

    float minDist = MAXFLOAT; 
    ushort minIdx = -1; 
    int x = 0; 
    int mmx = mx; 
    for(x = 0; x < mmx; x++) 
    { 
     float dist = fast_distance(curY, m[x]); 
     if (dist < minDist) 
     { 
      minDist = dist; 
      minIdx = x; 
     } 
    } 
    i[index] = minIdx; 
    y[index] = minDist; 
} 

6 dimensions:

kernel void BruteForce(
    global read_only float8* m, 
    global float8* y, 
    global write_only ushort* i, 
    read_only uint mx) 
{ 
    int index = get_global_id(0); 
    float8 curY = y[index]; 

    float minDist = MAXFLOAT; 
    ushort minIdx = -1; 
    int x = 0; 
    int mmx = mx; 
    for(x = 0; x < mmx; x++) 
    { 
     float8 p = m[x]; 
     float d0 = p.s0 - curY.s0; 
     float d1 = p.s1 - curY.s1; 
     float d2 = p.s2 - curY.s2; 
     float d3 = p.s3 - curY.s3; 
     float d4 = p.s4 - curY.s4; 
     float d5 = p.s5 - curY.s5; 

     float dist = sqrt(d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5); 
     if (dist < minDist) 
     { 
      minDist = dist; 
      minIdx = x; 
     } 
    } 
    i[index] = minIdx; 
    y[index] = minDist; 
} 

I am looking for ways to optimize this program for the GPU. I have read some articles about GPGPU optimization (including http://www.macresearch.org/opencl_episode6, which comes with source code) that use local memory. I tried applying that idea to this code and came up with the following:

kernel void BruteForce(
    global read_only float4* m, 
    global float4* y, 
    global write_only ushort* i, 
    __local float4 * shared) 
{ 
    int index = get_global_id(0); 
    int lsize = get_local_size(0); 
    int lid = get_local_id(0); 

    float4 curY = y[index]; 

    float minDist = MAXFLOAT; 
    ushort minIdx = 64000; 
    int x = 0; 
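    // note: the {0} placeholders below are presumably filled in by the host 
    // (e.g. via string formatting) with the number of points in m before the 
    // kernel source is compiled 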
    for(x = 0; x < {0}; x += lsize) 
    { 
     if((x+lsize) > {0}) 
      lsize = {0} - x; 
     if ((x + lid) < {0}) 
     { 
      shared[lid] = m[x + lid]; 
     } 
     barrier(CLK_LOCAL_MEM_FENCE); 

     for (int x1 = 0; x1 < lsize; x1++) 
     { 
      float dist = distance(curY, shared[x1]); 

      if (dist < minDist) 
      { 
       minDist = dist; 
       minIdx = x + x1; 
      } 
     } 
     barrier(CLK_LOCAL_MEM_FENCE); 
    } 
    i[index] = minIdx; 
    y[index] = minDist; 
} 

I am getting garbage results in my 'i' output (for example, many values are the same). Can anyone point me in the right direction? I would appreciate any answer that helps me improve this code, or that spots the problem in the optimized version above.

Thank you very much, Cauê


How large are your arrays typically? For the 6d version, do you mind whether the last two elements are fully used or dropped entirely? – mfa


Thank you for your answer! My arrays are quite large (~400000 points each). I don't mind dropping the last two elements entirely; I only added them because, in my tests, a non-aligned version (using float* instead of float8*) threw an "out of resources" error. – Waneck


Also, in the 4D case I actually only use 3 dimensions, so that probably applies to that version as well. – Waneck

Answer


One way to get a big speedup here is to use local data structures and crunch an entire block of data at a time. You should also only need a single read/write global vector (float4). The same idea can be applied to the 6d version by using smaller blocks. Each work group is free to work through the block of data it is processing. I will leave the exact implementation to you, since you know the specifics of your application.

Some rough pseudocode (4D):

computeBlockSize is the size of the blocks to read from global memory and crunch. 
This value should be a multiple of your work group size. I like 64 as a WG 
size; it tends to perform well on most platforms. You will be allocating 
2 * sizeof(float4) * computeBlockSize + sizeof(uint) * computeBlockSize of local memory. 
(max value for ocl 1.0 ~448, ocl 1.1 ~896) 
#define computeBlockSize 256 

__local float4 blockA[computeBlockSize]; 
__local float4 blockB[computeBlockSize]; 
__local uint blockAnearestIndex[computeBlockSize]; 

now blockA gets computed against all blockB combinations. this is the job of a single work group. 
*important*: only blockA ever gets written to. blockB is stored in local memory, but never changed or copied back to global 

steps: 
load blockA into local memory with async_work_group_copy 
blockA is located at get_group_id(0) * computeBlockSize in the global vector 
optional: set all blockA 'w' values to MAXFLOAT 
optional: load blockAnearestIndex into local memory with async_work_group_copy if needed 


need to compute blockA against itself first, then go through the blockB's 
be careful to only write to blockA[j], NOT blockA[k]. j is exclusive to this work item 
for(j=get_local_id(0); j<computeBlockSize; j+=get_local_size(0)) 
    for(k=0; k<computeBlockSize; k++) 
        if(j==k) continue; //no self-comparison 
        calculate distance of blockA[j] vs blockA[k] 
        store min distance in blockA[j].w 
        store global index (= get_group_id(0)*computeBlockSize + k) of nearest in blockAnearestIndex[j] 
barrier(local_mem_fence) 

for (i=0; i<get_num_groups(0); i++) 
    if (i==get_group_id(0)) continue; 
    load blockB into local memory: async_work_group_copy(...) 
    for(j=get_local_id(0); j<computeBlockSize; j+=get_local_size(0)) 
        for(k=0; k<computeBlockSize; k++) 
            calculate distance of blockA[j] vs blockB[k] 
            store min distance in blockA[j].w 
            store global index (= i*computeBlockSize + k) of nearest in blockAnearestIndex[j] 
    barrier(local_mem_fence) 

write blockA and blockAnearestIndex to global memory using two async_work_group_copy 
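
Putting the pseudocode together, a minimal (untested) OpenCL sketch of this blocked approach might look like the kernel below. It assumes a single global float4 array pts (xyz = coordinates, w receives the minimum distance on output) plus a separate uint array for the nearest index, keeps the running minimum in separate local arrays rather than in blockA's w component, and assumes the point count is a multiple of computeBlockSize; all names are illustrative:

#define computeBlockSize 256 

// ignore w when measuring distance; w is reserved for the result 
inline float dist3(float4 a, float4 b) 
{ 
    float4 d = a - b; 
    return sqrt(d.x * d.x + d.y * d.y + d.z * d.z); 
} 

// launch with global size = (numPoints / computeBlockSize) * work-group size, 
// e.g. a work-group size of 64; assumes numPoints % computeBlockSize == 0 
kernel void BruteForceBlocked( 
    global float4* pts,     // xyz = point, w = minimum distance on output 
    global uint* nearest)   // index of the nearest neighbor on output 
{ 
    local float4 blockA[computeBlockSize]; 
    local float4 blockB[computeBlockSize]; 
    local float bestDist[computeBlockSize]; 
    local uint bestIdx[computeBlockSize]; 

    const uint lid = get_local_id(0); 
    const uint lsize = get_local_size(0); 
    const uint aBase = get_group_id(0) * computeBlockSize; 

    // copy this group's block of points into local memory 
    event_t evA = async_work_group_copy(blockA, pts + aBase, computeBlockSize, 0); 
    wait_group_events(1, &evA); 

    // each work item owns the strided subset j = lid, lid+lsize, ... 
    for (uint j = lid; j < computeBlockSize; j += lsize) 
    { 
        bestDist[j] = MAXFLOAT; 
        bestIdx[j] = 0; 
    } 

    // blockA against itself, skipping the self-comparison 
    for (uint j = lid; j < computeBlockSize; j += lsize) 
        for (uint k = 0; k < computeBlockSize; k++) 
        { 
            if (j == k) continue; 
            float d = dist3(blockA[j], blockA[k]); 
            if (d < bestDist[j]) { bestDist[j] = d; bestIdx[j] = aBase + k; } 
        } 

    // blockA against every other block 
    for (uint b = 0; b < get_num_groups(0); b++) 
    { 
        if (b == get_group_id(0)) continue; 
        barrier(CLK_LOCAL_MEM_FENCE);   // everyone is done with the previous blockB 
        event_t evB = async_work_group_copy(blockB, pts + b * computeBlockSize, 
                                            computeBlockSize, 0); 
        wait_group_events(1, &evB); 

        for (uint j = lid; j < computeBlockSize; j += lsize) 
            for (uint k = 0; k < computeBlockSize; k++) 
            { 
                float d = dist3(blockA[j], blockB[k]); 
                if (d < bestDist[j]) { bestDist[j] = d; bestIdx[j] = b * computeBlockSize + k; } 
            } 
    } 

    // write the results for this block back to global memory 
    for (uint j = lid; j < computeBlockSize; j += lsize) 
    { 
        pts[aBase + j].w = bestDist[j]; 
        nearest[aBase + j] = bestIdx[j]; 
    } 
} 

Note that only the w components of pts are ever written, which matches the point made below about other work groups safely reading the same block.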

There should be no problem reading blockB while another work group is writing the same block (as its own blockA), because only the w values may have changed. If you do happen to run into a problem with this, or if you really need two distinct vectors of points, you could use two global vectors like the one above, one with the A's (writable) and the other with the B's (read only).

This algorithm works best when your global data size is a multiple of computeBlockSize. To handle the edges, two solutions come to mind. I recommend writing a second kernel for the non-square edge blocks, in much the same way as above. The new kernel can be executed after the first one, and it would save you a second pci-e transfer. Alternatively, you can use a distance of -1 to signify a skip in the comparison of two elements (i.e. if blockA[j].w == -1 or blockB[k].w == -1, continue). This second solution results in more branching in your kernel, which is why I recommend writing a new kernel; only a very small fraction of your data points will actually fall into an edge block.
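
For the second option, the check inside the inner loop could look something like this, assuming (hypothetically) that the host pads the global array up to a multiple of computeBlockSize and marks the padding points with w = -1:

// skip comparisons that involve a padding element (w == -1 marks padding) 
if (blockA[j].w == -1.0f || blockB[k].w == -1.0f) 
    continue; 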


Thank you very much for your answer! It took me a long time to get back to it because another project came up and I didn't have time to check it. Sorry about that! – Waneck