CUDA和並行搜索

我想要做以下事情：假設我有一個大小爲N（N相當大）的排序數字向量和一個數字x。我想在這個向量中並行搜索數字x的正確位置。例如：CUDA和並行搜索

myVector = [1，2，3，...，10000]並且x = 3.2，

然後我不得不返回3.第一線程來找到正確的位置應當中斷其他線程的工作。那麼花費的時間將會最小化：t = min（t_1，t_2，......，線程的t_number）您認爲使用多線程尋找正確位置可能會更快嗎？線程之間的通信如何？由於線程一旦紅色值與搜索結果不匹配，其他線程必須在搜索過程中跳過此值（可能是必須更改的布爾值。）

您是否有一些建議要共享關於這個算法？

來源

2013-03-07 ALFRAM

除非已排序的向量已經存在於設備內存中，否則對此使用CUDA是沒有意義的。 CPU上的二進制搜索具有複雜性log2 n。 – RoBiK 2013-03-07 10:16:16

您可能對[thrust :: lower_bound]感興趣（http://thrust.github.com/doc/group__binary__search.html）或[thrust :: partition_point]（http://thrust.github.com/doc/group__searching .html＃ga1b61bfe7c810941e02b723e050c805ba）如果你不熟悉推力，有一個[入門指南]（https://github.com/thrust/thrust/wiki/Quick-Start-Guide）。 – 2013-03-07 16:10:32

前一段時間我寫了下面的代碼，做類似的事情：

#include "cuda_runtime.h" 
#include "device_launch_parameters.h" 

#include <stdio.h> 
#include <stdlib.h> 

__global__ void fast_finder(unsigned int *g_found, float x, float *y) 
{ 
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; 
    unsigned int pos = (unsigned int)(x == y[i]); 
    g_found[i * (1 - pos)] = i * pos; 
} 

int main(int argc, char *argv[]) 
{ 
    int N = 65536; 
    unsigned int h_found, *d_found; 
    float *h_y = (float *)malloc(N * sizeof(float)), *d_y, x = 5.0f; 
    int nThreads = 1024, nBloks = N/nThreads; 

    for (int i = 0; i < N; ++i) h_y[i] = (float)(N - i - 1); 

    if (x != h_y[0]) { 
     cudaSetDevice(0); 
     cudaMalloc((void **)&d_found, N * sizeof(unsigned int)); 
     cudaMalloc((void **)&d_y, N * sizeof(float)); 
     cudaMemcpy(d_y, h_y, N * sizeof(float), cudaMemcpyHostToDevice); 

     fast_finder<<<nBloks, nThreads>>>(d_found, x, d_y); 
     cudaThreadSynchronize(); 

     cudaMemcpy(&h_found, d_found, sizeof(unsigned int), cudaMemcpyDeviceToHost); 
     if (h_found) printf("%g found on %d. position!\n", x, h_found); 
     else printf("%g not found!\n", x); 

     cudaFree(d_y); 
     cudaFree(d_found); 

    } else printf("%g found on the first position!\n", x); 

    free(h_y); 

    getchar(); 
    return EXIT_SUCCESS; 
}

這裏每個線程檢查由全局線程指數y提供的值等於x。如果它是真的，則線程將其索引寫入g_found數組的第一個位置，否則將0寫入其索引提供的g_found的位置。對於長度爲16的y，在第11位包含在y值5的出來如下：

g_found = { 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }

在這種情況下y不需要被排序，但必須只包含唯一值。此代碼可伊斯利改變爲一個發現其中提供x將被插入（設備部件）指數，如下：這個版本

__global__ void fast_finder(unsigned int *g_found, float x, float *y) 
{ 
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; 
    unsigned int pos = (unsigned int)(x >= y[i] || x <= y[i+1]); 
    g_found[i * (1 - pos)] = (i + 1) * pos; 
}

輸出將類似於礦。當位置0處的g_found爲0時，x的值不存在於陣列y中。在內核被調用之前，通過主機代碼檢查y的第一個元素是否等於x。改變這個部分也不是一個問題，也不適用於你想要的條件。

正如您所看到的，在這樣的解決方案中，所有線程一起工作，並且不需要任何執行終止，只要找到x即可。好的事情也將是申請包搜索，意思是一個線程尋找y的一小部分，因此允許y要大得多。

來源

2013-03-07 10:26:46 Matso

謝謝你的幫助。 – ALFRAM 2013-03-08 14:04:30

無需線程和模塊之間的通信。您可以檢查，看看是否在當前索引值比預期更大。如果這樣返回，大多數線程將無法生存此檢查。

現在，您只有線索的索引值小於期望值，請檢查下一個值是否大於或等於查詢並返回相應索引。

這是我在上午5點寫的未經測試的內核。

template<typename ty> 
__global___ static void search(int *out, ty *list, ty val, int n) 
{ 
    int start = threadIdx.x + blockIdx.x * blockDim.x; 
    for (int idx = start; idx < n; idx += gridDim.x * blockDim.x) { 
     if (list[idx] >= val) return; 
     ty next = list[idx + 1]; 
     if (idx == n-1 || next >= val) { 
      *out = next == val ? (idx + 1) : idx; 
      return; 
     } 
    } 
}

這就是說，你真的不想這樣做。使用CPU時，可能會出現O（log n）的最差情況。這意味着搜索十億個元素可以分32步完成。除非你有數據已經在GPU上，並且想要避免內存拷貝，否則這在CPU上更好。

來源

2013-03-07 10:25:05

CUDA和並行搜索

回答

相關問題