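CUDA binary search implementation

I want to speed up my CPU binary search. Unfortunately, the GPU version is always much slower than the CPU version. Maybe the problem is not suited to the GPU, or am I doing something wrong?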

CPU version (about 0.6 ms): binary search for a specific value in a sorted array of length 2000

...
Lookup(search[j], search_array, array_length, m);
...
int Lookup(int search, int* arr, int length, int& m)
{
    int l(0), r(length - 1);
    while (l <= r)
    {
        m = (l + r) / 2;
        if (search < arr[m])
            r = m - 1;
        else if (search > arr[m])
            l = m + 1;
        else
        {
            // Note: `index` is not declared anywhere in this excerpt
            return index[m];
        }
    }
    if (arr[m] >= search)
        return m;
    return (m + 1);
}

GPU version (about 20 ms): binary search for a specific value in a sorted array of length 2000

....
p_ary_search<<<16, 64>>>(search[j], array_length, dev_arr, dev_ret_val);
....

__global__ void p_ary_search(int search, int array_length, int *arr, int *ret_val)
{
    const int num_threads = blockDim.x * gridDim.x;
    const int thread = blockIdx.x * blockDim.x + threadIdx.x;
    int set_size = array_length;

    ret_val[0] = -1; // return value
    ret_val[1] = 0;  // offset

    while (set_size != 0)
    {
        // Get the offset of the array, initially set to 0
        int offset = ret_val[1];

        // I think this is necessary in case a thread gets ahead, and resets offset before it's read
        // This isn't necessary for the unit tests to pass, but I still like it here
        __syncthreads();

        // Get the next index to check
        int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);

        // If the index is outside the bounds of the array then lets not check it
        if (index_to_check < array_length)
        {
            // If the next index is outside the bounds of the array, then set it to maximum array size
            int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
            if (next_index_to_check >= array_length)
            {
                next_index_to_check = array_length - 1;
            }

            // If we're at the mid section of the array reset the offset to this index
            if (search > arr[index_to_check] && (search < arr[next_index_to_check]))
            {
                ret_val[1] = index_to_check;
            }
            else if (search == arr[index_to_check])
            {
                // Set the return var if we hit it
                ret_val[0] = index_to_check;
            }
        }

        // Since this is a p-ary search divide by our total threads to get the next set size
        set_size = set_size / num_threads;

        // Sync up so no threads jump ahead and get a bad offset
        __syncthreads();
    }
}

Even when I try larger arrays, the time ratio does not get any better.

Source

2011-08-16 Izidor

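A plain binary search is not a good fit for the GPU. It is a serial operation that cannot be parallelized as is. You can, however, split the array into small chunks and search them in parallel: create X chunks, use X parallel threads to determine which chunk may contain the value, discard everything but the candidate chunk, subdivide it further, and so on... –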
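The chunk-and-discard idea in this comment can be sketched on the host in C++. This is only an illustration under stated assumptions: `p_ary_search_host` is a hypothetical name, the inner loop stands in for `p` threads that would run concurrently on a GPU, and the array is assumed to be sorted in ascending order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch of a p-ary search (hypothetical helper, not from the thread).
// Each pass of the inner loop plays the role of one of `p` GPU threads.
// Returns the index of `key` in the sorted vector, or -1 if it is absent.
int p_ary_search_host(const std::vector<int>& arr, int key, std::size_t p) {
    if (p < 2) p = 2;                                // fewer than 2 chunks cannot shrink the range
    std::size_t lo = 0, hi = arr.size();             // candidate range [lo, hi)
    while (hi - lo > p) {
        std::size_t chunk = (hi - lo + p - 1) / p;   // ceiling, so the chunks cover the range
        std::size_t next_lo = lo, next_hi = lo;      // stays empty if no chunk claims the key
        for (std::size_t t = 0; t < p; ++t) {        // "thread" t inspects chunk t
            std::size_t begin = lo + t * chunk;
            if (begin >= hi) break;
            std::size_t end = (begin + chunk < hi) ? begin + chunk : hi;
            if (arr[begin] <= key && key <= arr[end - 1]) {
                next_lo = begin;                     // keep only the candidate chunk
                next_hi = end;
                break;                               // sequential stand-in: first claim wins
            }
        }
        lo = next_lo;
        hi = next_hi;
    }
    for (std::size_t i = lo; i < hi; ++i)            // at most p elements remain
        if (arr[i] == key) return static_cast<int>(i);
    return -1;
}
```

With 16 x 64 = 1024 threads and 2000 elements, a single narrowing pass already leaves a two-element range, which is why the launch overhead, not the search itself, dominates at this size.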

You might want to check out the Thrust binary search at http://wiki.thrust.googlecode.com/hg/html/group__binary__search.html – jmsu

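Your code has too many divergent branches, so you are essentially serializing the whole process on the GPU. You want to break up the work so that all threads in the same warp take the same path through a branch. See page 47 of the CUDA Best Practices Guide.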
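One way to picture "the same path for every thread" is a search whose loop runs a number of iterations that depends only on the array length, never on the key. The host-side C++ sketch below is an illustration, not the answerer's code: `uniform_lower_bound` is a hypothetical name, and whether the conditional becomes a select instruction rather than a branch is up to the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lower-bound search with key-independent control flow (hypothetical helper).
// Returns the first index i with arr[i] >= key, or arr.size() if there is none.
std::size_t uniform_lower_bound(const std::vector<int>& arr, int key) {
    std::size_t base = 0;
    std::size_t len = arr.size();
    while (len > 1) {                     // trip count depends only on arr.size()
        std::size_t half = len / 2;
        // A conditional add instead of an if / else-if / else ladder:
        // every caller executes the same statements in the same order.
        base += (arr[base + half] < key) ? half : 0;
        len -= half;
    }
    if (len == 1 && arr[base] < key) ++base;
    return base;
}
```

If many threads each ran this loop for a different key over the same array, they would all perform the same number of iterations, which is the property the answer asks for.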

Source

2011-08-16 16:04:22 SpaceghostAli

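I used an array of 2000 elements and searched for the number 395 with the CPU version of the binary search. On my PC it took only 0.000933 ms. As a test, I created a kernel launched as <<<2000,1>>> that does nothing at all: __global__ void Search() { int tid = threadIdx.x + blockIdx.x * blockDim.x; if (tid < 2000) { } } and merely calling it took 0.034704 ms. Given this result, I really wonder whether it makes sense to use CUDA to make things faster. Or am I doing something wrong... – Izidor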

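That really is how it is. CUDA has some overhead that takes time, but when you are doing something that takes, say, 10 seconds on the CPU and the GPU can do it faster even with a 0.03-second overhead, which would you prefer? CUDA definitely works, but if the task is already very fast on the CPU, it may not be worth it. – jmsu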
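The two timings quoted in the thread (0.000933 ms per CPU search, 0.034704 ms for an empty kernel launch) give a rough break-even point. The C++ helper below is a hypothetical back-of-the-envelope sketch: it ignores memory transfers and the GPU's own compute time, so the real break-even point can only be higher.

```cpp
#include <cassert>
#include <cmath>

// Smallest whole number of CPU searches whose combined time is at least one
// kernel-launch overhead. Hypothetical helper; a lower bound on break-even.
long break_even_searches(double launch_overhead_ms, double cpu_ms_per_search) {
    return static_cast<long>(std::ceil(launch_overhead_ms / cpu_ms_per_search));
}
```

For the measurements above, 0.034704 / 0.000933 is roughly 37.2, so a launch has to replace at least 38 CPU searches before it can pay for itself.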

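Thanks. I will try to move more of the work from the PC to the GPU, and I hope that will make up for the overhead. I thought that without memory copies between the CPU and the GPU the overhead would already be small, but apparently it is not. I will also look at the Thrust binary search. – Izidor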
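Moving more work to the GPU usually means answering many queries per launch instead of launching once per `search[j]`. The C++ sketch below shows that shape on the host: one call takes every query, and each loop iteration is independent, so it corresponds to what a single GPU thread could do. `batch_lower_bound` is a hypothetical name, not a Thrust or CUDA function; Thrust's vectorized searches take a range of values in a comparable way.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Answers all queries in a single call (hypothetical helper).
// out[q] is the first index i with sorted[i] >= queries[q].
std::vector<std::size_t> batch_lower_bound(const std::vector<int>& sorted,
                                           const std::vector<int>& queries) {
    std::vector<std::size_t> out(queries.size());
    for (std::size_t q = 0; q < queries.size(); ++q) {   // one "thread" per query
        auto it = std::lower_bound(sorted.begin(), sorted.end(), queries[q]);
        out[q] = static_cast<std::size_t>(it - sorted.begin());
    }
    return out;
}
```

Batched this way, the fixed launch cost is paid once for the whole set of queries rather than once per lookup.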

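I must admit I am not entirely sure what your kernel does, but am I right in assuming that you are looking for a single index that satisfies your search criteria? If so, have a look at the reduction sample that ships with CUDA to see how to structure and optimize such a query. (What you are doing is essentially reducing to the index closest to your query.)

Some quick pointers, though:

You are performing an awful lot of reads and writes to global memory, which is incredibly slow. Try using shared memory instead. Secondly, remember that __syncthreads() only synchronizes threads within the same block, so your reads and writes to global memory will not necessarily be synchronized across all threads (although the latency of the global memory writes may make it appear as if they are).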
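The reduction framing can be made concrete: in a sorted array, the position of a key equals the number of elements smaller than it, and counting is a sum reduction. The host-side C++ sketch below is an assumption-laden illustration (`lower_bound_by_reduction` is a hypothetical name). It does O(n) work, so it only shows the structure; in a CUDA kernel the `partial` buffer would typically be per-block shared memory, with the per-block results combined in a later step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Finds the first index i with arr[i] >= key by counting elements < key
// with a pairwise (tree) sum. Hypothetical helper; arr must be sorted.
std::size_t lower_bound_by_reduction(const std::vector<int>& arr, int key) {
    // Map step: one flag per element, as one thread per element would compute.
    std::vector<std::size_t> partial(arr.size());
    for (std::size_t i = 0; i < arr.size(); ++i)
        partial[i] = (arr[i] < key) ? 1 : 0;

    // Reduce step: pairwise sums with a doubling stride, like a tree reduction.
    for (std::size_t stride = 1; stride < partial.size(); stride *= 2)
        for (std::size_t i = 0; i + stride < partial.size(); i += 2 * stride)
            partial[i] += partial[i + stride];

    return partial.empty() ? 0 : partial[0];
}
```

Each level of the tree depends on the previous one being complete, which is where a block-level barrier such as __syncthreads() would sit in a single-block kernel.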

Source

2012-09-09 19:15:07
