Using CUDA's Thrust library for large values

Hi, I wanted to implement a very large loop, but I found that it is much slower than the normal C++ code. Can you tell me where I am going wrong? fi and fj are host vectors.

xsize is usually a 7-8 digit number.

thrust::host_vector<double> df((2*floor(r)*(floor(r)+1)+1)*n*n);
thrust::device_vector<double> gpu_df((2*floor(r)*(floor(r)+1)+1)*n*n);
for (i = 0; i < xsize; i++)
{
    gpu_df[i] = (fi[i] - fj[i]);

    if (gpu_df[i] < 0)
        gpu_df[i] = 0;
    else
        gpu_df[i] = gpu_df[i] * (fi[i] - fj[i]);
    if (gpu_df[i] > 255)
        gpu_df[i] = 255;
    // cout << fi[i] << "\n";
}
df = gpu_df;

I feel the code is not being parallelized. Can you please help me out?

Asked 2011-06-14 by Madhu

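To run programs on the GPU with Thrust, you need to write them in terms of Thrust algorithms such as reduce, transform, sort, and so on. In the loop from the question, every gpu_df[i] access made from host code turns into a separate host/device memory transfer, and the loop itself still runs serially on the CPU, which is why it ends up slower than plain C++. In this case we can write the computation in terms of transform, since the loop just computes a function F(fi[i], fj[i]) and stores the result in df[i]. Note that we must first move the input arrays to the device before calling transform, because Thrust requires the input and output arrays to live in the same place.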

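#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <cstdio>

// Plain functor: thrust::transform only needs a callable operator(),
// so no base class is required.
struct my_functor
{
    __host__ __device__
    float operator()(float fi, float fj) const
    {
        float d = fi - fj;

        // negative differences become 0, otherwise square the difference
        if (d < 0)
            d = 0;
        else
            d = d * d;

        // cap the result at 255
        if (d > 255)
            d = 255;

        return d;
    }
};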

int main(void) 
{ 
    size_t N = 5; 

    // allocate storage on host 
    thrust::host_vector<float> cpu_fi(N); 
    thrust::host_vector<float> cpu_fj(N); 
    thrust::host_vector<float> cpu_df(N); 

    // initialize fi and fj arrays
    cpu_fi[0] = 2.0; cpu_fj[0] = 0.0; 
    cpu_fi[1] = 0.0; cpu_fj[1] = 2.0; 
    cpu_fi[2] = 3.0; cpu_fj[2] = 1.0; 
    cpu_fi[3] = 4.0; cpu_fj[3] = 5.0; 
    cpu_fi[4] = 8.0; cpu_fj[4] = -8.0; 

    // copy fi and fj to device 
    thrust::device_vector<float> gpu_fi = cpu_fi; 
    thrust::device_vector<float> gpu_fj = cpu_fj; 

    // allocate storage for df 
    thrust::device_vector<float> gpu_df(N); 

    // perform transformation 
    thrust::transform(gpu_fi.begin(), gpu_fi.end(), // first input range 
        gpu_fj.begin(),    // second input range 
        gpu_df.begin(),    // output range 
        my_functor());     // functor to apply 

    // copy results back to host 
    thrust::copy(gpu_df.begin(), gpu_df.end(), cpu_df.begin()); 

    // print results on host 
    for (size_t i = 0; i < N; i++)
        printf("f(%2.0lf,%2.0lf) = %3.0lf\n", cpu_fi[i], cpu_fj[i], cpu_df[i]);

    return 0; 
}

FYI, here is the output of the program:

f( 2, 0) =   4
f( 0, 2) =   0
f( 3, 1) =   4
f( 4, 5) =   0
f( 8,-8) = 255
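
As a cross-check on those numbers, here is a minimal host-only C++ sketch of the same per-element function, using nothing beyond the standard library. The names clamp_sq_diff and reference_df are hypothetical and not part of Thrust; the idea is just to have a CPU reference that device results can be compared against.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for the functor's arithmetic:
// 0 for a negative difference, otherwise the squared difference, capped at 255.
inline float clamp_sq_diff(float fi, float fj)
{
    float d = fi - fj;
    d = (d < 0.0f) ? 0.0f : d * d;
    return (d > 255.0f) ? 255.0f : d;
}

// Applies clamp_sq_diff element by element.
// Assumes fi and fj have the same length.
inline std::vector<float> reference_df(const std::vector<float>& fi,
                                       const std::vector<float>& fj)
{
    std::vector<float> df(fi.size());
    for (std::size_t i = 0; i < fi.size(); ++i)
        df[i] = clamp_sq_diff(fi[i], fj[i]);
    return df;
}
```

Calling reference_df on the five sample pairs should give the same values as the GPU run, which makes it a cheap sanity check before scaling up to millions of elements.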

Answered 2011-06-15 03:11:15 by wnbell

Thanks wnbell. Can it also be used for 2D vectors? I want something like xi[i][0] - xj[i][0]; would it work like xi[0] - xj[0]? – Madhu 2011-06-15 04:35:12

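Thrust only provides 1D vector containers, so you have to decide how to flatten your 2D vectors into 1D vectors. Assuming you flatten all of your 2D vectors the same way, you can generally ignore the 2D nature of the data when calling algorithms like "transform". – wnbell 2011-06-15 14:49:23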
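
To make the flattening idea concrete, here is a minimal host-side sketch, assuming a row-major layout where element (i, j) of a rows x cols array is stored at index i * cols + j. The helpers flatten_row_major and flat_difference are hypothetical names, and std::transform is used only as a stand-in: thrust::transform takes its arguments in the same order (first1, last1, first2, result, op), so the same flat buffers could be copied into thrust::device_vector objects and processed on the GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: row-major flattening, so element (i, j) ends up
// at index i * cols + j. Assumes every row has the same length.
inline std::vector<double> flatten_row_major(
    const std::vector<std::vector<double>>& v)
{
    std::vector<double> flat;
    if (v.empty())
        return flat;
    flat.reserve(v.size() * v[0].size());
    for (const auto& row : v)
        flat.insert(flat.end(), row.begin(), row.end());
    return flat;
}

// Element-wise difference over two flat buffers of equal length.
// std::transform stands in for thrust::transform in this host-only sketch.
inline std::vector<double> flat_difference(const std::vector<double>& xi,
                                           const std::vector<double>& xj)
{
    std::vector<double> out(xi.size());
    std::transform(xi.begin(), xi.end(), xj.begin(), out.begin(),
                   [](double a, double b) { return a - b; });
    return out;
}
```

With this layout, the value of xi[i][0] - xj[i][0] sits at index i * cols of the result, so the algorithm call itself never needs to know about the second dimension.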
