Cuda - reverse - 優文庫

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
using namespace std;

const int threadsPerBlock = 256;
const int blocksPerGrid = 1024;
const int N = 64;

__global__ void reverse(int *data, int count){

    __shared__ int cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    int cacheIndex = threadIdx.x;
    int tr = count-cacheIndex-1;
    if(tid < count/2)
        cache[cacheIndex] = data[cacheIndex];

    __syncthreads();
    data[cacheIndex] = cache[tr];
}

int main(void){

    int a[N];
    int *devA;

    generate(a,N);

    cudaMalloc((void**)&devA, N * sizeof(int));

    cudaMemcpy(devA, a, N * sizeof(int), cudaMemcpyHostToDevice);

    reverse<<<blocksPerGrid,threadsPerBlock>>>(devA,N);

    cudaMemcpy(a, devA, N * sizeof(int), cudaMemcpyDeviceToHost);

    cout << a[63];

    cudaFree(devA);

}

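The code above does not reverse my array. What is wrong with this program? What did I get wrong? Everything looks fine to me. What do I need to change to make it work correctly?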

Source

2013-05-17 ehah

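You are launching too many threads. Your algorithm needs N threads, but you launch 1024 * 256 of them.

It is also good coding practice to wrap the kernel code in a thread check, like this: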

int idx = threadIdx.x + blockDim.x*blockIdx.x;

if (idx < count) {
    // put your kernel code here
}
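To make the two points above concrete, here is a minimal sketch that sizes the grid from the element count and guards the kernel body. The names `reverseGuarded` and `reverseOnGpu` are hypothetical and not from the question; the kernel writes to a separate output buffer, which is an assumption made to keep the example free of races. Error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per element, writing the reversed
// input into a separate output buffer.
__global__ void reverseGuarded(const int *in, int *out, int count) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < count) {                  // threads past the end do nothing
        out[idx] = in[count - 1 - idx];
    }
}

// Hypothetical host helper: copies the data over, launches just enough
// blocks to cover n elements, and copies the result back.
std::vector<int> reverseOnGpu(const std::vector<int> &host) {
    int n = static_cast<int>(host.size());
    std::vector<int> result(n);
    if (n == 0) return result;

    int *devIn = nullptr;
    int *devOut = nullptr;
    cudaMalloc((void**)&devIn, n * sizeof(int));
    cudaMalloc((void**)&devOut, n * sizeof(int));
    cudaMemcpy(devIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up, not a fixed 1024
    reverseGuarded<<<blocks, threads>>>(devIn, devOut, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(result.data(), devOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    return result;
}
```

With N = 64 and 256 threads per block this launches a single block, and the 192 surplus threads fall through the `if` without touching memory.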

Also, your kernel is written in such a way that it will only really work for data sizes that fit in a single thread block.

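You might be better off just looking at the solution proposed by @alrikai here. That solution does not need any synchronization or shared memory, so it is simpler.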
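The linked answer is not reproduced on this page, so the following is only a sketch of one common way to reverse an array without shared memory or synchronization, and it may differ from what alrikai actually posted. Each thread in the first half swaps its element with the mirrored one; since every pair is touched by exactly one thread, no `__syncthreads()` is needed. The names `reverseInPlace` and `reverseInPlaceOnGpu` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread idx swaps data[idx] with its mirror element.
// Only the first count/2 threads do any work.
__global__ void reverseInPlace(int *data, int count) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < count / 2) {
        int mirror = count - 1 - idx;
        int tmp = data[idx];
        data[idx] = data[mirror];
        data[mirror] = tmp;
    }
}

// Hypothetical host helper; error checking omitted for brevity.
std::vector<int> reverseInPlaceOnGpu(std::vector<int> host) {
    int n = static_cast<int>(host.size());
    if (n < 2) return host;             // nothing to swap

    int *dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(int));
    cudaMemcpy(dev, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    int pairs = n / 2;                  // one thread per swapped pair
    int blocks = (pairs + threads - 1) / threads;
    reverseInPlace<<<blocks, threads>>>(dev, n);

    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

For an odd element count the middle element is its own mirror and is simply left in place.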

Edit, in response to the question below:

I made a mistake because I was thinking of alrikai's solution. I have edited my code above. Give it a try.

Source

2013-05-17 19:26:58

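I edited my kernel function. Please take a look at it. It doesn't work. – ehah

Why not use the kernel proposed by alrikai that I linked to in my answer? The problem with the kernel you just edited is that you don't understand what the if statement is supposed to do. You didn't use the curly braces I suggested. *All* of your kernel code should be conditional on the if statement I suggested. –

I want to do it with shared memory. :) – ehah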
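For readers who also want a shared-memory version, here is one possible sketch; it is not code from this thread. It assumes a separate output buffer and a block size of at most `kTile` threads, and the names `kTile`, `reverseShared`, and `reverseSharedOnGpu` are hypothetical. Each block stages its tile in shared memory, then writes the tile back reversed at the mirrored position in the output. Note that `__syncthreads()` stays outside the bounds check: every thread of a block must reach the barrier, so it should not sit inside an `if` that only some threads enter.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kTile = 256;   // assumed block size; must be >= blockDim.x

// Hypothetical kernel: reverse via a per-block shared-memory tile.
__global__ void reverseShared(const int *in, int *out, int count) {
    __shared__ int cache[kTile];
    int tid = static_cast<int>(threadIdx.x);
    int tile = static_cast<int>(blockDim.x);
    int blockStart = static_cast<int>(blockIdx.x) * tile;
    int idx = blockStart + tid;

    if (idx < count) {
        cache[tid] = in[idx];           // stage this block's tile
    }
    __syncthreads();                    // reached by every thread in the block

    // Number of valid elements in this tile (the last tile may be partial).
    int remaining = count - blockStart;
    int valid = remaining < tile ? remaining : tile;
    if (tid < valid) {
        int src = valid - 1 - tid;      // read the tile back to front
        int srcGlobal = blockStart + src;
        out[count - 1 - srcGlobal] = cache[src];
    }
}

// Hypothetical host helper; error checking omitted for brevity.
std::vector<int> reverseSharedOnGpu(const std::vector<int> &host) {
    int n = static_cast<int>(host.size());
    std::vector<int> result(n);
    if (n == 0) return result;

    int *devIn = nullptr;
    int *devOut = nullptr;
    cudaMalloc((void**)&devIn, n * sizeof(int));
    cudaMalloc((void**)&devOut, n * sizeof(int));
    cudaMemcpy(devIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + kTile - 1) / kTile;
    reverseShared<<<blocks, kTile>>>(devIn, devOut, n);

    cudaMemcpy(result.data(), devOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    return result;
}
```

Unlike the kernel in the question, this one indexes global memory with the full `blockIdx.x * blockDim.x + threadIdx.x` offset, so it is not limited to data that fits in a single block, and it only reads shared-memory entries that were actually written.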
