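Well, I can give you a quick tutorial, but I won't necessarily write all of this for you.
So first, you'll want to get MS Visual Studio set up with CUDA, which is easy if you follow this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/
Next, you'll want to read the NVIDIA CUDA Programming Guide (a free PDF), the documentation, and CUDA by Example (a book I highly recommend for learning CUDA).
But let's assume you haven't done that yet, and definitely will later.
This is a very arithmetic-heavy and data-light computation. It can actually be computed fairly simply without this brute-force method, but that isn't the answer you're looking for. I suggest something like this for the kernel: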
__global__ void kernel(int* myNumber, unsigned long long* numOfHits){
    //a shared value is stored on-chip, which is beneficial since it is written to multiple times
    //it is shared by all threads of one block (not the whole grid)
    //shared variables cannot be initialized at their declaration, so one thread zeroes it
    __shared__ unsigned long long s_hits;
    if(threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0;
    __syncthreads();
    //each thread counts into a private variable first, so the loops themselves are race-free
    //the counters are 64-bit because the total can exceed the range of a 32-bit int
    unsigned long long hits = 0;
    const int n = *myNumber;
    //the starting i and j identify the current thread uniquely within the grid
    //we increment i and j by the number of threads in one dimension of the block (16 here) times the number of blocks in that dimension, which can be quite large (but not 100,000)
    for(int i = threadIdx.x + blockIdx.x*blockDim.x; i < 100000; i += blockDim.x*gridDim.x){
        //j has to restart for every i, otherwise only the first i of each thread gets checked
        for(int j = threadIdx.y + blockIdx.y*blockDim.y; j < 100000; j += blockDim.y*gridDim.y){
            //Thanks to talonmies for this simplification
            if(0 <= (n-i-j) && (n-i-j) < 100000){
                hits++;
            }
        }
    }
    //atomics are needed here, otherwise the value may change during the 'read, modify, write' process
    atomicAdd(&s_hits, hits);
    //synchronize threads, so we know s_hits is completely updated
    __syncthreads();
    //only one thread per thread block adds s_hits to the global total, again atomically
    //*numOfHits must be set to 0 on the device before the launch
    if(threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(numOfHits, s_hits);
}
To launch the kernel, you'll want something like this:
dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
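I know you probably want a quick way to do this, but getting into CUDA isn't really a "quick" thing. As in, you'll need to do some reading and some setup to get it working; past that, the learning curve isn't too steep. I haven't told you anything about memory allocation yet, so you'll need to do that (although it's simple). If you follow my code, my goal is that you'll have to read up a bit on shared memory and CUDA, so you'll already be kick-started. Good luck!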
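For the memory allocation part, here is a rough, untested sketch of what the host side could look like. The helpers `check` and `launchAndCount` are names I made up for illustration; the CUDA runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaMemset`, `cudaFree`, `cudaGetLastError`) are the standard ones. The counter type is a template parameter, so the helper only assumes the kernel takes an `int*` followed by a pointer to some integer counter.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: allocate device memory, launch kernel `k`, return the count.
// Counter is deduced from the type of the kernel's second parameter.
template <typename Counter>
Counter launchAndCount(void (*k)(int*, Counter*), int myNumber,
                       dim3 blocks, dim3 threads) {
    int* d_myNumber = nullptr;
    Counter* d_numOfHits = nullptr;
    check(cudaMalloc(&d_myNumber, sizeof(int)), "cudaMalloc myNumber");
    check(cudaMalloc(&d_numOfHits, sizeof(Counter)), "cudaMalloc numOfHits");

    // Copy the input to the device and zero the accumulator before the launch.
    check(cudaMemcpy(d_myNumber, &myNumber, sizeof(int),
                     cudaMemcpyHostToDevice), "cudaMemcpy to device");
    check(cudaMemset(d_numOfHits, 0, sizeof(Counter)), "cudaMemset numOfHits");

    k<<<blocks, threads>>>(d_myNumber, d_numOfHits);
    check(cudaGetLastError(), "kernel launch");

    // This blocking copy on the default stream is ordered after the kernel,
    // so it also reports errors that occurred while the kernel ran.
    Counter hits = 0;
    check(cudaMemcpy(&hits, d_numOfHits, sizeof(Counter),
                     cudaMemcpyDeviceToHost), "cudaMemcpy to host");

    check(cudaFree(d_myNumber), "cudaFree myNumber");
    check(cudaFree(d_numOfHits), "cudaFree numOfHits");
    return hits;
}
```

With the kernel from this answer in the same .cu file, a call would look like `launchAndCount(kernel, 150000, dim3(32, 32, 1), dim3(16, 16, 1))`. The value 150000 and the block count of 32 are only placeholders, not tuned values.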
Disclaimer: I haven't tested my code, and I'm not an expert. It could be idiotic.
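Since the kernel is untested, a CPU-side reference to compare against may help. This is a minimal sketch, assuming the kernel is meant to count the pairs (i, j) in [0, 100000) for which 0 <= myNumber - i - j < 100000. The names `countHitsDirect`, `countHitsBruteForce`, `target`, and `limit` are mine, not from the question. It is plain host-side C++, which also compiles inside a .cu file.

```cpp
#include <algorithm>
#include <cstdint>

// Counts pairs (i, j) with 0 <= i, j < limit and 0 <= target - i - j < limit.
// For each i the valid j values form one contiguous range, so no inner loop is needed.
std::uint64_t countHitsDirect(std::int64_t target, std::int64_t limit) {
    std::uint64_t hits = 0;
    for (std::int64_t i = 0; i < limit; ++i) {
        // j must satisfy target - i - limit < j <= target - i and 0 <= j < limit.
        const std::int64_t lo = std::max<std::int64_t>(0, target - i - limit + 1);
        const std::int64_t hi = std::min<std::int64_t>(limit - 1, target - i);
        if (hi >= lo) {
            hits += static_cast<std::uint64_t>(hi - lo + 1);
        }
    }
    return hits;
}

// Same double loop as the kernel, run serially; only practical for small limits.
std::uint64_t countHitsBruteForce(std::int64_t target, std::int64_t limit) {
    std::uint64_t hits = 0;
    for (std::int64_t i = 0; i < limit; ++i) {
        for (std::int64_t j = 0; j < limit; ++j) {
            const std::int64_t k = target - i - j;
            if (0 <= k && k < limit) {
                ++hits;
            }
        }
    }
    return hits;
}
```

For example, `countHitsDirect(99999, 100000)` gives 5000050000, which does not fit in a 32-bit `int`, so the result is worth keeping in a 64-bit counter.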
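You could also just calculate 'numOfHits' directly, instead of brute-forcing it with all those loops... – sth
You haven't asked a question here. What do you want to know _exactly_? – talonmies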