
cuda triple riemann sum

I am trying to perform a triple Riemann sum via CUDA. I am trying to use a multidimensional grid of iterators for my sum iterators to avoid nested loops. I am using a compute capability 2.0 Tesla card, so I am not able to use nested kernels.

It does not seem to be producing the full 0 -> N iteration for each of the x, y, z variables that I need.

__global__ void test(){ 
uint xIteration = blockDim.x * blockIdx.x + threadIdx.x; 
uint yIteration = blockDim.y * blockIdx.y + threadIdx.y; 
uint zIteration = blockDim.z * blockIdx.z + threadIdx.z; 
printf("x: %d * %d + %d = %d\n y: %d * %d + %d = %d\n z: %d * %d + %d = %d\n", blockDim.x, blockIdx.x, threadIdx.x, xIteration, blockDim.y, blockIdx.y, threadIdx.y, yIteration, blockDim.z, blockIdx.z, threadIdx.z, zIteration); 
} 

---- called by ----

int totalIterations = 128; // N value for single sum (i = 0; i < N) 
dim3 threadsPerBlock(8,8,8); 
dim3 blocksPerGrid((totalIterations + threadsPerBlock.x - 1)/threadsPerBlock.x, 
        (totalIterations + threadsPerBlock.y - 1)/threadsPerBlock.y, 
        (totalIterations + threadsPerBlock.z - 1)/threadsPerBlock.z); 
test<<<blocksPerGrid, threadsPerBlock>>>(); 

---- output ----

x y z 
... 
7 4 0 
7 4 1 
7 4 2 
7 4 3 
7 4 4 
7 4 5 
7 4 6 
7 4 7 
7 5 0 
7 5 1 
7 5 2 
7 5 3 
7 5 4 
7 5 5 
7 5 6 
7 5 7 
7 6 0 
7 6 1 
7 6 2 
7 6 3 
7 6 4 
7 6 5 
7 6 6 
7 6 7 
7 7 0 
7 7 1 
7 7 2 
7 7 3 
7 7 4 
7 7 5 
7 7 6 
7 7 7 
... 

The output is truncated. I am currently getting every value 0 through 7 for x, y, and z, but I need 0 through 127 when totalIterations is 128. For example, in this execution 40 < z < 49, where it should be 0 <= z <= 127. My understanding of the multidimensional grid may be incorrect, but for the Riemann sum, each of the iterators x, y, and z has to take on every value from 0 to 127.

Also, if I make totalIterations > 128, for example 1024, the program dies with cudaError code 6, which I understand to be a launch timer expiration. The kernel does nothing but print, so I don't understand why it would time out. Running it on a secondary device seems to work around the problem. We are currently using one of the Teslas to run X, but a GeForce is in the mail to become the new display device and free up both Teslas for computation.

The printf(...) will be replaced with the function to be evaluated.

The idea is to replace

for (int i = 0...) 
    for (int j = 0 ..) 
     for (int k = 0...) 

of the serial version of the code. I am not sure how to store the function values, because it does not seem memory-efficient to create a potentially huge (million x million x million) 3D array and then reduce it; instead I would like to somehow accumulate the function values into some kind of shared variable.
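As an illustration only (not the poster's actual code): one way to avoid materializing the full N x N x N array of function values is to have each thread evaluate the integrand at its (x, y, z) sample, reduce the block's 8*8*8 = 512 values in shared memory, and have one thread per block write a single partial sum to a global array that is summed afterwards. The integrand f, the blockSums array, and the step size dx below are placeholder assumptions; the sketch assumes the same threadsPerBlock(8,8,8) launch as above.

// Minimal sketch: accumulate the sum on the fly instead of storing N*N*N values. 
__device__ float f(float x, float y, float z) { 
    return x * y * z; // hypothetical integrand; stands in for the real function 
} 

// Each thread evaluates one sample; the block reduces its 512 samples in shared 
// memory, then writes a single partial sum per block to global memory. 
__global__ void riemann(float *blockSums, int N, float dx) { 
    __shared__ float partial[512];   // assumes 8x8x8 threads per block 
    unsigned int x = blockDim.x * blockIdx.x + threadIdx.x; 
    unsigned int y = blockDim.y * blockIdx.y + threadIdx.y; 
    unsigned int z = blockDim.z * blockIdx.z + threadIdx.z; 
    unsigned int t = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x; 

    float v = 0.0f; 
    if (x < (unsigned)N && y < (unsigned)N && z < (unsigned)N) 
        v = f(x * dx, y * dx, z * dx) * dx * dx * dx; 
    partial[t] = v; 
    __syncthreads(); 

    // standard shared-memory tree reduction over the block's 512 values 
    for (unsigned int s = 256; s > 0; s >>= 1) { 
        if (t < s) partial[t] += partial[t + s]; 
        __syncthreads(); 
    } 
    if (t == 0) { 
        unsigned int b = (blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x; 
        blockSums[b] = partial[0];   // one float per block 
    } 
} 

The blockSums array has one entry per block (16*16*16 = 4096 entries for totalIterations = 128), so it can be copied back and summed on the host or reduced by a second kernel; the full 128^3 array of function values never has to exist.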

---- device info (we have 2 of these cards, the output is identical for both) ----

Device 1: "Tesla C2050" 
    CUDA Driver Version/Runtime Version   5.0/5.0 
    CUDA Capability Major/Minor version number: 2.0 
    Total amount of global memory:     2687 MBytes (2817982464 bytes) 
    (14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores 
    GPU Clock rate:        1147 MHz (1.15 GHz) 
    Memory Clock rate:        1500 Mhz 
    Memory Bus Width:        384-bit 
    L2 Cache Size:         786432 bytes 
    Max Texture Dimension Size (x,y,z)    1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048) 
    Max Layered Texture Size (dim) x layers  1D=(16384) x 2048, 2D=(16384,16384) x 2048 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 32768 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 1536 
    Maximum number of threads per block:   1024 
    Maximum sizes of each dimension of a block: 1024 x 1024 x 64 
    Maximum sizes of each dimension of a grid:  65535 x 65535 x 65535 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and execution:     Yes with 2 copy engine(s) 
    Run time limit on kernels:      No 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Concurrent kernel execution:     Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support enabled:    Yes 
    Device is using TCC driver mode:    No 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Bus ID/PCI location ID:   132/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

First things first: what output do you expect, and what output are you getting? What is the value of totalIterations? Is that the total per dimension or the overall total (X * Y * Z iterations)? Regarding the reduction, you are right: you will want to reduce on the fly rather than store everything to memory and reduce afterwards. A combination of shared and global temporary storage would be your best bet. But first you need to answer the questions above... – harrism


totalIterations is a single dimension (currently X, Y, and Z are all the same size). I expect every integer value from 0 to totalIterations for each of xIteration, yIteration, and zIteration. I get different values for each iterator on each execution, but I never get a set of values covering every permutation of x, y, z. The expectation, for totalIterations = 2, would be one thread per combination of x, y, z values: one thread would have iterator values 0,0,0, another 1,0,0, then 1,1,0, 1,0,1, and so on, until every permutation has been produced. – Jim


When more detail is requested, it is best to add that detail to the question (click "edit"). Could you post a specific example input, the expected output, and the actual output in the question? – harrism

Answer


I think, as has already been mentioned, that using printf in device code to verify that every (x, y, z) element is touched by a thread is unwise for large values of x, y, z.

I created the following code, based on yours, to demonstrate that every (x, y, z) element is touched by a thread:

#include <stdio.h> 
#define DATAVAL 1 
#define cudaCheckErrors(msg) \ 
    do { \ 
     cudaError_t __err = cudaGetLastError(); \ 
     if (__err != cudaSuccess) { \ 
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 
       msg, cudaGetErrorString(__err), \ 
       __FILE__, __LINE__); \ 
      fprintf(stderr, "*** FAILED - ABORTING\n"); \ 
      exit(1); \ 
     } \ 
    } while (0) 

__global__ void test(int *data, int dim){ 
    uint xIteration = blockDim.x * blockIdx.x + threadIdx.x; 
    uint yIteration = blockDim.y * blockIdx.y + threadIdx.y; 
    uint zIteration = blockDim.z * blockIdx.z + threadIdx.z; 

    // flatten (z, y, x) into a linear index; each thread marks exactly one element 
    data[((((zIteration*dim)+yIteration)*dim)+xIteration)]=DATAVAL; 
} 

int main(){ 
    int *testdata; 
    int *result; 
    int totalIterations = 128; // N value for single sum (i = 0; i < N) 
    int testsize = totalIterations*totalIterations*totalIterations; 
    dim3 threadsPerBlock(8,8,8); 
    dim3 blocksPerGrid((totalIterations + threadsPerBlock.x - 1)/threadsPerBlock.x, (totalIterations + threadsPerBlock.y - 1)/threadsPerBlock.y, (totalIterations + threadsPerBlock.z - 1)/threadsPerBlock.z); 
    cudaMalloc(&testdata, testsize*sizeof(int)); 
    cudaCheckErrors("cudaMalloc fail"); 
    cudaMemset(testdata, 0, testsize*sizeof(int)); 
    cudaCheckErrors("cudaMemset fail"); 
    result=(int *)malloc(testsize*sizeof(int)); 
    if (result == 0) {printf("malloc fail \n"); return 1;} 
    memset(result, 0, testsize*sizeof(int)); 
    test<<<blocksPerGrid, threadsPerBlock>>>(testdata, totalIterations); 
    cudaDeviceSynchronize(); 
    cudaCheckErrors("Kernel launch failure"); 
    cudaMemcpy(result, testdata, testsize*sizeof(int), cudaMemcpyDeviceToHost); 
    cudaCheckErrors("cudaMemcpy failure"); 

    for (unsigned i=0; i<testsize; i++) 
    if (result[i] != DATAVAL) {printf("fail! \n"); return 1;} 

    printf("Success \n"); 
    return 0; 

}