
cuda triple riemann sum

I am trying to perform a triple Riemann sum via CUDA. I am trying to use a multidimensional grid of iterators for my sum iterators to avoid nested loops. I am using a compute capability 2.0 Tesla card, so I am not able to use nested kernels.

It does not seem to be producing the full 0 -> N iteration for each of the x, y, z variables that I need.

__global__ void test(){ 
uint xIteration = blockDim.x * blockIdx.x + threadIdx.x; 
uint yIteration = blockDim.y * blockIdx.y + threadIdx.y; 
uint zIteration = blockDim.z * blockIdx.z + threadIdx.z; 
printf("x: %d * %d + %d = %d\n y: %d * %d + %d = %d\n z: %d * %d + %d = %d\n", blockDim.x, blockIdx.x, threadIdx.x, xIteration, blockDim.y, blockIdx.y, threadIdx.y, yIteration, blockDim.z, blockIdx.z, threadIdx.z, zIteration); 
} 

---- called by ----

int totalIterations = 128; // N value for single sum (i = 0; i < N) 
dim3 threadsPerBlock(8,8,8); 
dim3 blocksPerGrid((totalIterations + threadsPerBlock.x - 1)/threadsPerBlock.x, 
        (totalIterations + threadsPerBlock.y - 1)/threadsPerBlock.y, 
        (totalIterations + threadsPerBlock.z - 1)/threadsPerBlock.z); 
test<<<blocksPerGrid, threadsPerBlock>>>(); 

---- output ----

x y z 
... 
7 4 0 
7 4 1 
7 4 2 
7 4 3 
7 4 4 
7 4 5 
7 4 6 
7 4 7 
7 5 0 
7 5 1 
7 5 2 
7 5 3 
7 5 4 
7 5 5 
7 5 6 
7 5 7 
7 6 0 
7 6 1 
7 6 2 
7 6 3 
7 6 4 
7 6 5 
7 6 6 
7 6 7 
7 7 0 
7 7 1 
7 7 2 
7 7 3 
7 7 4 
7 7 5 
7 7 6 
7 7 7 
... 

The output is truncated. I am currently getting every value 0 through 7 for x, y, and z, but I need 0 through 127 when totalIterations is 128. For example, in this execution 40 < z < 49, where it should be 0 <= z <= 127. My understanding of the multidimensional grid may be incorrect, but for the Riemann sum, each of the iterators x, y, and z has to take on every value from 0 to 127.

Also, if I make totalIterations > 128, for example 1024, the program dies with cudaError code 6, which I understand to be a launch timer expiration. The kernel does nothing but print, so I don't understand why it would time out. Running it on a secondary device seems to work around the problem. We are currently using one of the Teslas to run X, but a GeForce is in the mail to become the new display device and free up both Teslas for computation.

The printf(...) will be replaced with the function to be evaluated.

The idea is to replace

for (int i = 0...) 
    for (int j = 0 ..) 
     for (int k = 0...) 

of the serial version of the code. I am not sure how to store the function values, because it does not seem memory-efficient to create a potentially huge (million x million x million) 3D array and then reduce it; instead I would like to somehow accumulate the function values into some kind of shared variable.
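As an illustration only (not the poster's actual code): one way to avoid materializing the full N x N x N array of function values is to have each thread evaluate the integrand at its (x, y, z) sample, reduce the block's 8*8*8 = 512 values in shared memory, and have one thread per block write a single partial sum to a global array that is summed afterwards. The integrand f, the blockSums array, and the step size dx below are placeholder assumptions; the sketch assumes the same threadsPerBlock(8,8,8) launch as above.

// Minimal sketch: accumulate the sum on the fly instead of storing N*N*N values. 
__device__ float f(float x, float y, float z) { 
    return x * y * z; // hypothetical integrand; stands in for the real function 
} 

// Each thread evaluates one sample; the block reduces its 512 samples in shared 
// memory, then writes a single partial sum per block to global memory. 
__global__ void riemann(float *blockSums, int N, float dx) { 
    __shared__ float partial[512];   // assumes 8x8x8 threads per block 
    unsigned int x = blockDim.x * blockIdx.x + threadIdx.x; 
    unsigned int y = blockDim.y * blockIdx.y + threadIdx.y; 
    unsigned int z = blockDim.z * blockIdx.z + threadIdx.z; 
    unsigned int t = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x; 

    float v = 0.0f; 
    if (x < (unsigned)N && y < (unsigned)N && z < (unsigned)N) 
        v = f(x * dx, y * dx, z * dx) * dx * dx * dx; 
    partial[t] = v; 
    __syncthreads(); 

    // standard shared-memory tree reduction over the block's 512 values 
    for (unsigned int s = 256; s > 0; s >>= 1) { 
        if (t < s) partial[t] += partial[t + s]; 
        __syncthreads(); 
    } 
    if (t == 0) { 
        unsigned int b = (blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x; 
        blockSums[b] = partial[0];   // one float per block 
    } 
} 

The blockSums array has one entry per block (16*16*16 = 4096 entries for totalIterations = 128), so it can be copied back and summed on the host or reduced by a second kernel; the full 128^3 array of function values never has to exist.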

---- device info (we have 2 of these cards, the output is identical for both) ----

Device 1: "Tesla C2050" 
    CUDA Driver Version/Runtime Version   5.0/5.0 
    CUDA Capability Major/Minor version number: 2.0 
    Total amount of global memory:     2687 MBytes (2817982464 bytes) 
    (14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores 
    GPU Clock rate:        1147 MHz (1.15 GHz) 
    Memory Clock rate:        1500 Mhz 
    Memory Bus Width:        384-bit 
    L2 Cache Size:         786432 bytes 
    Max Texture Dimension Size (x,y,z)    1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048) 
    Max Layered Texture Size (dim) x layers  1D=(16384) x 2048, 2D=(16384,16384) x 2048 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 32768 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 1536 
    Maximum number of threads per block:   1024 
    Maximum sizes of each dimension of a block: 1024 x 1024 x 64 
    Maximum sizes of each dimension of a grid:  65535 x 65535 x 65535 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and execution:     Yes with 2 copy engine(s) 
    Run time limit on kernels:      No 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Concurrent kernel execution:     Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support enabled:    Yes 
    Device is using TCC driver mode:    No 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Bus ID/PCI location ID:   132/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

First things first: what output do you expect, and what output are you getting? What is the value of totalIterations? Is that the total per dimension or the overall total (X * Y * Z iterations)? Regarding the reduction, you are right: you will want to reduce on the fly rather than store everything to memory and reduce afterwards. A combination of shared and global temporary storage would be your best bet. But first you need to answer the questions above... – harrism


totalIterations is a single dimension (currently X, Y, and Z are all the same size). I expect every integer value from 0 to totalIterations for each of xIteration, yIteration, and zIteration. I get different values for each iterator on each execution, but I never get a set of values covering every permutation of x, y, z. The expectation, for totalIterations = 2, would be one thread per combination of x, y, z values: one thread would have iterator values 0,0,0, another 1,0,0, then 1,1,0, 1,0,1, and so on, until every permutation has been produced. – Jim


When more detail is requested, it is best to add that detail to the question (click "edit"). Could you post a specific example input, the expected output, and the actual output in the question? – harrism

Answer


I think, as has already been mentioned, that using printf in device code to verify that every (x, y, z) element is touched by a thread is unwise for large values of x, y, z.

I created the following code, based on yours, to demonstrate that every (x, y, z) element is touched by a thread:

#include <stdio.h> 
#define DATAVAL 1 
#define cudaCheckErrors(msg) \ 
    do { \ 
     cudaError_t __err = cudaGetLastError(); \ 
     if (__err != cudaSuccess) { \ 
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ 
       msg, cudaGetErrorString(__err), \ 
       __FILE__, __LINE__); \ 
      fprintf(stderr, "*** FAILED - ABORTING\n"); \ 
      exit(1); \ 
     } \ 
    } while (0) 

__global__ void test(int *data, int dim){ 
    uint xIteration = blockDim.x * blockIdx.x + threadIdx.x; 
    uint yIteration = blockDim.y * blockIdx.y + threadIdx.y; 
    uint zIteration = blockDim.z * blockIdx.z + threadIdx.z; 

    // flatten (z, y, x) into a linear index; each thread marks exactly one element 
    data[((((zIteration*dim)+yIteration)*dim)+xIteration)]=DATAVAL; 
} 

int main(){ 
    int *testdata; 
    int *result; 
    int totalIterations = 128; // N value for single sum (i = 0; i < N) 
    int testsize = totalIterations*totalIterations*totalIterations; 
    dim3 threadsPerBlock(8,8,8); 
    dim3 blocksPerGrid((totalIterations + threadsPerBlock.x - 1)/threadsPerBlock.x, (totalIterations + threadsPerBlock.y - 1)/threadsPerBlock.y, (totalIterations + threadsPerBlock.z - 1)/threadsPerBlock.z); 
    cudaMalloc(&testdata, testsize*sizeof(int)); 
    cudaCheckErrors("cudaMalloc fail"); 
    cudaMemset(testdata, 0, testsize*sizeof(int)); 
    cudaCheckErrors("cudaMemset fail"); 
    result=(int *)malloc(testsize*sizeof(int)); 
    if (result == 0) {printf("malloc fail \n"); return 1;} 
    memset(result, 0, testsize*sizeof(int)); 
    test<<<blocksPerGrid, threadsPerBlock>>>(testdata, totalIterations); 
    cudaDeviceSynchronize(); 
    cudaCheckErrors("Kernel launch failure"); 
    cudaMemcpy(result, testdata, testsize*sizeof(int), cudaMemcpyDeviceToHost); 
    cudaCheckErrors("cudaMemcpy failure"); 

    for (unsigned i=0; i<testsize; i++) 
    if (result[i] != DATAVAL) {printf("fail! \n"); return 1;} 

    printf("Success \n"); 
    return 0; 

}