cuda triple riemann sum

I am trying to compute a triple Riemann sum via CUDA. I am trying to use a multidimensional grid iterator for my sum iterators to avoid nested loops. I am using a 2.0 Tesla card, so I am unable to use nested kernels.

It does not appear to be producing the complete 0 -> N iteration for each of the x, y, z variables that I need.
__global__ void test(){
    uint xIteration = blockDim.x * blockIdx.x + threadIdx.x;
    uint yIteration = blockDim.y * blockIdx.y + threadIdx.y;
    uint zIteration = blockDim.z * blockIdx.z + threadIdx.z;
    printf("x: %u * %u + %u = %u\n y: %u * %u + %u = %u\n z: %u * %u + %u = %u\n",
           blockDim.x, blockIdx.x, threadIdx.x, xIteration,
           blockDim.y, blockIdx.y, threadIdx.y, yIteration,
           blockDim.z, blockIdx.z, threadIdx.z, zIteration);
}
---- called by ----
int totalIterations = 128; // N value for single sum (i = 0; i < N)
dim3 threadsPerBlock(8,8,8);
dim3 blocksPerGrid((totalIterations + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (totalIterations + threadsPerBlock.y - 1) / threadsPerBlock.y,
                   (totalIterations + threadsPerBlock.z - 1) / threadsPerBlock.z);
test<<<blocksPerGrid, threadsPerBlock>>>();
---- output ----
x y z
...
7 4 0
7 4 1
7 4 2
7 4 3
7 4 4
7 4 5
7 4 6
7 4 7
7 5 0
7 5 1
7 5 2
7 5 3
7 5 4
7 5 5
7 5 6
7 5 7
7 6 0
7 6 1
7 6 2
7 6 3
7 6 4
7 6 5
7 6 6
7 6 7
7 7 0
7 7 1
7 7 2
7 7 3
7 7 4
7 7 5
7 7 6
7 7 7
...
Output truncated. I am currently getting every 0 <= x, y, z <= 7, but I need 0 <= x, y, z <= 127 when totalIterations is 128. For example, in this execution z only runs from 0 to 7, when it should be 0 <= z <= 127. My understanding of the multidimensional grid may be wrong, but for the Riemann sum, each iterator x, y and z must take every value from 0 to 127.
Also, if I make totalIterations > 128, for example 1024, the program dies with cudaError code 6, which I understand to be a launch timer expiring. The kernel does nothing but print, so I don't understand why it would time out. Running it on a secondary device seems to work around the problem. We are currently running X on one of the Teslas, but a GeForce is in the mail to become the new display device, which will free both Teslas for computation.
The printf(...) will be replaced by the function to be evaluated.
The idea is to replace the nested loops of the serial version:

for (int i = 0; ...)
    for (int j = 0; ...)
        for (int k = 0; ...)

Also, I am not sure how to store the function values, because creating a potentially enormous (million x million x million) 3D array and then reducing it does not seem memory-efficient; I would rather somehow accumulate the function values into some kind of shared variable.
---- Device info (we have 2 of these cards; the output is identical for both) ----
Device 1: "Tesla C2050"
CUDA Driver Version/Runtime Version 5.0/5.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2687 MBytes (2817982464 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1500 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID/PCI location ID: 132/0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
First things first: what output do you expect, and what output are you getting? What is the value of totalIterations? Does it mean the total per dimension, or the overall total (X * Y * Z iterations)? Regarding the reduction, you are right: you will want to reduce on the fly rather than store to memory and then reduce. A combination of shared and global temporary storage will be your best bet. But first you need to answer the questions above... – harrism
totalIterations is a single dimension (currently X, Y, Z are all the same size). I expect xIteration, yIteration and zIteration to take every integer value from 0 to totalIterations. I get different values for each iterator on each execution, but I never get a set of values corresponding to every permutation of x, y, z. The expectation would be, for totalIterations = 2: one thread for each x, y, z value. One thread would have the iterator values 0,0,0, then another 1,0,0, then 1,1,0, then 1,0,1, and so on, until every permutation has been executed. – Jim
When more detail is needed, it is best to add that detail to the question itself (click "edit"). Can you post a specific example input, expected output, and actual output in the question? – harrism