CUDA block parallelism

I've written some code in CUDA and I'm a bit confused about what actually runs in parallel. Say I call a kernel like this: kernel_foo<<<A, B>>>. According to the device query below, I can have a maximum of 512 threads per block. So am I guaranteed 512 computations per block every time I run kernel_foo<<<A, 512>>>? But it says here that one thread runs on one CUDA core, which would mean I can only run 96 threads concurrently at a time? (See device_query below.)

I'd also like to know about the blocks. Every time I call kernel_foo<<<A, 512>>>, how many computations are done in parallel, and how? I mean, is it done one block after another, or are the blocks parallelized too? If so, how many blocks can run 512 threads each in parallel? It says here that one block runs on one CUDA SM, so is it true that 12 blocks can run concurrently? And if so, how many threads of each block can run simultaneously while all 12 blocks are running: 8, 96, or 512? (See device_query below.)

One more question: if A has a value of, say, 50, is it better to launch the kernel as kernel_foo<<<A, 512>>> or kernel_foo<<<512, A>>>? Assume no thread synchronization is needed.
Sorry, these are probably basic questions, but it's all a bit convoluted...

Possible duplicates:
Streaming multiprocessors, Blocks and Threads (CUDA)
How do CUDA blocks/warps/threads map onto CUDA cores?

Thanks!

Here is my device_query:
Device 0: "Quadro FX 4600"
CUDA Driver Version/Runtime Version 4.2/4.2
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID/PCI location ID: 2/0