CUDA block parallelism

I've written some code in CUDA and I'm a bit confused about what actually runs in parallel. Say I call a kernel like this: kernel_foo<<<A, B>>>. According to the device query below, I can have a maximum of 512 threads per block. So am I guaranteed 512 computations per block every time I run kernel_foo<<<A, 512>>>? But it says here that one thread runs on one CUDA core, which would mean I can only run 96 threads concurrently at a time? (See device_query below.)

I'd also like to know about the blocks. Every time I call kernel_foo<<<A, 512>>>, how many computations are done in parallel, and how? I mean, is it done one block after another, or are the blocks parallelized too? If so, how many blocks can run 512 threads each in parallel? It says here that one block runs on one CUDA SM, so is it true that 12 blocks can run concurrently? And if so, how many threads of each block can run simultaneously while all 12 blocks are running: 8, 96, or 512? (See device_query below.)

One more question: if A has a value of, say, 50, is it better to launch the kernel as kernel_foo<<<A, 512>>> or kernel_foo<<<512, A>>>? Assume no thread synchronization is needed.
Sorry, these are probably basic questions, but it's all a bit convoluted...

Possible duplicates:
Streaming multiprocessors, Blocks and Threads (CUDA)
How do CUDA blocks/warps/threads map onto CUDA cores?

Thanks!

Here is my device_query:
Device 0: "Quadro FX 4600"
CUDA Driver Version/Runtime Version 4.2/4.2
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID/PCI location ID: 2/0