
Why is overlapped data transfer in CUDA slower than expected?

When I run simpleMultiCopy from the CUDA SDK (4.0) on a Tesla C2050, I get the following results:

[simpleMultiCopy] starting... 
[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores) 
> Device name: Tesla C2050 
> CUDA Capability 2.0 hardware with 14 multi-processors 
> scale_factor = 1.00 
> array_size = 4194304 


Relevant properties of this CUDA device 
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap") 
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution 
    (compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000) 

Measured timings (throughput): 
Memcpy host to device : 2.725792 ms (6.154988 GB/s) 
Memcpy device to host : 2.723360 ms (6.160484 GB/s) 
Kernel   : 0.611264 ms (274.467599 GB/s) 

Theoretical limits for speedup gained from overlapped data transfers: 
No overlap at all (transfer-kernel-transfer): 6.060416 ms 
Compute can overlap with one transfer: 5.449152 ms 
Compute can overlap with both data transfers: 2.725792 ms 

Average measured timings over 10 repetitions: 
Avg. time when execution fully serialized : 6.113555 ms 
Avg. time when overlapped using 4 streams : 4.308822 ms 
Avg. speedup gained (serialized - overlapped) : 1.804733 ms 

Measured throughput: 
Fully serialized execution  : 5.488530 GB/s 
Overlapped using 4 streams  : 7.787379 GB/s 
[simpleMultiCopy] test results... 
PASSED 

This suggests an expected runtime of 2.7 ms, yet it actually takes 4.3 ms. What exactly causes this difference? (I have also posted this question at http://forums.developer.nvidia.com/devforum/discussion/comment/8976.)
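
For reference, the "theoretical limits" in the output follow directly from the three measured timings. A minimal sketch of that arithmetic (the constants are copied from the run above):

#include <math.h>
#include <stdio.h>

int main(void) {
    const double h2d    = 2.725792;  /* ms, memcpy host to device */
    const double kernel = 0.611264;  /* ms, kernel execution      */
    const double d2h    = 2.723360;  /* ms, memcpy device to host */

    /* No overlap: the three stages run back to back.             */
    printf("serialized:   %f ms\n", h2d + kernel + d2h);  /* 6.060416 */

    /* One copy engine: the kernel hides behind one transfer.     */
    printf("one overlap:  %f ms\n", h2d + d2h);           /* 5.449152 */

    /* Two copy engines: bounded by the longest single stage.     */
    printf("full overlap: %f ms\n",
           fmax(fmax(h2d, kernel), d2h));                 /* 2.725792 */
    return 0;
}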

Answer


The first kernel launch cannot start until the first memcpy has completed, and the last memcpy cannot start until the last kernel launch has completed. So there is an "overhang" that introduces some of the overhead you are observing. You can decrease the size of this overhang by increasing the number of streams, but the streams' inter-engine synchronization incurs overhead of its own.
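
For illustration, here is a minimal sketch of the breadth-first multi-stream pattern simpleMultiCopy uses (this is not the SDK's exact source; NSTREAMS, N, incKernel, and the buffer arrays are placeholder names). It makes the overhang visible: stream 0's kernel cannot start until its own H2D copy finishes, and the last D2H copy cannot start until the last kernel finishes.

#include <cuda_runtime.h>

#define NSTREAMS 4
#define N (4194304 / NSTREAMS)    /* elements per stream (placeholder) */

__global__ void incKernel(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;         /* stand-in for the real workload */
}

/* h_in/h_out must be pinned (cudaMallocHost), otherwise the async
   copies degrade to synchronous ones and nothing overlaps. */
void overlappedPipeline(int *h_in[], int *h_out[], int *d_buf[]) {
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    /* Breadth-first issue order: all H2D copies, then all kernels,
       then all D2H copies, so the copies of stream s+1 can run
       while the kernel of stream s executes. */
    for (int s = 0; s < NSTREAMS; ++s)
        cudaMemcpyAsync(d_buf[s], h_in[s], N * sizeof(int),
                        cudaMemcpyHostToDevice, streams[s]);
    for (int s = 0; s < NSTREAMS; ++s)
        incKernel<<<(N + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], N);
    for (int s = 0; s < NSTREAMS; ++s)
        cudaMemcpyAsync(h_out[s], d_buf[s], N * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[s]);

    cudaDeviceSynchronize();
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamDestroy(streams[s]);
}

During ramp-up only the first H2D copy is in flight and during ramp-down only the last D2H copy is, so roughly 1/NSTREAMS of the timeline at each end runs without overlap; that is the overhang, and it shrinks (at the cost of more synchronization) as NSTREAMS grows.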

It is also important to note that overlapping compute with transfers is not always beneficial for a given workload: beyond the overhead issues described above, the workload itself must spend roughly equal amounts of time on computation and on data transfer. Per Amdahl's Law, the potential 2x or 3x speedup falls off as the workload becomes transfer-bound or compute-bound.
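
As a rough sanity check of that bound, here is a sketch (max_speedup is a hypothetical helper, not part of the SDK sample): with two copy engines, the overlapped runtime can never be shorter than the longest single stage, so the best-case speedup is the serialized time divided by that stage.

#include <math.h>

/* Upper bound on the overlap speedup for a transfer-kernel-transfer
   pipeline with two copy engines. It reaches 3x only when all three
   stages take equal time, and collapses toward 1x as one stage
   dominates. */
double max_speedup(double h2d_ms, double kernel_ms, double d2h_ms) {
    double longest = fmax(fmax(h2d_ms, kernel_ms), d2h_ms);
    return (h2d_ms + kernel_ms + d2h_ms) / longest;
}

/* For the run above: (2.726 + 0.611 + 2.723) / 2.726 is about 2.22x
   at best; the measured 6.11 ms / 4.31 ms is about 1.42x. */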


The overhang, while important in real applications, is eliminated from the measurements in this test program. I have run the program with 10000 repetitions instead of 10 and got the same values (less than 0.01 ns difference). – 2012-02-13 11:20:40


You are right that if the workload is transfer- or compute-heavy, the relative gain will be much lower. But I was not expecting a 2x or 3x speedup, only that the runtime would be the maximum of the three operations involved. The actual runtime is almost 60% longer, and that has to come from somewhere. Why 60% in this case, and what happens on other GPUs? – 2012-02-13 11:46:46