Why are overlapped data transfers in CUDA slower than expected?
When I run simpleMultiCopy from the SDK (4.0) on a Tesla C2050, I get the following results:
[simpleMultiCopy] starting...
[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)
> Device name: Tesla C2050
> CUDA Capability 2.0 hardware with 14 multi-processors
> scale_factor = 1.00
> array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000)
Measured timings (throughput):
Memcpy host to device : 2.725792 ms (6.154988 GB/s)
Memcpy device to host : 2.723360 ms (6.160484 GB/s)
Kernel : 0.611264 ms (274.467599 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 6.060416 ms
Compute can overlap with one transfer: 5.449152 ms
Compute can overlap with both data transfers: 2.725792 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 6.113555 ms
Avg. time when overlapped using 4 streams : 4.308822 ms
Avg. speedup gained (serialized - overlapped) : 1.804733 ms
Measured throughput:
Fully serialized execution : 5.488530 GB/s
Overlapped using 4 streams : 7.787379 GB/s
[simpleMultiCopy] test results...
PASSED
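This indicates that the expected runtime is 2.7 ms, whereas it actually takes 4.3 ms. What exactly causes this discrepancy? (I have also posted this question at http://forums.developer.nvidia.com/devforum/discussion/comment/8976.)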
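The overhead, while important in real applications, is eliminated from the measurement in this test program. I have run the program with 10000 repetitions instead of 10 and got the same values (less than 0.01 ns difference). – 2012-02-13 11:20:40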
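You are right that if the workload is transfer-heavy or compute-heavy, the relative gain will be much smaller. But I was not expecting a 2x or 3x speedup, only a runtime equal to the maximum of the three operations involved. The actual runtime is almost 60% longer than that, and this has to come from somewhere. Why 60% in this case, and what would happen on other GPUs? – 2012-02-13 11:46:46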