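CUDA device-to-host copy is very slow

I am running Windows 7 64-bit with CUDA 4.2 and Visual Studio 2010, and the CUDA device-to-host copy is very slow.

First, I run some code on CUDA and download the data back to the host. Then I do some processing and go back to the device. After that I perform the following device-to-host copy, which runs very fast, around 1 ms.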

clock_t start, end;
int count = 1000000;
thrust::host_vector<int> h_a(count);
thrust::device_vector<int> d_b(count, 0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start = clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());  // device-to-host copy
end = clock();
cout << "Time Spent:" << end - start << endl;  // elapsed clock ticks

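It takes ~1 ms to complete.

Then I ran some other code on CUDA, mainly atomic operations. Copying the data from the device to the host afterwards takes a very long time, around 9 s.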

__global__ void dosomething(int *d_bPtr) 
{ 
.... 
atomicExch(d_bPtr, c);
.... 
} 

start=clock(); 
thrust::copy(d_b.begin(), d_b.end(), h_a.begin()); 
end=clock(); 
cout<<"Time Spent:"<<end-start<<endl;

〜787-9

I ran my code multiple times, for example:

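// Kernel defined once at file scope (its body is elided in the original post).
__global__ void dosomething(int *d_bPtr)
{
    ....
    atomicExch(d_bPtr, c);
    ....
}

int i = 0;
while (i < 10)
{
    clock_t start, end;
    int count = 1000000;
    thrust::host_vector<int> h_a(count);
    thrust::device_vector<int> d_b(count, 0);
    int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);

    // First copy (the fast one, ~1 ms).
    start = clock();
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    end = clock();
    cout << "Time Spent:" << end - start << endl;

    // dosomething is launched on d_bPtr here (the launch is not shown in the original post).

    // Second copy (the slow one).
    start = clock();
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    end = clock();
    cout << "Time Spent:" << end - start << endl;
    i++;
}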

The results are almost the same every time.
What could be the problem?

Thanks!

Source

2012-10-09 UserKiwi

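I still don't understand how you can use 'thrust::raw_ptr_cast' on the first index of a 'device_vector'. I tried to run a snippet of your code and got the error 'argument list for class template "thrust::device_ptr" is missing' ... – Recker

Sorry, my bad. It should be int *device_ptr = thrust::raw_pointer_cast(&d_b[0]); I will update it. Do you think this is causing the problem? Or should I use d_b.begin() directly as the input to the atomic operation? Thanks! – UserKiwi

Can you post the shortest reproducer you can come up with? I tried to build a simple example from your code but did not see any problem. The code contains various odd syntax errors, so a compilable reproducer would help. –

The problem is one of timing, not any change in copy performance. Kernel launches are asynchronous in CUDA, so what you are measuring is not just the time for thrust::copy, but also the time it takes the previously launched kernel to finish. If you change the code that times the copy operation to something like this: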

cudaDeviceSynchronize(); // wait until prior kernel is finished 
start=clock(); 
thrust::copy(d_b.begin(), d_b.end(), h_a.begin()); 
end=clock(); 
cout<<"Time Spent:"<<end-start<<endl;

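You should find that the transfer time returns to its earlier performance. So your real question is not "why is thrust::copy slow", it is "why is my kernel slow". Based on the rather terrible pseudocode you posted, the answer is "because it is full of atomicExch() calls, which serialize the kernel's memory transactions".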
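A minimal sketch of the same idea using CUDA events instead of clock(), so that the kernel and the copy are timed separately. The kernel contended_exchange, its launch configuration, and the iteration count are hypothetical stand-ins for the elided dosomething kernel, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Hypothetical stand-in for the elided kernel: every thread performs
// atomicExch on the same address, so the updates cannot proceed in parallel.
__global__ void contended_exchange(int *target, int iterations)
{
    int value = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < iterations; ++i) {
        atomicExch(target, value);
    }
}

struct Timings {
    float kernel_ms;  // time from launch until the kernel has finished
    float copy_ms;    // time from kernel completion until the copy is done
};

Timings time_kernel_and_copy(int count)
{
    thrust::host_vector<int> h_a(count);
    thrust::device_vector<int> d_b(count, 0);
    int *d_bPtr = thrust::raw_pointer_cast(d_b.data());

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    cudaEventRecord(start);
    // The launch returns to the host immediately; the kernel keeps running.
    contended_exchange<<<256, 256>>>(d_bPtr, 1000);
    // This event completes only once the kernel ahead of it has finished.
    cudaEventRecord(afterKernel);

    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    cudaEventRecord(afterCopy);
    cudaEventSynchronize(afterCopy);  // wait until all recorded work is done

    Timings t;
    cudaEventElapsedTime(&t.kernel_ms, start, afterKernel);
    cudaEventElapsedTime(&t.copy_ms, afterKernel, afterCopy);

    cudaEventDestroy(start);
    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
    return t;
}

int main()
{
    Timings t = time_kernel_and_copy(1000000);
    std::printf("kernel: %.3f ms, copy: %.3f ms\n", t.kernel_ms, t.copy_ms);
    return 0;
}
```

Because the events are recorded in the same stream as the kernel and the copy, the kernel's run time is reported under kernel_ms rather than being folded into the copy measurement, which is what happens with a host-side clock() placed around thrust::copy alone.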

Source

2012-10-09 05:05:50 talonmies

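Thanks talonmies!! I will follow your suggestion and run the code again tomorrow to see. Sorry about the terrible pseudocode. I am new to CUDA and I don't have my source code available right now... Thank you very much! – UserKiwi

I tested it today and talonmies is exactly right! Everything is just as he described! Thank you very much! – UserKiwi

@UserKiwi: If this answered your question, then perhaps you could [accept it](http://meta.stackexchange.com/a/5235/163653). That marks the question as resolved. – talonmies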

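I suggest you use cudpp, which in my opinion is faster than Thrust (I am writing my thesis on optimization and I tried both libraries). If the copy is very slow, you could try writing your own kernel to copy the data.

Source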

2012-10-09 05:50:14
