CUDA memory (type) used only for device computation across kernel calls (compute capability 1.1 or 1.2)

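I am currently learning CUDA, and my algorithm has to perform some heavy calculations based on some input data. These calculations are done in a loop that runs up to 1024 times. Everything works fine as long as each kernel has a small number of threads (< 100'000), but if I want to use more threads, the kernel gets interrupted by Windows because it takes too long to finish.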

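My solution is to split the heavy calculation across several kernel calls:

A main kernel, which prepares the input data and calculates the first x rounds (loop unrolled). This is called only once per input.
A work kernel, which performs the next x rounds (loop unrolled). This is called as often as needed to calculate all required rounds.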

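Between the kernel calls (one main, many work), I have to keep 16 + length bytes of data that will be used in the next call (length is the length of the input, and it is fixed per main call). The main kernel writes these bytes initially, and each work kernel reads them, runs the next calculations and overwrites the original data with the new result. I only need this data on the device; no host access is required. Which kind of memory do I have to use? At the least it has to be global memory, because that is the only writable memory that persists between kernel calls, isn't it? But then what? Can you give me advice on how to proceed with the right memory (and the best performance)?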

In "pseudo" code this could look like this:

prepare memory to hold threads * (16 + length) bytes 

for length = 1 to x step 1
    call mainKernel
    rounds = 1024 - rounds_done_in_main
    for rounds to 0 step rounds_done_in_work
        call workKernel
    end for
end for

cleanup memory 

-------- 

template <unsigned char length> __global__ void mainKernel() {
    unsigned char input[length]; 
    unsigned char output[16]; 
    const int tid = ...; 

    devPrepareInput<length>(input); 

    calc round 1: doSomething<length>(output, input) 
    calc round 2: doSomething<length>(output, output + input) // '+' == append 

    write data to memory based on tid // data == output + input 
} 

template <unsigned char length, remaining rounds> __global__ void workKernel() {
    unsigned char *input;
    unsigned char *output;
    const int tid = ...;

    read data from memory based on tid
    output = data
    input = data + 16

    if rounds >= 1
        calc round x : doSomething<length>(output, output + input)
    if rounds >= 2
        calc round x+1: doSomething<length>(output, output + input) // '+' == append

    if rounds == x // x is the number of rounds in the last work call
        do final steps on output
    else
        write output + input to memory based on tid (for next call)
}


2013-04-08 grubi

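If there are enough blocks, it is much easier to reduce the grid size and launch the kernel several times, adding an appropriate offset to the block number inside the kernel. – tera 2013-04-08 10:49:53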
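The suggestion in this comment can be sketched roughly as follows. This is only an illustration under assumptions: the names chunk_kernel, run_chunked_demo, block_offset and the placeholder calculation idx * 2 are hypothetical and do not come from the thread. Each launch covers only a slice of the blocks, so a single launch stays short, and the kernel adds the offset to blockIdx.x to recover its global index.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: block_offset shifts blockIdx.x so that several small
// launches together cover the same index range as one large grid.
__global__ void chunk_kernel(int *out, int total_threads, int block_offset) {
    int block = blockIdx.x + block_offset;
    int idx = threadIdx.x + blockDim.x * block;
    if (idx < total_threads) out[idx] = idx * 2;  // placeholder calculation
}

// Launches the kernel in slices of at most blocks_per_launch blocks and
// returns true if every element was written exactly as expected.
bool run_chunked_demo(int total_threads, int threads_per_block, int blocks_per_launch) {
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, total_threads * sizeof(int)) != cudaSuccess) return false;

    int total_blocks = (total_threads + threads_per_block - 1) / threads_per_block;
    for (int offset = 0; offset < total_blocks; offset += blocks_per_launch) {
        int remaining = total_blocks - offset;
        int blocks = remaining < blocks_per_launch ? remaining : blocks_per_launch;
        chunk_kernel<<<blocks, threads_per_block>>>(d_out, total_threads, offset);
    }

    // The copy back to the host exists only to verify the sketch.
    std::vector<int> host(total_threads);
    bool ok = cudaDeviceSynchronize() == cudaSuccess &&
              cudaMemcpy(host.data(), d_out, total_threads * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < total_threads; i++) ok = (host[i] == i * 2);

    cudaFree(d_out);
    return ok;
}

int main() {
    bool ok = run_chunked_demo(100000, 256, 64);
    printf("chunked launch %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Because each thread here finishes all of its work within one launch, no intermediate state has to be kept between launches, which is why this approach is simpler than splitting the rounds.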

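It is hard to offer advice about performance when all you provide is pseudocode containing lots of template instantiations of "do something". Can you be more specific about what you really want to know? – talonmies 2013-04-08 11:13:54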

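The code simply does not matter, and it would be too cumbersome to post here. I am asking what I have to do / which kind of device memory I can use to keep data between kernel calls (the lines that read/write data). Performance is related to that memory. – grubi 2013-04-08 11:39:15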

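Yes, you can do this with device memory. A variable declared with __device__ provides a static declaration of a buffer that kernels can use directly, without any cudaMemcpy operations and without a pointer being passed explicitly to the kernel. Since it has the lifetime of the application, the data in it persists from one kernel call to the next.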

#define NUM_THREADS 1024
#define DATA_PER_THREAD 16
// statically allocated global memory; persists across kernel launches
__device__ int temp_data[NUM_THREADS*DATA_PER_THREAD];

__global__ void my_kernel1(...){
    int my_data[DATA_PER_THREAD] = {0};
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    // perform calculations

    // write out temp data
    for (int i = 0; i < DATA_PER_THREAD; i++) temp_data[i + (idx * DATA_PER_THREAD)] = my_data[i];
}

__global__ void my_kernel2(...){
    int my_data[DATA_PER_THREAD];
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    // read in temp data
    for (int i = 0; i < DATA_PER_THREAD; i++) my_data[i] = temp_data[i + (idx * DATA_PER_THREAD)];
    // perform calculations

}

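There are several ways to optimize this, depending on your usage pattern in the kernels. The transfer into my_data is not really necessary. Obviously, your kernel code can access temp_data directly instead of my_data, with appropriate indexing.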
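A minimal sketch of that direct-access variant, under assumptions: the kernel names first_kernel and next_kernel, the host helper run_direct_access_demo and the placeholder calculations are made up for illustration. Each thread works on its own slice of temp_data in place, and the copy back to the host is there only to check the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_THREADS 1024
#define DATA_PER_THREAD 16

// Statically allocated global memory; contents persist across kernel launches.
__device__ int temp_data[NUM_THREADS * DATA_PER_THREAD];

// Hypothetical first kernel: writes per-thread state straight into temp_data.
__global__ void first_kernel() {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx >= NUM_THREADS) return;
    int *state = &temp_data[idx * DATA_PER_THREAD];
    for (int i = 0; i < DATA_PER_THREAD; i++) state[i] = idx + i;  // placeholder
}

// Hypothetical follow-up kernel: updates the saved state in place.
__global__ void next_kernel() {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx >= NUM_THREADS) return;
    int *state = &temp_data[idx * DATA_PER_THREAD];
    for (int i = 0; i < DATA_PER_THREAD; i++) state[i] += 1;  // placeholder round
}

// Runs one first_kernel launch plus extra_launches follow-up launches and
// returns true if the state carried over correctly between launches.
bool run_direct_access_demo(int extra_launches) {
    first_kernel<<<NUM_THREADS / 256, 256>>>();
    for (int r = 0; r < extra_launches; r++) next_kernel<<<NUM_THREADS / 256, 256>>>();
    if (cudaDeviceSynchronize() != cudaSuccess) return false;

    // Host access is not needed by the algorithm; this copy is only a check.
    static int host[NUM_THREADS * DATA_PER_THREAD];
    if (cudaMemcpyFromSymbol(host, temp_data, sizeof(host)) != cudaSuccess) return false;
    for (int t = 0; t < NUM_THREADS; t++)
        for (int i = 0; i < DATA_PER_THREAD; i++)
            if (host[t * DATA_PER_THREAD + i] != t + i + extra_launches) return false;
    return true;
}

int main() {
    bool ok = run_direct_access_demo(3);
    printf("direct access %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```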

If you do want to load/store it, you can interleave the data in the for loops that read and write it, to allow coalesced access.
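One way to read "interleave" here, as a sketch under assumptions: element i of thread idx is stored at temp_data[i * NUM_THREADS + idx] instead of temp_data[i + idx * DATA_PER_THREAD], so that in each loop iteration neighbouring threads touch neighbouring addresses. The kernel names, the mismatches counter and the placeholder values are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_THREADS 1024
#define DATA_PER_THREAD 16

__device__ int temp_data[NUM_THREADS * DATA_PER_THREAD];

// Hypothetical kernel: stores per-thread data in an interleaved layout.
__global__ void save_kernel() {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx >= NUM_THREADS) return;
    int my_data[DATA_PER_THREAD];
    for (int i = 0; i < DATA_PER_THREAD; i++) my_data[i] = idx * 100 + i;  // placeholder
    // For a fixed i, thread idx and thread idx + 1 write adjacent elements.
    for (int i = 0; i < DATA_PER_THREAD; i++) temp_data[i * NUM_THREADS + idx] = my_data[i];
}

// Hypothetical kernel: reloads the data with the same interleaved indexing
// and counts any element that does not match what save_kernel stored.
__global__ void load_kernel(int *mismatches) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx >= NUM_THREADS) return;
    int my_data[DATA_PER_THREAD];
    for (int i = 0; i < DATA_PER_THREAD; i++) my_data[i] = temp_data[i * NUM_THREADS + idx];
    for (int i = 0; i < DATA_PER_THREAD; i++)
        if (my_data[i] != idx * 100 + i) atomicAdd(mismatches, 1);
}

// Returns the number of mismatching elements, or -1 on a CUDA error.
int run_interleaved_demo() {
    int *d_mismatches = nullptr;
    if (cudaMalloc(&d_mismatches, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_mismatches, 0, sizeof(int));

    save_kernel<<<NUM_THREADS / 256, 256>>>();
    load_kernel<<<NUM_THREADS / 256, 256>>>(d_mismatches);

    int mismatches = -1;
    if (cudaMemcpy(&mismatches, d_mismatches, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) mismatches = -1;
    cudaFree(d_mismatches);
    return mismatches;
}

int main() {
    int mismatches = run_interleaved_demo();
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Whether this layout pays off depends on the device and the access pattern; on compute capability 1.x hardware the coalescing rules are strict, so it is worth measuring both layouts.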


2014-02-10 05:57:25
