EDIT: As indicated in the comments, if the same plan (the same created handle) is used for simultaneous FFT execution on the same device via streams, then the user is responsible for managing separate work areas for each such usage of the plan. The question seemed to be focused on the streaming behavior itself, and the rest of my answer focuses on that as well, but this is an important point.
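A minimal sketch (not from the original answer) of how separate work areas might be supplied for each usage of a shared plan, assuming cufftSetAutoAllocation() is used to disable the plan's internal work-area allocation and cufftSetWorkArea() is called before each exec call; the sizes below are illustrative:
#include <cufft.h>
#include <cuda_runtime.h>
#include <assert.h>
#define DSIZE 1048576
#define BATCH 10
int main(){
  size_t ws = 0;
  cufftHandle plan;
  cufftResult res = cufftCreate(&plan);
  assert(res == CUFFT_SUCCESS);
  // let us own the work area instead of the plan allocating one internally
  res = cufftSetAutoAllocation(plan, 0);
  assert(res == CUFFT_SUCCESS);
  res = cufftMakePlan1d(plan, DSIZE, CUFFT_C2C, BATCH, &ws);
  assert(res == CUFFT_SUCCESS);
  // one work area (and one data buffer) per intended concurrent usage
  void *wa1, *wa2;
  cudaMalloc(&wa1, ws);
  cudaMalloc(&wa2, ws);
  cufftComplex *d1, *d2;
  cudaMalloc(&d1, (size_t)DSIZE*BATCH*sizeof(cufftComplex));
  cudaMalloc(&d2, (size_t)DSIZE*BATCH*sizeof(cufftComplex));
  cudaMemset(d1, 0, (size_t)DSIZE*BATCH*sizeof(cufftComplex));
  cudaMemset(d2, 0, (size_t)DSIZE*BATCH*sizeof(cufftComplex));
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  // usage 1: stream s1 with its own work area
  res = cufftSetStream(plan, s1);                   assert(res == CUFFT_SUCCESS);
  res = cufftSetWorkArea(plan, wa1);                assert(res == CUFFT_SUCCESS);
  res = cufftExecC2C(plan, d1, d1, CUFFT_FORWARD);  assert(res == CUFFT_SUCCESS);
  // usage 2: stream s2 with a separate work area
  res = cufftSetStream(plan, s2);                   assert(res == CUFFT_SUCCESS);
  res = cufftSetWorkArea(plan, wa2);                assert(res == CUFFT_SUCCESS);
  res = cufftExecC2C(plan, d2, d2, CUFFT_FORWARD);  assert(res == CUFFT_SUCCESS);
  cudaDeviceSynchronize();
  return 0;
}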
If I set a cuFFT plan's stream to stream 0 via cuFFTSetStream() for tile 0, but tile 0's FFTs have not actually begun executing on the GPU before the host sets the shared cuFFT plan's stream to stream 1 for tile 1 and then issues tile 1's work to the GPU, what is the behavior of cuFFTExec() for this plan?
Let me pretend you said streams 1 and 2, just to avoid any possible confusion with the NULL stream.
cuFFT should respect the stream that was defined for the plan at the time the plan was passed to cuFFT via cufftExecXXX(). Subsequent changes to the plan via cufftSetStream() should have no effect on the stream used for previously issued cufftExecXXX() calls.
We can verify this with a fairly simple test, using the profiler. Consider the following test code:
$ cat t1089.cu
// NOTE: this code omits independent work-area handling for each plan
// which is necessary for a plan that will be shared between streams
// and executed concurrently
#include <cufft.h>
#include <assert.h>
#include <nvToolsExt.h>
#define DSIZE 1048576
#define BATCH 100
int main(){
const int nx = DSIZE;
const int nb = BATCH;
size_t ws = 0;
cufftHandle plan;
cufftResult res = cufftCreate(&plan);
assert(res == CUFFT_SUCCESS);
res = cufftMakePlan1d(plan, nx, CUFFT_C2C, nb, &ws);
assert(res == CUFFT_SUCCESS);
cufftComplex *d;
cudaMalloc(&d, nx*nb*sizeof(cufftComplex));
cudaMemset(d, 0, nx*nb*sizeof(cufftComplex));
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
res = cufftSetStream(plan, s1);
assert(res == CUFFT_SUCCESS);
res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);
assert(res == CUFFT_SUCCESS);
res = cufftSetStream(plan, s2);
assert(res == CUFFT_SUCCESS);
nvtxMarkA("plan stream change");
res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);
assert(res == CUFFT_SUCCESS);
cudaDeviceSynchronize();
return 0;
}
$ nvcc -o t1089 t1089.cu -lcufft -lnvToolsExt
$ cuda-memcheck ./t1089
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
We're simply doing two forward FFTs in a row, switching streams in between. We'll use an nvtx marker to clearly identify the point at which the request to change the plan's stream association occurs. Now let's look at the nvprof --print-api-trace output (with the lengthy start-up preamble removed):
983.84ms 617.00us cudaMalloc
984.46ms 21.628us cudaMemset
984.48ms 37.546us cudaStreamCreate
984.52ms 121.34us cudaStreamCreate
984.65ms 995ns cudaPeekAtLastError
984.67ms 996ns cudaConfigureCall
984.67ms 517ns cudaSetupArgument
984.67ms 21.908us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416])
984.69ms 349ns cudaGetLastError
984.69ms 203ns cudaPeekAtLastError
984.70ms 296ns cudaConfigureCall
984.70ms 216ns cudaSetupArgument
984.70ms 8.8920us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421])
984.71ms 272ns cudaGetLastError
984.71ms 177ns cudaPeekAtLastError
984.72ms 314ns cudaConfigureCall
984.72ms 229ns cudaSetupArgument
984.72ms 9.9230us cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426])
984.73ms 295ns cudaGetLastError
984.77ms - [Marker] plan stream change
984.77ms 434ns cudaPeekAtLastError
984.78ms 357ns cudaConfigureCall
984.78ms 228ns cudaSetupArgument
984.78ms 10.642us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431])
984.79ms 287ns cudaGetLastError
984.79ms 193ns cudaPeekAtLastError
984.80ms 293ns cudaConfigureCall
984.80ms 208ns cudaSetupArgument
984.80ms 7.7620us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436])
984.81ms 297ns cudaGetLastError
984.81ms 178ns cudaPeekAtLastError
984.81ms 269ns cudaConfigureCall
984.81ms 214ns cudaSetupArgument
984.81ms 7.4130us cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441])
984.82ms 312ns cudaGetLastError
984.82ms 152.63ms cudaDeviceSynchronize
$
We see that each FFT operation requires 3 kernel calls. In between, we see our nvtx marker indicating when the request for the plan stream change occurred, and it's no surprise that this takes place after the first 3 kernel launches but before the last 3. Finally, we note that essentially all of the execution time is absorbed in the final cudaDeviceSynchronize() call. All of the preceding calls are asynchronous and so execute more or less "immediately" within the first millisecond of execution. The final synchronize absorbs the processing time of all 6 kernels, amounting to roughly 150 ms in total.
So if cufftSetStream were to have an effect on the first iteration of the cufftExecC2C() call, we would expect to see some or all of the first 3 kernels launched into the same stream as the one used for the last 3 kernels. But when we look at the nvprof --print-gpu-trace output:
$ nvprof --print-gpu-trace ./t1089
==3757== NVPROF is profiling process 3757, command: ./t1089
==3757== Profiling application: ./t1089
==3757== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
974.74ms 7.3440ms - - - - - 800.00MB 106.38GB/s Quadro 5000 (0) 1 7 [CUDA memset]
982.09ms 23.424ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 13 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416]
1.00551s 21.172ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 13 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421]
1.02669s 27.551ms (25600 1 1) (16 16 1) 61 17.000KB 0B - - Quadro 5000 (0) 1 13 void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426]
1.05422s 23.592ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 14 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431]
1.07781s 21.157ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 14 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436]
1.09897s 27.913ms (25600 1 1) (16 16 1) 61 17.000KB 0B - - Quadro 5000 (0) 1 14 void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
We see that in fact the first 3 kernels are issued into the first stream, and the last 3 kernels are issued into the second stream, just as requested. (And the total execution time of all the kernels is approximately 150 ms, just as the API trace output suggested.) Since the underlying kernel launches are asynchronous and are issued before the cufftExecC2C() call returns, if you think about this carefully you'll come to the conclusion that it has to work this way: the stream a kernel is launched into is specified at kernel launch time. (And I believe this is considered the "preferred" behavior.)
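As a small illustration of that last point (a hypothetical kernel, not cuFFT code): the stream is part of the launch configuration itself, so changing a host-side stream variable afterwards cannot move work that was already launched:
__global__ void dummy(float *p){ p[threadIdx.x] += 1.0f; }

void demo(float *d, cudaStream_t s1, cudaStream_t s2){
  cudaStream_t current = s1;
  dummy<<<1, 32, 0, current>>>(d);  // bound to s1 at the moment of launch
  current = s2;                     // no effect on the launch above
  dummy<<<1, 32, 0, current>>>(d);  // this one is bound to s2
}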
In this case (the same plan reused in 2 streams), if the kernels were to run concurrently, wouldn't there be a work-area conflict? What prevents the kernels in stream 14 from overwriting the intermediate results being used by stream 13? – KQS
Yes, that's correct. I've updated my answer. In this particular case, based on the profiler output, the actual execution had no overlap, which is generally my experience for FFTs of any reasonable size; but if there is overlap in practice, or the intent of overlap, then separate work areas should be used for correctness. –