nvprof events "fb_subp0_read_sectors" and "fb_subp1_read_sectors" do not report the correct results

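I am trying to count the DRAM (global memory) accesses made by a simple vector-addition kernel. The nvprof events "fb_subp0_read_sectors" and "fb_subp1_read_sectors" do not report the results I expect.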

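__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    // Each block owns a contiguous chunk of N * blockDim.x elements.
    int blockStartIndex = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex = threadStartIndex + (N * blockDim.x);
    int i;

    // Each thread handles N elements, striding by blockDim.x so that
    // neighbouring threads touch neighbouring addresses (coalesced access).
    for (i = threadStartIndex; i < threadEndIndex; i += blockDim.x) {
        C[i] = A[i] + B[i];
    }
}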

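Grid size = 180, block size = 128

Array size = 180 * 128 * N floats, where N is an input parameter (elements per thread)

With N = 1, array size = 180 * 128 * 1 floats = 90 KB

All three arrays A, B and C should be accessed in DRAM (A and B read, C written).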

So, in theory:

DRAM writes (C) = 2880 (32-byte accesses)
DRAM reads (A, B) = 2880 + 2880 = 5760 (32-byte accesses)
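As a sanity check on these figures, here is a minimal host-side sketch of the arithmetic. The function names are made up for illustration, and it assumes each array is 32-byte aligned and that every element is fetched from DRAM exactly once. It is plain host C++, so nvcc accepts it as well.

```cpp
#include <cstddef>

// Hypothetical helper: number of 32-byte sectors covering one float array
// of gridSize * blockSize * n elements (assumes 32-byte alignment).
std::size_t sectorsPerArray(std::size_t gridSize, std::size_t blockSize, std::size_t n)
{
    const std::size_t sectorBytes = 32;
    const std::size_t bytes = gridSize * blockSize * n * sizeof(float);
    return (bytes + sectorBytes - 1) / sectorBytes;  // round up to whole sectors
}

// Hypothetical helper: the kernel reads two arrays (A and B).
std::size_t expectedReadSectors(std::size_t gridSize, std::size_t blockSize, std::size_t n)
{
    return 2 * sectorsPerArray(gridSize, blockSize, n);
}

// Hypothetical helper: the kernel writes one array (C).
std::size_t expectedWriteSectors(std::size_t gridSize, std::size_t blockSize, std::size_t n)
{
    return sectorsPerArray(gridSize, blockSize, n);
}
```

With a grid of 180, a block of 128 and N = 1, each array is 92160 bytes, so sectorsPerArray gives 2880, expectedWriteSectors gives 2880 and expectedReadSectors gives 5760, matching the numbers above.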

However, when I profile with nvprof I get:

DRAM writes = fb_subp0_write_sectors + fb_subp1_write_sectors = 1440 + 1440 = 2880 (32-byte accesses)
DRAM reads = fb_subp0_read_sectors + fb_subp1_read_sectors = 23 + 7 = 30 (32-byte accesses)

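This is the problem. In theory there should be 5760 DRAM reads, but nvprof reports only 30, which looks impossible to me. Moreover, if I double the size of the vectors (N = 2), the reported number of DRAM reads is still 30.

It would be great if someone could shed some light on this.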

I have disabled the L1 cache with the compiler option "-Xptxas -dlcm=cg".

Thanks, Waruna


2014-01-13 warunapww

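If you call cudaMemcpy before the kernel launch to copy the source buffers from host to device, the copy leaves the source buffers in the L2 cache. The kernel's reads then hit in L2 instead of missing, so very few of them reach DRAM and the sum fb_subp0_read_sectors + fb_subp1_read_sectors comes out low.

If you comment out the cudaMemcpy calls before the kernel launch, you will see that the fb_subp0_read_sectors and fb_subp1_read_sectors event values are the ones you expect.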
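The experiment can be sketched roughly as follows. This is not the asker's actual host code, which was not posted: the check helper, the command-line arguments and the fill values are assumptions made for illustration. Passing 0 as the second argument skips the host-to-device copies, so the kernel reads uninitialized device memory; the sums are then meaningless, but the memory traffic is what is being measured.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int blockStartIndex = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex = threadStartIndex + (N * blockDim.x);
    for (int i = threadStartIndex; i < threadEndIndex; i += blockDim.x) {
        C[i] = A[i] + B[i];
    }
}

// Hypothetical helper: abort with a message on any CUDA runtime error.
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main(int argc, char** argv)
{
    const int gridSize = 180, blockSize = 128;
    int N = (argc > 1) ? std::atoi(argv[1]) : 1;          // elements per thread
    if (N < 1) N = 1;
    const bool copyInputs = (argc > 2) ? (std::atoi(argv[2]) != 0) : true;

    const size_t count = static_cast<size_t>(gridSize) * blockSize * N;
    const size_t bytes = count * sizeof(float);

    std::vector<float> hA(count, 1.0f), hB(count, 2.0f), hC(count, 0.0f);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    check(cudaMalloc(&dA, bytes), "cudaMalloc A");
    check(cudaMalloc(&dB, bytes), "cudaMalloc B");
    check(cudaMalloc(&dC, bytes), "cudaMalloc C");

    if (copyInputs) {
        // Per the answer above, these copies can leave A and B resident in L2.
        check(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice), "copy A");
        check(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice), "copy B");
    }

    AddVectors<<<gridSize, blockSize>>>(dA, dB, dC, N);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    check(cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost), "copy C");
    std::printf("C[0] = %f (only meaningful when inputs were copied)\n", hC[0]);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

Assuming the file is saved as add.cu (a hypothetical name), it could be built with `nvcc -Xptxas -dlcm=cg add.cu -o add` and profiled with `nvprof --events fb_subp0_read_sectors,fb_subp1_read_sectors,fb_subp0_write_sectors,fb_subp1_write_sectors ./add 1 0`, then again with `./add 1 1` to compare the two cases. Which events are available, and how large the L2 cache is, depends on the GPU.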


2014-01-15 06:12:00 Swapna

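I also copy array C from host to device, so there should not be any L2 write misses either. But when I check the profiler, it shows 2880 write misses. Can you explain this behavior? – warunapww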
