CUDA clock() results in zero clock cycles

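I want to use clock() to compare different kernel implementations. I tried it on a simple SAXPY example, but it reports zero clock cycles, which is very unlikely to be correct.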

I have found some examples of how to use clock(), here and here. But somehow, transferring them to my code does not work.

Here is the code I am using:

/* SAXPY code example from https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/ */ 

#include <stdio.h> 
#include <stdlib.h> // malloc, free

// The declaration specifier __global__ defines a kernel. This code 
// will be copied to the device and will be executed there in parallel 
__global__ 
void saxpy(int n, float a, float *x, float *y, int *kernel_clock) 
{ 
    // The indexing of the single threads is done with the following 
    // code line 
    int i = blockIdx.x*blockDim.x + threadIdx.x; 

    clock_t start = clock(); 

    // Each thread is executing just one position of the arrays 
    if (i < n) y[i] = a*x[i] + y[i]; 

    clock_t stop = clock(); 

    kernel_clock[i] = (int) (stop-start); 
} 

int main(void) 
{ 
    // Clock cycles of threads 
    int *kernel_clock; 
    int *d_kernel_clock; 
    // Creating a huge number 
    int N = 1<<20; 
    float *x, *y, *d_x, *d_y; 
    // Allocate an array on the *host* of the size of N 
    x = (float*)malloc(N*sizeof(float)); 
    y = (float*)malloc(N*sizeof(float)); 
    kernel_clock = (int*)malloc(N*sizeof(int)); 

    // Allocate an array on the *device* of the size of N 
    cudaMalloc(&d_x, N*sizeof(float)); 
    cudaMalloc(&d_y, N*sizeof(float)); 
    cudaMalloc(&d_kernel_clock, N*sizeof(int)); 

    // Filling the array of the host 
    for (int i = 0; i < N; i++) { 
        x[i] = 1.0f; 
        y[i] = 2.0f; 
    } 

    // Copy the host array to the device array 
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice); 
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice); 
    cudaMemcpy(d_kernel_clock, kernel_clock, N*sizeof(int), cudaMemcpyHostToDevice); 

    // Perform SAXPY on 1M elements. The triple chevrons dictate how 
    // the threads are grouped on the device 
    saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y, d_kernel_clock); 
    cudaDeviceSynchronize(); 

    // Copy the result from the device to the host 
    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost); 
    cudaMemcpy(kernel_clock, d_kernel_clock, N*sizeof(int), cudaMemcpyDeviceToHost); 

    // Calculate average clock time 
    float average_clock = 0; 
    for (int i = 0; i < N; i++) { 
        average_clock += (float) (kernel_clock[i]); 
    } 
    average_clock /= N; 

    // Display the time to the screen 
    printf ("Kernel clock cycles: %.4f\n", average_clock); 

    // Free the memory on the host and device 
    free(x); 
    free(y); 
    free(kernel_clock); 
    cudaFree(d_x); 
    cudaFree(d_y); 
    cudaFree(d_kernel_clock); 
}

This code example results in:

Kernel clock cycles: 0.0000

I don't know what I am doing wrong. So my question is: how can I get a reasonable result?

2016-10-03 stebran

I don't see any error checking. What happens if you run your code with 'cuda-memcheck'?

'cuda-memcheck' reports 0 errors: '======== ERROR SUMMARY: 0 errors' – stebran
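
The comment above notes that the posted code has no error checking. As a minimal sketch of what that could look like: the `CUDA_CHECK` macro and the `fill` kernel below are hypothetical names made up for this illustration and are not part of the CUDA toolkit or of the original code. Only `cudaGetLastError`, `cudaGetErrorString`, and the other runtime calls are real API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA toolkit): abort with a
// readable message if a runtime API call returns an error.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical toy kernel, used only to have something to launch.
__global__ void fill(int *out, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

int main(void)
{
    const int n = 1 << 20;
    int *d_out = NULL;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(int)));

    fill<<<(n + 255) / 256, 256>>>(d_out, n, 7);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    int first = 0;
    CUDA_CHECK(cudaMemcpy(&first, d_out, sizeof(int), cudaMemcpyDeviceToHost));
    printf("first element: %d\n", first);

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Wrapping every runtime call this way would not change the zero-cycle result here, since 'cuda-memcheck' already reports no errors, but it rules out a silently failed allocation, copy, or launch as the cause.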

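A quote from one of the answers you linked to in your question:

You should also be aware that the compiler and assembler do perform instruction reordering, so you might want to check that the clock calls don't end up next to each other in the SASS output (use cuobjdump to check).

I believe this is the source of your problem. If I compile your kernel with the CUDA 8 release toolkit and then disassemble the resulting machine code with cuobjdump, I get the following: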

code for sm_52 
      Function : _Z5saxpyifPfS_Pi 
    .headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)" 
                          /* 0x001c4400fe0007f6 */ 
    /*0008*/     MOV R1, c[0x0][0x20];          /* 0x4c98078000870001 */ 
    /*0010*/   {   CS2R R7, SR_CLOCKLO;          /* 0x50c8000005070007 */ 
    /*0018*/     S2R R0, SR_CTAID.X;  }        /* 0xf0c8000002570000 */ 
                          /* 0x083fc400e3e007f0 */ 
    /*0028*/   {   CS2R R8, SR_CLOCKLO;          /* 0x50c8000005070008 */ 
    /*0030*/     S2R R2, SR_TID.X;  }         /* 0xf0c8000002170002 */ 
    /*0038*/     XMAD.MRG R3, R0.reuse, c[0x0][0x8].H1, RZ;     /* 0x4f107f8000270003 */ 
                          /* 0x081fc400fec207f6 */ 
    /*0048*/     XMAD R2, R0.reuse, c[0x0][0x8], R2;      /* 0x4e00010000270002 */ 
    /*0050*/     XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2;       /* 0x5b30011800370000 */ 
    /*0058*/     ISETP.GE.AND P0, PT, R0.reuse, c[0x0][0x140], PT;   /* 0x4b6d038005070007 */ 
                          /* 0x001fd400fc2007ec */ 
    /*0068*/     SHR R9, R0, 0x1f;           /* 0x3829000001f70009 */ 
    /*0070*/    @!P0 SHF.L.U64 R2, RZ, 0x2, R0;         /* 0x36f800400028ff02 */ 
    /*0078*/    @!P0 SHF.L.U64 R3, R0, 0x2, R9;         /* 0x36f804c000280003 */ 
                          /* 0x001fc040fe4207f6 */ 
    /*0088*/    @!P0 IADD R4.CC, R2.reuse, c[0x0][0x148];      /* 0x4c10800005280204 */ 
    /*0090*/    @!P0 IADD.X R5, R3.reuse, c[0x0][0x14c];       /* 0x4c10080005380305 */ 
    /*0098*/   { @!P0 IADD R2.CC, R2, c[0x0][0x150];        /* 0x4c10800005480202 */ 
    /*00a8*/    @!P0 LDG.E R4, [R4];  }         /* 0x0005c400fe400076 */ 
                          /* 0xeed4200000080404 */ 
    /*00b0*/    @!P0 IADD.X R3, R3, c[0x0][0x154];        /* 0x4c10080005580303 */ 
    /*00b8*/    @!P0 LDG.E R6, [R2];            /* 0xeed4200000080206 */ 
                          /* 0x001fd800fea007e1 */ 
    /*00c8*/     LEA R10.CC, R0, c[0x0][0x158], 0x2;       /* 0x4bd781000567000a */ 
    /*00d0*/     IADD R8, -R7, R8;           /* 0x5c12000000870708 */ 
    /*00d8*/     LEA.HI.X R9, R0, c[0x0][0x15c], R9, 0x2;     /* 0x1a17048005770009 */ 
                          /* 0x001fc008fe4007f1 */ 
    /*00e8*/     MOV R7, R9;             /* 0x5c98078000970007 */ 
    /*00f0*/    @!P0 FFMA R0, R4, c[0x0][0x144], R6;        /* 0x4980030005180400 */ 
    /*00f8*/   {   MOV R6, R10;            /* 0x5c98078000a70006 */ 
    /*0108*/    @!P0 STG.E [R2], R0;  }         /* 0x001ffc005e2001f2 */ 
                          /* 0xeedc200000080200 */ 
    /*0110*/     STG.E [R6], R8;            /* 0xeedc200000070608 */ 
    /*0118*/     EXIT;              /* 0xe30000000007000f */ 
                          /* 0x001f8000fc0007ff */ 
    /*0128*/     BRA 0x120;             /* 0xe2400fffff07000f */ 
    /*0130*/     NOP;              /* 0x50b0000000070f00 */ 
    /*0138*/     NOP;              /* 0x50b0000000070f00 */ 
      .................................

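You can see that the clock instructions (the two CS2R reads of SR_CLOCKLO at offsets 0010 and 0028) have been reordered so that none of the code they were meant to time lies between them: the loads, the FFMA, and the store all come after the second read. For many (if not all) warps running this code, this will result in zero or very close to zero clock measurements.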
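
Since the stated goal is to compare kernel implementations, one alternative that does not depend on where the compiler places the clock reads is to time whole kernel launches from the host with CUDA events. This is a different technique from in-kernel clock(), not a fix for it. The sketch below is only an illustration under assumptions: the repeat count of 100 and the single warm-up launch are arbitrary choices, the input buffers are simply zeroed, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same SAXPY kernel as in the question, without the clock() calls.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    const int repeats = 100;  // assumed repeat count, chosen arbitrarily

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));  // contents do not matter for timing
    cudaMemset(d_y, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One warm-up launch so one-time startup costs are not measured.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();

    // Record events in the default stream around the timed launches.
    cudaEventRecord(start);
    for (int r = 0; r < repeats; r++) {
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("Average kernel time: %.4f ms\n", ms / repeats);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

This measures wall-clock time per launch rather than per-thread cycle counts, so it answers "which kernel is faster" but not "how many cycles did this instruction sequence take". If per-thread cycle counts are still needed, the disassembly would have to be rechecked with cuobjdump after every change to confirm that the timed code really sits between the two clock reads.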

2016-10-03 14:32:42 talonmies

Thanks! I understand the problem now, but which lines in the output should I be looking at? – stebran
