
I have a simple program that squares the elements of an array. The unrolling is done once as a plain loop and once with dynamic parallelism, to compare the execution times.

Loop unrolling

#include <stdio.h> 
#include <time.h> 
#include <cuda.h> 


// Kernel that executes on the CUDA device: each thread squares two 
// consecutive elements (manual 2-way loop unrolling) 
__global__ void first(float *arr, int N) 
{ 
    int idx = 2*(blockIdx.x * blockDim.x + threadIdx.x); 
    for(int q=0;q<2;q++) 
    { 
        if(idx+q < N)   // bounds guard 
        { 
            arr[idx+q] = arr[idx+q] * arr[idx+q]; 
        } 
    } 

} 



// main routine that executes on the host 
int main(void) 
{ 
    clock_t start = clock(),diff; 
    float *a_h, *a_d; // Pointer to host & device arrays 
    const int N = 1000; // Number of elements in arrays 
    size_t size = N * sizeof(float); 
    a_h = (float *)malloc(size);  // Allocate array on host 
    cudaMalloc((void **) &a_d, size); // Allocate array on device 
    // Initialize host array and copy it to CUDA device 
    for (int i=0; i<N; i++) a_h[i] = (float)i; 
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice); 
    // Do calculation on device: 
    int block_size = 4; 
    //int n_blocks = N/block_size + (N%block_size == 0 ? 0:1); 
    first <<< 4, 128 >>> (a_d, N); 
    //cudaThreadSynchronize(); 
    // Retrieve result from device and store it in host array 
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost); 
    // Print results 
    for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]); 
    // Cleanup 
    free(a_h); cudaFree(a_d); 
    diff = clock() - start; 
    int msec = diff * 1000/CLOCKS_PER_SEC; 

    printf("Time taken %d seconds %d milliseconds\n", msec/1000, msec%1000); 

} 

Then I realized that the loop computation could be replaced by dynamic parallelism. The unrolling with dynamic parallelism was implemented as follows.

Unrolling with dynamic parallelism

#include <stdio.h> 
#include <time.h> 
#include <cuda.h> 
__global__ void square(float *a, int N,int idx); 


// Kernel that executes on the CUDA device: each parent thread launches a 
// two-thread child kernel to square its pair of elements (dynamic parallelism) 
__global__ void first(float *arr, int N) 
{ 
    int idx = 2*(blockIdx.x * blockDim.x + threadIdx.x); 
    square <<< 1,2 >>> (arr, N, idx); 
} 

// Child kernel: each of the two threads squares one element 
__global__ void square(float *a, int N,int idx) 
{ 
    int tdx = blockIdx.x * blockDim.x + threadIdx.x; 
    if(tdx+idx < N)   // bounds guard 
    { 
        a[tdx+idx] = a[tdx+idx] * a[tdx+idx]; 
    } 
} 

// main routine that executes on the host 
int main(void) 
{ 
    clock_t start = clock(),diff; 
    float *a_h, *a_d; // Pointer to host & device arrays 
    const int N = 1000; // Number of elements in arrays 
    size_t size = N * sizeof(float); 
    a_h = (float *)malloc(size);  // Allocate array on host 
    cudaMalloc((void **) &a_d, size); // Allocate array on device 
    // Initialize host array and copy it to CUDA device 
    for (int i=0; i<N; i++) a_h[i] = (float)i; 
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice); 
    // Do calculation on device: 
    int block_size = 4; 
    //int n_blocks = N/block_size + (N%block_size == 0 ? 0:1); 
    first <<< 4, 128 >>> (a_d, N); 
    //cudaThreadSynchronize(); 
    // Retrieve result from device and store it in host array 
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost); 
    // Print results 
    for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]); 
    // Cleanup 
    free(a_h); cudaFree(a_d); 
    diff = clock() - start; 
    int msec = diff * 1000/CLOCKS_PER_SEC; 

    printf("Time taken %d seconds %d milliseconds\n", msec/1000, msec%1000); 

} 

The execution with dynamic-parallelism unrolling takes more time than the plain unrolled version. Aren't we supposed to improve the execution time with dynamic parallelism in a case like this?

Answer


Dynamic parallelism is mainly useful in cases where you actually have dynamic parallelism, that is, cases where you don't know how much parallelism you are going to need until you have done some computation. Instead of transferring data back to the host, which is then immediately fed into parameterising another launch, you launch from within the kernel. In this pattern, with the memcpys between kernel launches avoided, you will see a speedup.
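A minimal sketch of that pattern (illustrative only, not the poster's code; `parent`, `process`, and `work_per_segment` are made-up names, and it assumes compute capability 3.5+ compiled with -rdc=true): a parent kernel discovers how much work each segment needs and sizes the child launch itself, with no round trip to the host.

// Child kernel: squares `count` elements of `data` 
__global__ void process(float *data, int count) 
{ 
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    if (i < count) 
        data[i] = data[i] * data[i]; 
} 

// Parent kernel: the amount of work per segment is only known on the device, 
// so each parent thread launches a child grid sized to what it found 
__global__ void parent(float *data, const int *work_per_segment, 
                       int seg_len, int num_segments) 
{ 
    int seg = blockIdx.x * blockDim.x + threadIdx.x; 
    if (seg >= num_segments) return; 
    int count = work_per_segment[seg];   // data-dependent amount of parallelism 
    if (count > 0) { 
        int threads = 256; 
        int blocks  = (count + threads - 1) / threads; 
        // Launch from the device instead of copying `count` back to the host 
        process<<<blocks, threads>>>(data + seg * seg_len, count); 
    } 
} 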

In your example above, that is not the case. You could simply have launched twice as many threads from the host. There is nothing dynamic required, because there is no parallelism available that you did not already know about at the time of the first kernel launch.
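For comparison, a minimal sketch of that flat launch (illustrative names, not from the question): one thread per element, launched entirely from the host, with no inner loop and no child kernel.

__global__ void square_all(float *arr, int N) 
{ 
    int idx = blockIdx.x * blockDim.x + threadIdx.x; 
    if (idx < N)   // bounds guard replaces both the unrolling and the child launch 
        arr[idx] = arr[idx] * arr[idx]; 
} 

// Host side: enough blocks to cover all N elements 
// int threads = 128; 
// int blocks  = (N + threads - 1) / threads; 
// square_all<<<blocks, threads>>>(a_d, N); 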

Furthermore, the performance requirements for kernels launched with dynamic parallelism are similar to those launched from the host: you have to launch a reasonable amount of work, or the launch latency will dominate your computation time.


You may want to add that, in this case, the child kernel launch overhead completely masks the computation time of running only two child threads. From a computational point of view, launching a new kernel just for that is pure overhead, for the reasons you laid out in your answer. – JackOLantern 2014-09-23 21:34:41