2013-07-26 75 views
6

NVIDIA GPU是否支持亂序執行?NVIDIA GPU上的指令級並行(ILP)和無序執行

我的第一個猜測是他們不包含這樣昂貴的硬件。但是,閱讀CUDA progamming guide時,本指南建議使用指令級並行性(ILP)來提高性能。

ILP不是支持亂序執行的硬件可以利用的功能嗎?或者NVIDIA的ILP僅僅意味着編譯器級的指令重新排序,因此它的順序在運行時仍然是固定的。換句話說,編譯器和/或程序員必須按照可以在運行時通過按順序執行來實現ILP的方式來安排指令的順序?

+6

一種亂爲了利用指令級並行不需要處理器。具有超標量執行的按序處理器也可以受益。 – njuffa

回答

5

流水線操作是一種常見的ILP技術,肯定會在NVidia的GPU上實現。我想你同意流水線不依賴於亂序執行。另外,NVidia GPU具有計算能力2.0及更高版本(2或4)的多個warp調度程序。如果你的代碼在線程中有2個(或更多)連續和獨立的指令(或者編譯器以某種方式對它進行重新排序),那麼你也可以從調度器中利用這個ILP。

下面是一個很好解釋的問題,關於2範圍warp調度器+流水線如何一起工作。 How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?

同時結帳Vasily Volkov在GTC 2010上的演示。他通過實驗發現ILP如何提高CUDA代碼性能。 http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

就GPU上亂序執行而言,我不這麼認爲。如您所知,硬件指令重新排序,推測性執行所有這些東西對於每個SM實施來說都太昂貴了。線程級並行可以填補缺乏亂序執行的空白。遇到真正的依賴關係時,其他一些經紗可以踢入並填充管道。

1

以下代碼報告指令級並行性(ILP)的示例。

示例中的__global__函數只是在兩個數組之間執行賦值。對於ILP=1的情況,我們擁有與數組元素數量N一樣多的線程,以便每個線程執行一次分配。相反,對於ILP=2的情況,我們有多個N/2線程,每個線程處理2元素。一般情況下,對於ILP=k的情況,我們有多個N/k的線程每個處理k個元素。

除了下面的代碼,我還報告了在NVIDIA GT920M(開普勒體系結構)上執行的定時,對於NILP的不同值。正如可以看出:

  1. N,存儲器帶寬接近最大一個用於GT920M卡大的值,即,14.4GB/s,達到;
  2. 對於任何固定的N,更改值ILP不會改變性能。

關於點2,我還測試麥克斯韋相同的代碼,並且觀察到相同的行爲(性能針對ILP沒有變化)。對於ILP的性能變化,請參閱The efficiency and performance of ILP for the NVIDIA Kepler architecture的報告也報告費米架構的測試。

內存速度已經由下式計算:

(2.f * 4.f * N * numITER)/(1e9 * timeTotal * 1e-3) 

其中

4.f * N * numITER 

是讀的數量或寫入,

2.f * 4.f * N * numITER 

是讀的數AND寫道,

timeTotal * 1e-3 

secondstimeTotalms)的時間。

THE CODE

// --- GT920m - 14.4 GB/s 
//  http://gpuboss.com/gpus/GeForce-GTX-280M-vs-GeForce-920M 

#include<stdio.h> 
#include<iostream> 

#include "Utilities.cuh" 
#include "TimingGPU.cuh" 

#define BLOCKSIZE 32 

#define DEBUG 

/****************************************/ 
/* INSTRUCTION LEVEL PARALLELISM KERNEL */ 
/****************************************/ 
__global__ void ILPKernel(const int * __restrict__ d_a, int * __restrict__ d_b, const int ILP, const int N) { 

    const int tid = threadIdx.x + blockIdx.x * blockDim.x * ILP; 

    if (tid >= N) return; 

    for (int j = 0; j < ILP; j++) d_b[tid + j * blockDim.x] = d_a[tid + j * blockDim.x]; 

} 

/********/ 
/* MAIN */ 
/********/ 
int main() { 

    //const int N = 8192; 
    const int N = 524288 * 32; 
    //const int N = 1048576; 
    //const int N = 262144; 
    //const int N = 2048; 

    const int numITER = 100; 

    const int ILP = 16; 

    TimingGPU timerGPU; 

    int *h_a = (int *)malloc(N * sizeof(int)); 
    int *h_b = (int *)malloc(N * sizeof(int)); 

    for (int i = 0; i<N; i++) { 
     h_a[i] = 2; 
     h_b[i] = 1; 
    } 

    int *d_a; gpuErrchk(cudaMalloc(&d_a, N * sizeof(int))); 
    int *d_b; gpuErrchk(cudaMalloc(&d_b, N * sizeof(int))); 

    gpuErrchk(cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice)); 
    gpuErrchk(cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice)); 

    /**************/ 
    /* ILP KERNEL */ 
    /**************/ 
    float timeTotal = 0.f; 
    for (int k = 0; k < numITER; k++) { 
     timerGPU.StartCounter(); 
     ILPKernel << <iDivUp(N/ILP, BLOCKSIZE), BLOCKSIZE >> >(d_a, d_b, ILP, N); 
#ifdef DEBUG 
     gpuErrchk(cudaPeekAtLastError()); 
     gpuErrchk(cudaDeviceSynchronize()); 
#endif 
     timeTotal = timeTotal + timerGPU.GetCounter(); 
    } 

    printf("Bandwidth = %f GB/s; Num blocks = %d\n", (2.f * 4.f * N * numITER)/(1e6 * timeTotal), iDivUp(N/ILP, BLOCKSIZE)); 
    gpuErrchk(cudaMemcpy(h_b, d_b, N * sizeof(int), cudaMemcpyDeviceToHost)); 
    for (int i = 0; i < N; i++) if (h_a[i] != h_b[i]) { printf("Error at i = %i for kernel0! Host = %i; Device = %i\n", i, h_a[i], h_b[i]); return 1; } 

    return 0; 

} 

PERFORMANCE

GT 920M 
N = 512 - ILP = 1 - BLOCKSIZE = 512 (1 block - each block processes 512 elements) - Bandwidth = 0.092 GB/s 

N = 1024 - ILP = 1 - BLOCKSIZE = 512 (2 blocks - each block processes 512 elements) - Bandwidth = 0.15 GB/s 

N = 2048 - ILP = 1 - BLOCKSIZE = 512 (4 blocks - each block processes 512 elements) - Bandwidth = 0.37 GB/s 
N = 2048 - ILP = 2 - BLOCKSIZE = 256 (4 blocks - each block processes 512 elements) - Bandwidth = 0.36 GB/s 
N = 2048 - ILP = 4 - BLOCKSIZE = 128 (4 blocks - each block processes 512 elements) - Bandwidth = 0.35 GB/s 
N = 2048 - ILP = 8 - BLOCKSIZE = 64 (4 blocks - each block processes 512 elements) - Bandwidth = 0.26 GB/s 
N = 2048 - ILP = 16 - BLOCKSIZE = 32 (4 blocks - each block processes 512 elements) - Bandwidth = 0.31 GB/s 

N = 4096 - ILP = 1 - BLOCKSIZE = 512 (8 blocks - each block processes 512 elements) - Bandwidth = 0.53 GB/s 
N = 4096 - ILP = 2 - BLOCKSIZE = 256 (8 blocks - each block processes 512 elements) - Bandwidth = 0.61 GB/s 
N = 4096 - ILP = 4 - BLOCKSIZE = 128 (8 blocks - each block processes 512 elements) - Bandwidth = 0.74 GB/s 
N = 4096 - ILP = 8 - BLOCKSIZE = 64 (8 blocks - each block processes 512 elements) - Bandwidth = 0.74 GB/s 
N = 4096 - ILP = 16 - BLOCKSIZE = 32 (8 blocks - each block processes 512 elements) - Bandwidth = 0.56 GB/s 

N = 8192 - ILP = 1 - BLOCKSIZE = 512 (16 blocks - each block processes 512 elements) - Bandwidth = 1.4 GB/s 
N = 8192 - ILP = 2 - BLOCKSIZE = 256 (16 blocks - each block processes 512 elements) - Bandwidth = 1.1 GB/s 
N = 8192 - ILP = 4 - BLOCKSIZE = 128 (16 blocks - each block processes 512 elements) - Bandwidth = 1.5 GB/s 
N = 8192 - ILP = 8 - BLOCKSIZE = 64 (16 blocks - each block processes 512 elements) - Bandwidth = 1.4 GB/s 
N = 8192 - ILP = 16 - BLOCKSIZE = 32 (16 blocks - each block processes 512 elements) - Bandwidth = 1.3 GB/s 

... 

N = 16777216 - ILP = 1 - BLOCKSIZE = 512 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.9 GB/s 
N = 16777216 - ILP = 2 - BLOCKSIZE = 256 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.8 GB/s 
N = 16777216 - ILP = 4 - BLOCKSIZE = 128 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.8 GB/s 
N = 16777216 - ILP = 8 - BLOCKSIZE = 64 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.7 GB/s 
N = 16777216 - ILP = 16 - BLOCKSIZE = 32 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.6 GB/s