陣列與CUDA中結構陣列的結構

從我在這裏閱讀的一些評論中，出於某種原因，對於像CUDA這樣的並行實現，最好有Structure of Arrays（SoA）超過Array of Structures（AoS）？如果那是真的，任何人都可以解釋爲什麼？在此先感謝！ AOS相對的SOA以獲得最佳性能陣列與CUDA中結構陣列的結構

來源

2013-07-29 BugShotGG

我已經發佈陣列的結構和結構的陣列之間的性能的比較作爲回答這個訊息：[在CUDA中對結構數組進行排序]（http://stackoverflow.com/questions/23541503/sorting-arrays-of-structures-in-cuda/23645954#23645954）。 – JackOLantern

選擇通常取決於訪問模式。然而，這不僅限於CUDA-類似的考慮適用於性能可受存儲器訪問模式顯着影響的任何架構，例如，您擁有緩存的地方，或者連續訪問內存時性能更好的地方（例如CUDA中的合併內存訪問）。

E.g.爲RGB像素與單獨的RGB平面：

struct { 
    uint8_t r, g, b; 
} AoS[N]; 

struct { 
    uint8_t r[N]; 
    uint8_t g[N]; 
    uint8_t b[N]; 
} SoA;

如果將要訪問的每個像素的R/G/B成分同時然後AOS通常是有道理的，因爲連續讀取R，G，B成分的將是連續的並且通常包含在相同的緩存行中。對於CUDA來說，這也意味着內存讀取/寫入合併。

但是，如果你要處理的顏色平面分別然後SoA的可能是首選，例如如果你想通過一些比例因子來縮放所有的R值，那麼SoA意味着所有的R分量都是連續的。

一個進一步的考慮是填充和/對齊。對於上面的RGB示例，AoS佈局中的每個元素都對齊到3個字節的倍數，這對CUDA，SIMD等可能不方便 - 在某些情況下，甚至可能需要在結構中填充以使對齊更方便（例如添加一個虛擬uint8_t元素以確保4個字節的對齊）。然而，在SoA情況下，這些平面是字節對齊的，對於某些算法/體系結構可以更方便。

對於大多數圖像處理類型的應用程序，AoS方案更爲常見，但對於其他應用程序或特定圖像處理任務，情況可能並非總是如此。當沒有明顯的選擇時，我會推薦AoS作爲默認選擇。

也this answer見AOS v的SOA更廣泛的討論。

來源

2013-07-29 12:59:57

感謝您的回覆！嗯，所以它達到了我想象的實現訪問模式。因此，如果我們有一個線程訪問AoS中的10個連續元素並假設我們有100個總元素，則SoA（每個都有這10個元素）是比讓每個線程從AoS訪問10個連續元素更好的選擇，看法？ – BugShotGG

對於RGB情況，如果每個線程一起訪問R，G，B組件，那麼AoS可能更好，而如果每個線程只訪問一個顏色平面，則SoA可能會更好。但是，您需要查看所有線程的總訪問模式，因爲外部內存訪問通常是瓶頸。 –

SOA是SIMD處理effectly好。對於幾個原因，但基本上它更有效地加載連續4個浮點寄存器中。喜歡的東西：

float v [4] = {0}; 
__m128 reg = _mm_load_ps(v);

比使用：

struct vec { float x; float, y; ....} ; 
vec v = {0, 0, 0, 0};

，並通過訪問所有成員創建__m128數據：

__m128 reg = _mm_set_ps(v.x, ....);

，如果你的陣列是16字節對齊的數據加載/存儲速度更快，有些操作可以直接在內存中執行。

來源

2013-07-29 13:12:45 alexbuisson

你可以通過也許可以和「將它變成一個數組」： __m128 reg = _mm_load_ps（&v.x); 這將假定是16字節對齊，在成員之間0字節 – pilkch

我只想提供一個簡單的例子，顯示一個結構數組（SoA）如何比結構數組（AoS）更好地執行。

在這個例子中，我考慮三種不同版本的同一代碼：

SoA的（V1）
直陣列（V2）
AOS（V3）

特別是，版本2考慮使用直陣列。版本2和3的時序對於該示例是相同的，並且結果比版本1更好。我懷疑，一般來說，直接數組可能更可取，儘管以可讀性爲代價，因爲例如可以通過const __restrict__爲這種情況啓用來自統一緩存的加載。

#include "cuda_runtime.h" 
#include "device_launch_parameters.h" 

#include <stdio.h> 

#include <thrust\device_vector.h> 

#include "Utilities.cuh" 
#include "TimingGPU.cuh" 

#define BLOCKSIZE 1024 

/******************************************/ 
/* CELL STRUCT LEADING TO ARRAY OF STRUCT */ 
/******************************************/ 
struct cellAoS { 

    unsigned int x1; 
    unsigned int x2; 
    unsigned int code; 
    bool   done; 

}; 

/*******************************************/ 
/* CELL STRUCT LEADING TO STRUCT OF ARRAYS */ 
/*******************************************/ 
struct cellSoA { 

    unsigned int *x1; 
    unsigned int *x2; 
    unsigned int *code; 
    bool   *done; 

}; 


/*******************************************/ 
/* KERNEL MANIPULATING THE ARRAY OF STRUCT */ 
/*******************************************/ 
__global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) { 

    const int tid = threadIdx.x + blockIdx.x * blockDim.x; 

    if (tid < N) { 
     cellAoS tempCell = d_cells[tid]; 

     tempCell.x1 = tempCell.x1 + 10; 
     tempCell.x2 = tempCell.x2 + 10; 

     d_cells[tid] = tempCell; 
    } 

} 

/******************************/ 
/* KERNEL MANIPULATING ARRAYS */ 
/******************************/ 
__global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) { 

    const int tid = threadIdx.x + blockIdx.x * blockDim.x; 

    if (tid < N) { 

     d_x1[tid] = d_x1[tid] + 10; 
     d_x2[tid] = d_x2[tid] + 10; 

    } 

} 

/********************************************/ 
/* KERNEL MANIPULATING THE STRUCT OF ARRAYS */ 
/********************************************/ 
__global__ void AoSvsSoA_v3(cellSoA cell, const int N) { 

    const int tid = threadIdx.x + blockIdx.x * blockDim.x; 

    if (tid < N) { 

     cell.x1[tid] = cell.x1[tid] + 10; 
     cell.x2[tid] = cell.x2[tid] + 10; 

    } 

} 

/********/ 
/* MAIN */ 
/********/ 
int main() { 

    const int N = 2048 * 2048 * 4; 

    TimingGPU timerGPU; 

    thrust::host_vector<cellAoS> h_cells(N); 
    thrust::device_vector<cellAoS> d_cells(N); 

    thrust::host_vector<unsigned int> h_x1(N); 
    thrust::host_vector<unsigned int> h_x2(N); 

    thrust::device_vector<unsigned int> d_x1(N); 
    thrust::device_vector<unsigned int> d_x2(N); 

    for (int k = 0; k < N; k++) { 

     h_cells[k].x1 = k + 1; 
     h_cells[k].x2 = k + 2; 
     h_cells[k].code = k + 3; 
     h_cells[k].done = true; 

     h_x1[k] = k + 1; 
     h_x2[k] = k + 2; 

    } 

    d_cells = h_cells; 

    d_x1 = h_x1; 
    d_x2 = h_x2; 

    cellSoA cell; 
    cell.x1 = thrust::raw_pointer_cast(d_x1.data()); 
    cell.x2 = thrust::raw_pointer_cast(d_x2.data()); 
    cell.code = NULL; 
    cell.done = NULL; 

    timerGPU.StartCounter(); 
    AoSvsSoA_v1 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_cells.data()), N); 
    gpuErrchk(cudaPeekAtLastError()); 
    gpuErrchk(cudaDeviceSynchronize()); 
    printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter()); 

    //timerGPU.StartCounter(); 
    //AoSvsSoA_v2 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N); 
    //gpuErrchk(cudaPeekAtLastError()); 
    //gpuErrchk(cudaDeviceSynchronize()); 
    //printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter()); 

    timerGPU.StartCounter(); 
    AoSvsSoA_v3 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(cell, N); 
    gpuErrchk(cudaPeekAtLastError()); 
    gpuErrchk(cudaDeviceSynchronize()); 
    printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter()); 

    h_cells = d_cells; 

    h_x1 = d_x1; 
    h_x2 = d_x2; 

    // --- Check results 
    for (int k = 0; k < N; k++) { 
     if (h_x1[k] != k + 11) { 
      printf("h_x1[%i] not equal to %i\n", h_x1[k], k + 11); 
      break; 
     } 
     if (h_x2[k] != k + 12) { 
      printf("h_x2[%i] not equal to %i\n", h_x2[k], k + 12); 
      break; 
     } 
     if (h_cells[k].x1 != k + 11) { 
      printf("h_cells[%i].x1 not equal to %i\n", h_cells[k].x1, k + 11); 
      break; 
     } 
     if (h_cells[k].x2 != k + 12) { 
      printf("h_cells[%i].x2 not equal to %i\n", h_cells[k].x2, k + 12); 
      break; 
     } 
    } 

}

以下是定時（在GTX960執行的運行）：

Array of struct  9.1ms (v1 kernel) 
Struct of arrays  3.3ms (v3 kernel) 
Straight arrays  3.2ms (v2 kernel)

來源

2017-10-11 16:59:11 JackOLantern

很好的例子。 – ZeroCool

陣列與CUDA中結構陣列的結構

回答

相關問題