AVX2 barycentric interpolation of vertex components

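I am just starting down the path of using SIMD intrinsics. My profiler shows that a significant amount of time is spent in vertex interpolation. I am targeting AVX2 and am trying to find an optimization for the following. Given that I have three Vector2s that need interpolating, I figure I should be able to load them into a single __m256 and do the multiply and add efficiently. Here is the code I am trying to convert. Is it even worth doing as a 256-bit operation? The vectors are unaligned.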

Vector2 Interpolate(Vector3 uvw, Vector2 v0, Vector2 v1, Vector2 v2) 
{ 
    Vector2 out; 
    out = v0 * uvw.x; 
    out += v1 * uvw.y; 
    out += v2 * uvw.z; 

    return out; 
} 

struct Vector2 { float x; float y; } ; 
struct Vector3 { float x; float y; float z; } ;

My question is this: how do I load three unaligned Vector2s into a single 256-bit register so that I can do the multiply and add?

I am using VS2013.

Source

2017-01-22 Steven

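Too much annoying data movement would be necessary. If you pass the vertices (all of the ones that need interpolating, not just 3) and the scale factors as arrays, you can actually write reasonable code – harold

@harold How many elements at a time would make it worthwhile? 16 sets? 256 sets? – Steven

'struct Vector2_block {float8 x; float8 y; };' and 'struct Vector3_block {float8 x; float8 y; float8 z; };', then you operate on 8 vertices at a time. –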

I was bored, so I wrote this. It is not tested (but it compiles, and both Clang and GCC generate reasonable code for it).

#include <immintrin.h> // AVX and FMA intrinsics

void interpolateAll(int n, float* scales, float* vin, float* vout)
{
    // preconditions:
    // (n & 7) == 0 (not really, but vout must be padded)
    // scales is 32-byte aligned
    // vin is 32-byte aligned
    // vout is 32-byte aligned

    // vin format, per block of 8 vertices:
    // float v0x[8]
    // float v0y[8]
    // float v1x[8]
    // float v1y[8]
    // float v2x[8]
    // float v2y[8]
    // scales format, per block:
    // float scale0[8]
    // float scale1[8]
    // float scale2[8]
    // vout format, per block:
    // float vx[8]
    // float vy[8]

    for (int i = 0; i < n; i += 8) {
        // 9 aligned loads per block: 3 scale vectors and 6 component vectors
        __m256 scale_0 = _mm256_load_ps(scales + i * 3);
        __m256 scale_1 = _mm256_load_ps(scales + i * 3 + 8);
        __m256 scale_2 = _mm256_load_ps(scales + i * 3 + 16);
        __m256 v0x = _mm256_load_ps(vin + i * 6);
        __m256 v0y = _mm256_load_ps(vin + i * 6 + 8);
        __m256 v1x = _mm256_load_ps(vin + i * 6 + 16);
        __m256 v1y = _mm256_load_ps(vin + i * 6 + 24);
        __m256 v2x = _mm256_load_ps(vin + i * 6 + 32);
        __m256 v2y = _mm256_load_ps(vin + i * 6 + 40);
        // out = v0 * scale0 + v1 * scale1 + v2 * scale2, for 8 vertices at once
        __m256 x = _mm256_mul_ps(scale_0, v0x);
        __m256 y = _mm256_mul_ps(scale_0, v0y);
        x = _mm256_fmadd_ps(scale_1, v1x, x);
        y = _mm256_fmadd_ps(scale_1, v1y, y);
        x = _mm256_fmadd_ps(scale_2, v2x, x);
        y = _mm256_fmadd_ps(scale_2, v2y, y);
        _mm256_store_ps(vout + i * 2, x);
        _mm256_store_ps(vout + i * 2 + 8, y);
    }
}
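
Since the intrinsics version is untested, a plain scalar routine over the same blocked layout may be handy for comparing results. The following is only a sketch: the name interpolateAllScalar is made up for illustration, and it assumes n is a whole number of 8-vertex blocks.

```cpp
#include <cassert>

// Hypothetical scalar reference for the blocked layout (not part of the answer).
// Per block of 8 vertices:
//   scales: scale0[8], scale1[8], scale2[8]                  (24 floats)
//   vin:    v0x[8], v0y[8], v1x[8], v1y[8], v2x[8], v2y[8]   (48 floats)
//   vout:   vx[8], vy[8]                                     (16 floats)
void interpolateAllScalar(int n, const float* scales, const float* vin, float* vout)
{
    assert(n % 8 == 0); // assumption: whole blocks only

    for (int i = 0; i < n; i += 8) {
        const float* s = scales + i * 3; // start of this block's scales
        const float* v = vin + i * 6;    // start of this block's vertices
        float* o = vout + i * 2;         // start of this block's output

        for (int lane = 0; lane < 8; ++lane) {
            const float s0 = s[lane];
            const float s1 = s[8 + lane];
            const float s2 = s[16 + lane];
            o[lane]     = s0 * v[lane]     + s1 * v[16 + lane] + s2 * v[32 + lane];
            o[8 + lane] = s0 * v[8 + lane] + s1 * v[24 + lane] + s2 * v[40 + lane];
        }
    }
}
```

When comparing against the FMA version, allow a small tolerance: a fused multiply-add rounds once where a separate multiply and add round twice, so the last bits can differ.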

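This uses Z boson's format, if I understood it correctly. In any case, it is a good format from a SIMD perspective. From a C++ perspective it is a bit inconvenient.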
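
To give an idea of what that inconvenience looks like, here is a rough sketch of packing per-call inputs into the blocked arrays. The InterpInput type and the packBlocks name are hypothetical, invented for this illustration. Note that std::vector<float> does not promise 32-byte alignment, so with this storage you would need an aligned allocator or unaligned loads (_mm256_loadu_ps) instead of _mm256_load_ps.

```cpp
#include <cstddef>
#include <vector>

struct Vector2 { float x; float y; };
struct Vector3 { float x; float y; float z; };

// Hypothetical helper type: the arguments of one Interpolate() call.
struct InterpInput { Vector3 uvw; Vector2 v0; Vector2 v1; Vector2 v2; };

// Packs per-call inputs into blocks of 8 lanes:
//   scales block: scale0[8], scale1[8], scale2[8]
//   vin block:    v0x[8], v0y[8], v1x[8], v1y[8], v2x[8], v2y[8]
// The last block is zero-padded, so padding lanes interpolate to 0.
void packBlocks(const std::vector<InterpInput>& in,
                std::vector<float>& scales,
                std::vector<float>& vin)
{
    const std::size_t blocks = (in.size() + 7) / 8; // round up to whole blocks
    scales.assign(blocks * 24, 0.0f);
    vin.assign(blocks * 48, 0.0f);

    for (std::size_t i = 0; i < in.size(); ++i) {
        float* s = scales.data() + (i / 8) * 24;
        float* v = vin.data() + (i / 8) * 48;
        const std::size_t lane = i % 8;

        s[lane]      = in[i].uvw.x;
        s[8 + lane]  = in[i].uvw.y;
        s[16 + lane] = in[i].uvw.z;

        v[lane]      = in[i].v0.x;
        v[8 + lane]  = in[i].v0.y;
        v[16 + lane] = in[i].v1.x;
        v[24 + lane] = in[i].v1.y;
        v[32 + lane] = in[i].v2.x;
        v[40 + lane] = in[i].v2.y;
    }
}
```

Packing like this costs a pass over the data by itself, so it probably only pays off if the producer can write straight into the blocked layout or the packed data is reused.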

The FMAs do serialize the multiplications unnecessarily, but that should not matter, since they are not part of a loop-carried dependency.

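The predicted throughput (assuming a small enough array) is 2 iterations per 9 cycles, bottlenecked by the loads. In practice it may be slightly worse; there has been some talk of simple stores occasionally stealing p2 or p3, but I am not too sure about that. Either way, that is enough time for 18 "FMAs", but there are only 12 (8 FMAs and 4 mulps), so if there is any extra computation, it may be useful to move it in here.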
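
The arithmetic behind those numbers can be spelled out as below. This is a back-of-the-envelope sketch that assumes a Haswell-like core with two load ports (p2 and p3) and two FMA-capable ports; that machine model is my reading of the estimate, not something measured.

```cpp
// Assumed machine model (Haswell-like); these port counts are assumptions.
constexpr int loadPorts = 2; // p2 and p3: up to 2 loads per cycle
constexpr int fmaPorts  = 2; // up to 2 FMA or mulps per cycle

// Per loop iteration of interpolateAll:
constexpr int loadsPerIter = 3 + 6; // 3 scale vectors + 6 component vectors
constexpr int mathPerIter  = 2 + 4; // 2 mulps + 4 FMAs

// Look at two iterations so the cycle count comes out whole.
constexpr int iterations = 2;
constexpr int cycles   = iterations * loadsPerIter / loadPorts; // load-bound cycles
constexpr int fmaSlots = cycles * fmaPorts;                     // FMA issue slots available
constexpr int mathOps  = iterations * mathPerIter;              // slots actually used

static_assert(cycles == 9, "2 iterations take 9 cycles when load-bound");
static_assert(fmaSlots == 18, "room for 18 FMA-class operations in 9 cycles");
static_assert(mathOps == 12, "only 12 are used (8 FMAs and 4 mulps)");
```

Under these assumptions, 6 of the 18 arithmetic slots sit idle every 9 cycles, which is the slack available for extra computation.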

Source

2017-01-27 19:05:22 harold

Thanks. I will give it a try once I can reconfigure my source data as streams. Right now the ray tracer is really set up around "per-ray" results. – Steven

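This is basically what I ended up doing: I rewrote the code to generate 8 rays at a time and then made everything else match. It works for the most part, but there are some expensive operations where I gather hit objects that have different data and have to break the batch back down, but that is another question. Thanks again. – Steven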
