Using SSE and AVX intrinsics to add a set of packed singles into one value

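I have code that I am trying to speed up. First I used the SSE intrinsics and saw significant gains. I am now trying to see whether I can do something similar with the AVX intrinsics. The code essentially takes two arrays, adds or subtracts them as needed, squares the results, and then sums all of those squares together.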

Below is a slightly simplified version of the code using the SSE intrinsics:

float chiList[4] __attribute__((aligned(16))); 
float chi = 0.0; 
__m128 res; 
__m128 nres; 
__m128 del; 
__m128 chiInter2; 
__m128 chiInter; 
while(runNum<boundary) 
{ 
    chiInter = _mm_setzero_ps(); 
    for(int i=0; i<maxPts; i+=4) 
    { 
     //load the first batch of residuals and deltas 
     res = _mm_load_ps(resids+i); 
     del = _mm_load_ps(residDeltas[param]+i); 
     //subtract them 
     nres = _mm_sub_ps(res,del); 
     //store them back into memory 
     _mm_store_ps(resids+i,nres); 
     //square them and add them back to chi with the fused 
     //multiply and add instructions 
     chiInter = _mm_fmadd_ps(nres, nres, chiInter); 
    } 
    //add the 4 intermediate this way because testing 
    //shows it is faster than the commented out way below 
    //so chiInter2 has chiInter reversed 
    chiInter2 = _mm_shuffle_ps(chiInter,chiInter,_MM_SHUFFLE(0,1,2,3)); 
    //add the two 
    _mm_store_ps(chiList,_mm_add_ps(chiInter,chiInter2)); 
    //add again 
    chi=chiList[0]+chiList[1]; 
    //now do stuff with the chi^2 
    //alternatively, the slow way 
    //_mm_store_ps(chiList,chiInter); 
    //chi=chiList[0]+chiList[1]+chiList[2]+chiList[3]; 
}

This brings me to my first question: is there a more elegant way to do the last bit, where I take the 4 floats in chiInter and sum them into a single float?

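Anyway, I am now trying to implement this with the AVX intrinsics. Most of the process is straightforward; unfortunately, I am stalling on the last bit, compressing the 8 intermediate chi values into a single value.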

Below is a similarly simplified piece of code using the AVX intrinsics:

float chiList[8] __attribute__((aligned(32))); 
__m256 res; 
__m256 del; 
__m256 nres; 
__m256 chiInter; 
while(runNum<boundary) 
{ 
    chiInter = _mm256_setzero_ps(); 
    for(int i=0; i<maxPts; i+=8) 
    { 
     //load the first batch of residuals and deltas 
     res = _mm256_load_ps(resids+i); 
     del = _mm256_load_ps(residDeltas[param]+i); 
     //subtract them 
     nres = _mm256_sub_ps(res,del); 
     //store them back into memory 
     _mm256_store_ps(resids+i,nres); 
     //square them and add them back to chi with the fused 
     //multiply and add instructions 
     chiInter = _mm256_fmadd_ps(nres, nres, chiInter); 
    } 
    _mm256_store_ps(chiList,chiInter); 
    chi=chiList[0]+chiList[1]+chiList[2]+chiList[3]+ 
     chiList[4]+chiList[5]+chiList[6]+chiList[7]; 
}

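My second question is this: is there some method, like the one I pulled with SSE above, that lets me complete this final addition more quickly? Or, if there is a better way to do what I did with the SSE intrinsics, does it have an equivalent for the AVX intrinsics?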

Source

2014-03-28 James Matta

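Don't worry too much about the efficiency of the final sum: assuming maxPts is reasonably large, the total time will be dominated by what is inside the for loop, and any prologue/epilogue code will be irrelevant performance-wise.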

@PaulR, unfortunately maxPts is small, usually no more than 32. And yes, despite the small size I see huge gains using SSE versus the naive loop: 144 ns/iteration -> 14 ns/iteration.

See this related link: http://stackoverflow.com/q/9775538/1918193. I am surprised you did not try haddps. Keywords to search for: horizontal add/sum.

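This operation is called a horizontal sum. Say you have a vector v={x0,x1,x2,x3,x4,x5,x6,x7}. First, extract the high/low halves so that you have w1={x0,x1,x2,x3} and w2={x4,x5,x6,x7}. Now call _mm_hadd_ps(w1, w2), which gives tmp1={x0+x1,x2+x3,x4+x5,x6+x7}. Again, _mm_hadd_ps(tmp1,tmp1) gives tmp2={x0+x1+x2+x3,x4+x5+x6+x7,...}. One last time, _mm_hadd_ps(tmp2,tmp2) gives tmp3={x0+x1+x2+x3+x4+x5+x6+x7,...}. You could also replace the first _mm_hadd_ps with a simple _mm_add_ps.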

This is all untested and written from the documentation. No promises about the speed either...
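The steps above can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: it uses the _mm_add_ps variant for the first step, followed by two _mm_hadd_ps calls. The names hsum4_ps, hsum8_ps and hsum8_from_array are hypothetical, and the target("avx") attribute is a GCC/Clang-specific assumption so the file builds without passing -mavx on the command line.

```c
#include <immintrin.h>

/* Hypothetical helper: sum the 4 floats held in an SSE register. */
__attribute__((target("avx")))
static inline float hsum4_ps(__m128 v)
{
    __m128 t = _mm_hadd_ps(v, v);  /* {v0+v1, v2+v3, v0+v1, v2+v3} */
    t = _mm_hadd_ps(t, t);         /* every lane now holds v0+v1+v2+v3 */
    return _mm_cvtss_f32(t);       /* extract lane 0 as a scalar */
}

/* Hypothetical helper: sum the 8 floats held in an AVX register. */
__attribute__((target("avx")))
static inline float hsum8_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);    /* w1 = {x0,x1,x2,x3} */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* w2 = {x4,x5,x6,x7} */
    /* vertical add first, then reduce the remaining 4 lanes */
    return hsum4_ps(_mm_add_ps(lo, hi));
}

/* Hypothetical wrapper: load 8 floats (no alignment required) and sum them. */
__attribute__((target("avx")))
float hsum8_from_array(const float *p)
{
    return hsum8_ps(_mm256_loadu_ps(p));
}
```

In the question's AVX loop, the store to chiList and the eight scalar additions could then be replaced by something like chi = hsum8_ps(chiInter), and the SSE version by chi = hsum4_ps(chiInter). Whether that is actually faster for such short arrays would need measuring.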

Someone on the Intel forum shows another variant (look for HsumAvxFlt).

We can also look at what gcc suggests by compiling this code with gcc test.c -Ofast -mavx2 -S:

float f(float*t){ 
    t=(float*)__builtin_assume_aligned(t,32); 
    float r=0; 
    for(int i=0;i<8;i++) 
    r+=t[i]; 
    return r; 
}

The generated test.s shows what gcc came up with:

vhaddps %ymm0, %ymm0, %ymm0 
vhaddps %ymm0, %ymm0, %ymm1 
vperm2f128 $1, %ymm1, %ymm1, %ymm0 
vaddps %ymm1, %ymm0, %ymm0

I am a bit surprised the last instruction is not vaddss, but I guess it does not matter much.
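For comparison, the four instructions gcc emitted map roughly onto the intrinsics below. This is only a sketch of that sequence: the function name hsum8_gcc_style is hypothetical, an unaligned load is used so no alignment is assumed, and the target("avx") attribute is again a GCC/Clang-specific assumption to avoid needing -mavx.

```c
#include <immintrin.h>

/* Hypothetical function mirroring the vhaddps/vhaddps/vperm2f128/vaddps
   sequence: sum 8 floats starting at t. */
__attribute__((target("avx")))
float hsum8_gcc_style(const float *t)
{
    __m256 v = _mm256_loadu_ps(t);
    /* vhaddps: pairwise sums within each 128-bit half */
    v = _mm256_hadd_ps(v, v);
    /* vhaddps: low half holds x0+x1+x2+x3, high half holds x4+x5+x6+x7 */
    __m256 s = _mm256_hadd_ps(v, v);
    /* vperm2f128 $1: swap the two 128-bit halves */
    __m256 swapped = _mm256_permute2f128_ps(s, s, 1);
    /* vaddps: every lane now holds the full sum */
    __m256 total = _mm256_add_ps(s, swapped);
    return _mm_cvtss_f32(_mm256_castps256_ps128(total));
}
```

Note that _mm256_hadd_ps works independently on each 128-bit half, which is why the extra permute and add are needed to combine the two halves.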

Source

2014-03-28 20:36:36

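Wow, this helps a lot. Thank you very much. I have been writing my improvements through searching and testing, and occasionally getting stuck. Intrinsics saved me because I really did not want to write inline assembly, but I did not know about them until a week ago.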
