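Using AVX intrinsics instead of SSE does not improve speed - why?
I have been using Intel's SSE intrinsics for quite some time and have seen good performance gains from them. I therefore expected the AVX intrinsics to speed up my programs even further. Unfortunately, that has not been the case so far. I am probably making a silly mistake, so I would be very grateful if somebody could help me out.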
I use Ubuntu 11.10 with g++ 4.6.1. I compiled my program (see below) with
g++ simpleExample.cpp -O3 -march=native -o simpleExample
The test system has an Intel i7-2600 CPU.
Here is the code that illustrates my problem. On my system, I get the output
98.715 ms, b[42] = 0.900038 // Naive
24.457 ms, b[42] = 0.900038 // SSE
24.646 ms, b[42] = 0.900038 // AVX
Note that the calculation sqrt(sqrt(sqrt(x))) was chosen only to ensure that memory bandwidth does not limit execution speed; it is just an example.
simpleExample.cpp:
#include <immintrin.h>
#include <iostream>
#include <math.h>
#include <stdlib.h> // srand48, drand48
#include <sys/time.h>
using namespace std;
// -----------------------------------------------------------------------------
// This function returns the current time, expressed as seconds since the Epoch
// -----------------------------------------------------------------------------
double getCurrentTime(){
    struct timeval curr;
    struct timezone tz;
    gettimeofday(&curr, &tz);
    double tmp = static_cast<double>(curr.tv_sec) * static_cast<double>(1000000)
               + static_cast<double>(curr.tv_usec);
    return tmp*1e-6;
}
// -----------------------------------------------------------------------------
// Main routine
// -----------------------------------------------------------------------------
int main() {
    srand48(0);            // seed PRNG
    double e,s;            // timestamp variables
    float *a, *b;          // data pointers
    float *pA,*pB;         // work pointers
    __m128 rA,rB;          // variables for SSE
    __m256 rA_AVX, rB_AVX; // variables for AVX
    // define vector size
    const int vector_size = 10000000;
    // allocate memory (32-byte aligned, as required by the aligned AVX loads/stores)
    a = (float*) _mm_malloc (vector_size*sizeof(float),32);
    b = (float*) _mm_malloc (vector_size*sizeof(float),32);
    // initialize vectors //
    for(int i=0;i<vector_size;i++) {
        a[i]=fabs(drand48());
        b[i]=0.0f;
    }
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    // Naive implementation
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    s = getCurrentTime();
    for (int i=0; i<vector_size; i++){
        b[i] = sqrtf(sqrtf(sqrtf(a[i])));
    }
    e = getCurrentTime();
    cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;
    // -----------------------------------------------------------------------------
    for(int i=0;i<vector_size;i++) {
        b[i]=0.0f;
    }
    // -----------------------------------------------------------------------------
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    // SSE implementation (4 floats per iteration)
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    pA = a; pB = b;
    s = getCurrentTime();
    for (int i=0; i<vector_size; i+=4){
        rA = _mm_load_ps(pA);
        rB = _mm_sqrt_ps(_mm_sqrt_ps(_mm_sqrt_ps(rA)));
        _mm_store_ps(pB,rB);
        pA += 4;
        pB += 4;
    }
    e = getCurrentTime();
    cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;
    // -----------------------------------------------------------------------------
    for(int i=0;i<vector_size;i++) {
        b[i]=0.0f;
    }
    // -----------------------------------------------------------------------------
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    // AVX implementation (8 floats per iteration)
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    pA = a; pB = b;
    s = getCurrentTime();
    for (int i=0; i<vector_size; i+=8){
        rA_AVX = _mm256_load_ps(pA);
        rB_AVX = _mm256_sqrt_ps(_mm256_sqrt_ps(_mm256_sqrt_ps(rA_AVX)));
        _mm256_store_ps(pB,rB_AVX);
        pA += 8;
        pB += 8;
    }
    e = getCurrentTime();
    cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;
    _mm_free(a);
    _mm_free(b);
    return 0;
}
Any help is appreciated!
I did not know that AVX was ever emulated - do you have a reference for that? And which CPUs in particular would this apply to?
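On Sandy Bridge, according to the [instruction tables](http://www.agner.org/optimize/instruction_tables.pdf), pp. 87-88, it seems that VDIVPS/PD executes as 2 micro-ops on port 0, whereas DIVPS/PD needs only 1 micro-op. The SQRT instructions will be similar. Since the division unit is not pipelined, the execution time doubles. This suggests that Sandy Bridge actually has only a 128-bit-wide division execution unit.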
@Norbert: Thanks for the clarification - I did not know that.