AVX scalar operations are faster

I tested this simple function

void mul(double *a, double *b) { 
    for (int i = 0; i<N; i++) a[i] *= b[i]; 
}

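with very large arrays, so that it is memory-bandwidth bound. The test code I used is below. When I compile with -O2 it takes 1.7 seconds. When I compile with -O2 -mavx it takes only 1.0 seconds. The non-VEX-encoded scalar operations are 70% slower! Why is this?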
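As a rough check on these figures, the arithmetic can be sketched as follows. The helper names `gib_moved` and `slowdown_percent` are made up for illustration, and the factor of 3 assumes each pass reads a, reads b and writes a, which is the same assumption the test program makes for its `mem` constant.

```c
#include <stddef.h>

/* Hypothetical helper: GiB moved by `reps` passes of a[i] *= b[i] over
   n doubles, assuming each pass reads a, reads b and writes a (3 streams). */
static double gib_moved(size_t n, size_t reps) {
    return 3.0 * sizeof(double) * (double)n * (double)reps
           / (1024.0 * 1024.0 * 1024.0);
}

/* Hypothetical helper: how much slower t_slow is than t_fast, in percent. */
static double slowdown_percent(double t_slow, double t_fast) {
    return 100.0 * (t_slow / t_fast - 1.0);
}
```

With N = 1000000 and R = 1000, `gib_moved` gives about 22.35 GiB, so 1.7 s corresponds to roughly 13 GB/s and 1.0 s to roughly 22 GB/s, and `slowdown_percent(1.7, 1.0)` comes out at 70.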

Here is the assembly for -O2 and -O2 -mavx.

https://godbolt.org/g/w4p60f

System: [email protected] (Skylake microarchitecture), 32 GB of memory, Ubuntu 16.10, GCC 6.3

Test code

//gcc -O2 -fopenmp test.c 
//or 
//gcc -O2 -mavx -fopenmp test.c 
#include <string.h> 
#include <stdio.h> 
#include <x86intrin.h> 
#include <omp.h> 

#define N 1000000 
#define R 1000 

void mul(double *a, double *b) { 
    for (int i = 0; i<N; i++) a[i] *= b[i]; 
} 

int main() { 
    double *a = (double*)_mm_malloc(sizeof *a * N, 32); 
    double *b = (double*)_mm_malloc(sizeof *b * N, 32); 

    //b must be initialized to get the correct bandwidth!!! 
    memset(a, 1, sizeof *a * N); 
    memset(b, 1, sizeof *b * N); 

    double dtime; 
    const double mem = 3.0*sizeof(double)*N*R/1024/1024/1024; // 3.0 forces floating-point division (no truncation)
    const double maxbw = 34.1; 
    dtime = -omp_get_wtime(); 
    for(int i=0; i<R; i++) mul(a,b); 
    dtime += omp_get_wtime(); 
    printf("time %.2f s, %.1f GB/s, efficiency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw); 

    _mm_free(a), _mm_free(b); 
}

Source

2017-04-06 Z boson

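FWIW, on a lowly 2.6 GHz mobile Haswell CPU I get around 0.8 s with either build. –

@PaulR, thanks for checking. I can test on my Haswell system later. I have gotten strange results on my Skylake system that I did not get on Haswell, so I would not be surprised. –

@PaulR, I just figured it out! '__asm__ __volatile__ ("vzeroupper" : : :);' after calling 'omp_get_wtime()' fixes it. –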

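The problem is related to the upper half of the AVX registers being dirty after calling omp_get_wtime(). This is a problem particularly on Skylake processors.

I first read about this problem here. Since then other people have observed it: here and here. Using gdb I found that omp_get_wtime() calls clock_gettime. I rewrote my code to use clock_gettime() and I see the same problem.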

void fix_avx() { __asm__ __volatile__ ("vzeroupper" : : :); } 
void fix_sse() { } 
void (*fix)(); 

double get_wtime() { 
    struct timespec time; 
    clock_gettime(CLOCK_MONOTONIC, &time); 
    #ifndef __AVX__ 
    fix(); 
    #endif 
    return time.tv_sec + 1E-9*time.tv_nsec; 
} 

void dispatch() { 
    fix = fix_sse; 
    #if defined(__INTEL_COMPILER) 
    if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx; 
    #else 
    #if defined(__GNUC__) && !defined(__clang__) 
    __builtin_cpu_init(); 
    #endif 
    if(__builtin_cpu_supports("avx")) fix = fix_avx; 
    #endif 
}

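Stepping through the code with gdb, I see that the first time clock_gettime is called it calls _dl_runtime_resolve_avx(). Based on this comment, I believe the problem is in that function. It appears to be called only the first time clock_gettime is called.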

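With GCC the problem goes away when __asm__ __volatile__ ("vzeroupper" : : :); is used after the first call to clock_gettime. With Clang, however (using clang -O2 -fno-vectorize, since Clang vectorizes even at -O2), it only goes away when it is used after every call to clock_gettime.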
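The GCC case can be sketched as a one-time warm-up: make the first clock_gettime call before the timed region, then clear the upper register halves once. This is only a minimal sketch of that idea. The names `clear_upper_ymm` and `warm_up_clock` are hypothetical, the x86 guard and the `_POSIX_C_SOURCE` define are additions for portability, and as noted above this was not sufficient with Clang.

```c
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std= modes */
#include <time.h>

/* Hypothetical helper: executes vzeroupper only when the CPU reports AVX. */
static void clear_upper_ymm(void) {
#if defined(__x86_64__) || defined(__i386__)
#if defined(__GNUC__) && !defined(__clang__)
    __builtin_cpu_init();
#endif
    if (__builtin_cpu_supports("avx"))
        __asm__ __volatile__ ("vzeroupper" : : :);
#endif
}

/* Hypothetical helper: performs the first clock_gettime call up front, so the
   one-time lazy symbol resolution happens here, then clears the upper halves.
   Returns the clock_gettime result (0 on success). */
static int warm_up_clock(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    clear_upper_ymm();
    return rc;
}
```

Calling `warm_up_clock()` once at the top of main(), before the first timestamp is taken, corresponds to the "call once to fix GCC" experiment.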

Here is the code I used to test this (with GCC 6.3 and Clang 3.8)

#include <string.h> 
#include <stdio.h> 
#include <x86intrin.h> 
#include <time.h> 

void fix_avx() { __asm__ __volatile__ ("vzeroupper" : : :); } 
void fix_sse() { } 
void (*fix)(); 

double get_wtime() { 
    struct timespec time; 
    clock_gettime(CLOCK_MONOTONIC, &time); 
    #ifndef __AVX__ 
    fix(); 
    #endif 
    return time.tv_sec + 1E-9*time.tv_nsec; 
} 

void dispatch() { 
    fix = fix_sse; 
    #if defined(__INTEL_COMPILER) 
    if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx; 
    #else 
    #if defined(__GNUC__) && !defined(__clang__) 
    __builtin_cpu_init(); 
    #endif 
    if(__builtin_cpu_supports("avx")) fix = fix_avx; 
    #endif 
} 

#define N 1000000 
#define R 1000 

void mul(double *a, double *b) { 
    for (int i = 0; i<N; i++) a[i] *= b[i]; 
} 

int main() { 
    dispatch(); 
    const double mem = 3.0*sizeof(double)*N*R/1024/1024/1024; // 3.0 forces floating-point division (no truncation)
    const double maxbw = 34.1; 

    double *a = (double*)_mm_malloc(sizeof *a * N, 32); 
    double *b = (double*)_mm_malloc(sizeof *b * N, 32); 

    //b must be initialized to get the correct bandwidth!!! 
    memset(a, 1, sizeof *a * N); 
    memset(b, 1, sizeof *b * N); 

    double dtime; 
    //dtime = get_wtime(); // call once to fix GCC 
    //printf("%f\n", dtime); 
    //fix = fix_sse; 

    dtime = -get_wtime(); 
    for(int i=0; i<R; i++) mul(a,b); 
    dtime += get_wtime(); 
    printf("time %.2f s, %.1f GB/s, efficiency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw); 

    _mm_free(a), _mm_free(b); 
}

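If I turn off lazy function call resolution with -z now (e.g. clang -O2 -fno-vectorize -z now foo.c), then Clang only needs __asm__ __volatile__ ("vzeroupper" : : :); after the first call to clock_gettime, just like GCC.

I expected that with -z now I would only need __asm__ __volatile__ ("vzeroupper" : : :); right after main(), but I still need it after the first call to clock_gettime.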
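When comparing runs with and without lazy binding, it may help to record which mode a run used. Besides the -z now link option, the glibc dynamic linker also resolves all symbols at startup when the LD_BIND_NOW environment variable is set to a non-empty string. The sketch below only inspects that variable. The name `bind_now_requested` is hypothetical, and the function cannot detect a binary that was linked with -z now.

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 if LD_BIND_NOW is set to a non-empty string,
   i.e. eager binding was requested through the environment, and 0 otherwise.
   It says nothing about whether the binary itself was linked with -z now. */
static int bind_now_requested(void) {
    const char *v = getenv("LD_BIND_NOW");
    return v != NULL && v[0] != '\0';
}
```

Printing this value next to the timing line makes it easier to tell the two kinds of run apart afterwards.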

Source

2017-04-07 11:59:25

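Nice code! From [this gcc webpage](https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html), it is not clear to me whether '__builtin_cpu_init(void)' has to be called before calling '__builtin_cpu_supports("avx")' or not. Have you tested your code on an old non-AVX CPU? – wim

@wim, 'dispatch' should not have been commented out. That was left over from when I was testing that GCC only needs 'vzeroupper' called once rather than after every call. I did not know about '__builtin_cpu_init'. It worked without it (though I do not have a system without AVX to test on). I added it to my answer just to be safe. –

'_dl_runtime_resolve_avx' is only called on the first call of some function from a different shared library file. Try disabling lazy binding (http://man7.org/linux/man-pages/man1/ld.1.html - "lazy .. tell the dynamic linker to defer function call resolution to the point when the function is called (lazy binding), rather than at load time. Lazy binding is the default.") with 'export LD_BIND_NOW=1' (http://man7.org/linux/man-pages/man8/ld.so.8.html - "resolve all symbols at program startup instead of deferring") to prevent '_dl_runtime_resolve_avx' from being called at run time. – osgx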
