Why doesn't GCC auto-vectorize this loop?

I have the following C program (a simplification of my actual use case that exhibits the same behavior):

#include <stdlib.h> 
#include <math.h> 
int main(int argc, char ** argv) { 
    const float * __restrict__ const input = malloc(20000*sizeof(float)); 
    float * __restrict__ const output = malloc(20000*sizeof(float)); 

    unsigned int pos=0; 
    while(1) { 
      unsigned int rest=100; 
      for(unsigned int i=pos;i<pos+rest; i++) { 
        output[i] = input[i] * 0.1; 
      } 

      pos+=rest;    
      if(pos>10000) { 
        break; 
      } 
    } 
}

When I compile with

-O3 -g -Wall -ftree-vectorizer-verbose=5 -msse -msse2 -msse3 -march=native -mtune=native --std=c99 -fPIC -ffast-math

I get the output

main.c:10: note: not vectorized: unhandled data-ref

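where 10 is the line of the inner for loop. When I looked up why it might say this, the explanations suggested that the pointers could be aliased, but they can't be in my code, since I use the __restrict__ keyword. They also suggested adding the -msse flags, but those don't seem to do anything either. Any help?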

2011-02-16 Jeremy Salwen

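What version of gcc? A working example might also be useful, since it vectorizes when I try it with 4.4.5. – ergosys 2011-02-16 23:15:16

Could you post a code example that compiles? When I filled in some dummy values, the loop got vectorized... – Christoph 2011-02-16 23:15:39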

@ergosys: what he said ;) – Christoph 2011-02-16 23:16:00

It certainly looks like a bug. Of the two equivalent functions below, foo() is vectorized but bar() is not when compiling for an x86-64 target:

void foo(const float * restrict input, float * restrict output) 
{ 
    unsigned int pos; 
    for (pos = 0; pos < 10100; pos++) 
     output[pos] = input[pos] * 0.1; 
} 

void bar(const float * restrict input, float * restrict output) 
{ 
    unsigned int pos; 
    unsigned int i; 
    for (pos = 0; pos <= 10000; pos += 100) 
     for (i = 0; i < 100; i++) 
      output[pos + i] = input[pos + i] * 0.1; 
}

Adding the -m32 flag, to compile for an x86 target instead, causes both functions to be vectorized.
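One variation that may be worth trying, offered as a guess rather than a confirmed diagnosis: unsigned int arithmetic wraps around, so on a 64-bit target the compiler has to allow for pos + i wrapping before it is widened into an address, whereas with -m32 the index and the pointer have the same width. The sketch below rewrites bar() with size_t indices so that the index is as wide as a pointer. The name bar_wide is hypothetical, and whether a particular GCC version vectorizes it has to be checked against the vectorizer report.

```c
#include <stddef.h>

/* Same arithmetic as bar(), but with pointer-width (size_t) indices.
   bar_wide is a hypothetical name, not part of the original answer. */
void bar_wide(const float *restrict input, float *restrict output)
{
    size_t pos;
    size_t i;
    for (pos = 0; pos <= 10000; pos += 100)
        for (i = 0; i < 100; i++)
            output[pos + i] = input[pos + i] * 0.1;
}
```

Like bar(), this touches elements 0 through 10099, so both buffers need at least 10100 floats.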

2011-02-17 04:32:02 caf

Try:

const float * __restrict__ input = ...; 
float * __restrict__ output = ...;

I experimented a bit, changing things around:

#include <stdlib.h> 
#include <math.h> 

int main(int argc, char ** argv) { 

    const float * __restrict__ input = new float[20000]; 
    float * __restrict__ output = new float[20000]; 

    unsigned int pos=0; 
    while(1) { 
     unsigned int rest=100; 
     output += pos; 
     input += pos; 
     for(unsigned int i=0;i<rest; ++i) { 
      output[i] = input[i] * 0.1; 
     } 

     pos+=rest; 
     if(pos>10000) { 
      break; 
     } 
    } 
} 

g++ -O3 -g -Wall -ftree-vectorizer-verbose=7 -msse -msse2 -msse3 -c test.cpp 

test.cpp:14: note: versioning for alias required: can't determine dependence between *D.4096_24 and *D.4095_21 
test.cpp:14: note: mark for run-time aliasing test between *D.4096_24 and *D.4095_21 
test.cpp:14: note: Alignment of access forced using versioning. 
test.cpp:14: note: Vectorizing an unaligned access. 
test.cpp:14: note: vect_model_load_cost: unaligned supported by hardware. 
test.cpp:14: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 . 
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 . 
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 1 . 
test.cpp:14: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 . 
test.cpp:14: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 . 
test.cpp:14: note: cost model: Adding cost of checks for loop versioning to treat misalignment. 

test.cpp:14: note: cost model: Adding cost of checks for loop versioning aliasing. 

test.cpp:14: note: Cost model analysis: 
    Vector inside of loop cost: 8 
    Vector outside of loop cost: 6 
    Scalar iteration cost: 5 
    Scalar outside cost: 1 
    prologue iterations: 0 
    epilogue iterations: 0 
    Calculated minimum iters for profitability: 2 

test.cpp:14: note: Profitability threshold = 3 

test.cpp:14: note: Vectorization may not be profitable. 
test.cpp:14: note: create runtime check for data references *D.4096_24 and *D.4095_21 
test.cpp:14: note: created 1 versioning for alias checks. 

test.cpp:14: note: LOOP VECTORIZED. 
test.cpp:4: note: vectorized 1 loops in function. 

Compilation finished at Wed Feb 16 19:17:59
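The "versioning for alias" notes in the log above mean that the compiler keeps two copies of the loop and chooses between them at run time, after checking whether the two memory ranges overlap. A hand-written analogue might look like the rough sketch below. The names scale_versioned and ranges_overlap are hypothetical, and the code GCC actually generates will differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if the n-float ranges starting at a and b share any byte. */
static int ranges_overlap(const float *a, const float *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)(n * sizeof(float));
    return pa < pb + bytes && pb < pa + bytes;
}

/* Hypothetical illustration of loop versioning: one run-time check,
   then either a loop with independent iterations or a fallback loop. */
void scale_versioned(const float *input, float *output, size_t n)
{
    if (!ranges_overlap(input, output, n)) {
        /* No overlap: iterations are independent, so this copy is the
           one a vectorizer could safely turn into SIMD code. */
        const float *restrict in = input;
        float *restrict out = output;
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] * 0.1f;
    } else {
        /* Possible overlap: keep strict element-by-element order. */
        for (size_t i = 0; i < n; i++)
            output[i] = input[i] * 0.1f;
    }
}
```

The check costs a few comparisons per call, which is what the "Adding cost of checks for loop versioning aliasing" line in the cost model accounts for.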

2011-02-16 22:58:46 Anycorn

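GCC seems to be thrown off by the outer loop, which prevents it from analyzing the inner one. I can get it to vectorize if I fold everything into a single loop: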

#include <stdlib.h> 
#include <math.h> 
int main(int argc, char ** argv) { 
    const float * __restrict__ input = malloc(20000*sizeof(float)); 
    float * __restrict__ output = malloc(20000*sizeof(float)); 

    for(unsigned int i=0; i<=10100; i++) { 
      output[i] = input[i] * 0.1f; 
    } 
}

(Note that I didn't think too hard about how to properly translate the limits on pos+rest into a single for-loop condition; it may be wrong.)

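You may be able to take advantage of this by putting a simplified inner loop into a function that you call with pointers and a count. Even if it is inlined again, it may still work fine. This assumes that your real while() loop contains parts that I have just simplified away but that you need to keep.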
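The suggestion above could be sketched as follows. This is only an illustration: scale_block and process_all are hypothetical names, and whether GCC vectorizes the helper after inlining depends on the compiler version and flags, so the vectorizer report still needs to be checked.

```c
/* Hypothetical helper: scales `count` floats from `in` into `out`.
   The loop starts at 0 and has a simple bound, like the folded version. */
static void scale_block(const float *restrict in, float *restrict out,
                        unsigned int count)
{
    for (unsigned int i = 0; i < count; i++)
        out[i] = in[i] * 0.1f;
}

/* Outer loop kept in the shape used in the question; only the inner
   loop has moved into the helper. process_all is a hypothetical name. */
void process_all(const float *input, float *output)
{
    unsigned int pos = 0;
    while (1) {
        unsigned int rest = 100;
        scale_block(input + pos, output + pos, rest);
        pos += rest;
        if (pos > 10000)
            break;
    }
}
```

As in the question, this processes elements 0 through 10099, so a call such as process_all(input, output) needs both buffers to hold at least 10100 floats.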

2011-02-17 01:02:22
