快速生成嵌套for循環

爲可變數量的組合/可變指數循環快速生成嵌套for循環

#include <iostream> 
#include <array> 

template<size_t ... Rest> 
inline void index_generator() { 
    constexpr int size = sizeof...(Rest); 
    std::array<int,size> maxes = {Rest...}; 
    std::array<int,size> a; 
    int i,j; 
    std::fill(a.begin(),a.end(),0); 

    while(1) 
    { 
     for(i = 0; i<size; i++) { 
      std::cout << a[i] << " "; 
     } 
     std::cout << "\n"; 

     for(j = size-1 ; j>=0 ; j--) 
     { 
      if(++a[j]<maxes[j]) 
       break; 
      else 
       a[j]=0; 
     } 
     if(j<0) 
      break; 
    } 
} 

int main() 
{ 
    index_generator<2,3,3>(); 
    return 0; 
}

，其輸出如下

這確實相當於有

for (int i=0; i<2; ++i) 
    for (int j=0; j<3; ++j) 
     for (int k=0; i<3; ++k)

我可以生成等效使用上述方法的任何數量的nested for loops的nt，但是我已經注意到，隨着循環次數的增加，這個代碼與其等同的對應物（即，嵌套for循環）。我用gcc 5.3和clang 3.8檢查了兩者。也許這是由於處理器很難預測while(true)中的分支或者其他的東西。

我在最裏面的循環中所做的一般是訪問兩個數組中的數據並對它們進行乘法運算，如c_ptr[idx] +=a_ptr[idx]*b_ptr[idx]。由於使用嵌套for循環和使用上述技術生成的索引是相同的，所以內存訪問模式保持不變。所以我很確定，就數據訪問而言，這不是緩存未命中/命中問題。

所以我的問題是：

有沒有辦法以最快的速度嵌套for循環的代碼風格或潛在的甚至更快產生這些組合/指數？
由於我們知道要建立的for循環的數量以及for循環的索引在編譯時是已知的，因此不能利用更好的優化機會？比如SIMD？

來源

2016-05-20 Tasam Farkie

多少嵌套的循環，你呢？ – Jarod42

嵌套循環的數量沒有限制。 –

您可以使用所有維度的乘法的單個循環來生成它，並對最終索引使用模數。

#include <iostream> 
#include <array> 

template<size_t ... Rest> 
inline void index_generator() { 
    constexpr int size = sizeof...(Rest); 
    std::array<int, size> maxes = { Rest... }; 
    int total = 1; 
    for (int i = 0; i<size; ++i) { 
     total *= maxes[i]; 
    } 

    for (int i = 0; i < total; ++i) { 
     int remaining = total; 
     for (int n = 0; n < size; ++n) { 
      remaining /= maxes[n]; 
      std::cout << (i/remaining) % maxes[n] << " "; 
     } 
     std::cout << std::endl; 
    } 
}

或者只是生成遞歸模板來實際生成嵌套循環，並讓編譯器爲您優化它。這取決於指數的實際使用情況。現在你的功能不是太有用。

編輯：

基準上述三種溶液中，首先是一個在的問題，第二是我沒有陣列，以及thirs是遞歸的模板。最後一個有一個故障，它有點難以訪問使用的實際參數，但並非不可能。還必須添加總和計算，以免遭到優化，並且必須刪除控制檯輸出以減少基準測試中的影響。結果來自我的i7機器發佈模式（VS 2015社區）和下面的給定設置。 LOG和PROFILE_SCOPE是我的宏。

#include <array> 

// Original from the question 
template<size_t ... Rest> 
inline void index_generator1() { 
    constexpr int size = sizeof...(Rest); 
    std::array<int, size> maxes = { Rest... }; 
    std::array<int, size> a; 
    int i, j; 
    std::fill(a.begin(), a.end(), 0); 

    int x = 0; 

    while (1) { 
     for (i = 0; i < size; i++) { 
      x += a[i]; 
     } 

     for (j = size - 1; j >= 0; j--) { 
      if (++a[j] < maxes[j]) 
       break; 
      else 
       a[j] = 0; 
     } 
     if (j < 0) 
      break; 
    } 

    LOG(x) 
} 

// Initial try 
template<size_t ... Rest> 
inline void index_generator2() { 
    constexpr int size = sizeof...(Rest); 

    int x = 0; 

    std::array<int, size> maxes = { Rest... }; 
    int total = 1; 
    for (int i = 0; i < size; ++i) { 
     total *= maxes[i]; 
    } 

    for (int i = 0; i < total; ++i) { 
     int remaining = total; 
     for (int n = 0; n < size; ++n) { 
      remaining /= maxes[n]; 
      x += (i/remaining) % maxes[n]; 
     } 
    } 

    LOG(x) 
} 


// Recursive templates 
template <int... Args> 
struct Impl; 

template <int First, int... Args> 
struct Impl<First, Args...> 
{ 
    static int Do(int sum) 
    { 
     int x = 0; 
     for (int i = 0; i < First; ++i) { 
      x += Impl<Args...>::Do(sum + i); 
     } 

     return x; 
    } 
}; 

template <> 
struct Impl<> 
{ 
    static int Do(int sum) 
    { 
     return sum; 
    } 
}; 

template <int... Args> 
void index_generator3() 
{ 
    LOG(Impl<Args...>::Do(0)); 
}

執行的代碼在控制檯

{ 
    PROFILE_SCOPE(Index1) 
    index_generator1<200, 3, 400, 20>(); 
} 
{ 
    PROFILE_SCOPE(Index2) 
    index_generator2<200, 3, 400, 20>(); 
} 
{ 
    PROFILE_SCOPE(Index3) 
    index_generator3<200, 3, 400, 20>(); 
}

結果：

[19:35:50]: 1485600000 
[19:35:50]: 1485600000 
[19:35:50]: 1485600000 

[19:35:56]:   PerCall(ms) 
[19:35:56]: Index1  10.4016 
[19:35:56]: Index2  75.3770 
[19:35:56]: Index3  4.2299

來源

2016-05-20 19:32:45 Matzi

遞歸模板似乎是一個好主意。同時也證明了爲什麼C++是一種糟糕的語言。但是一個好主意。 – zmbq

@zmbq不要責怪語言，責怪程序員。 OP表明他的特殊模板解決方案比簡單的嵌套for循環產生更糟的代碼。那是誰的錯？ – user6362820

@Matzi你的解決方案確實比我上面發佈的慢很多，更不用說嵌套版本本身了。你的解決方案的真正問題是你使用了兩個'division's，並且消耗太多的CPU週期 –

快速生成嵌套for循環

回答

相關問題