Allocating arrays slows down computation

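I'm working on code that uses cosine similarity to find, as quickly as possible, the most similar vectors between two sets.
The code uses raw arrays (for speed and simplicity), and I started to notice that the program got slower as I allocated more arrays, even though I had not changed the computation at all. I managed to distill the program down to about a hundred lines without losing the problem: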

#include <cstdint>  // int64_t
#include <cstdlib>  // std::rand, RAND_MAX
#include <iostream>

const int vec_len = 192; 

struct fvec 
{ 
    int64_t nvec; 
    short int **vecs; 
#ifdef PARTIALS 
    int **partials; 
#endif 
    fvec(int size) 
    { 
     nvec = size; 
     vecs = new short int *[nvec]; 
#ifdef PARTIALS 
     partials = new int *[nvec]; 
#endif 
     for (int64_t i = 0; i < nvec; i++) 
     { 
      vecs[i] = new short int[vec_len]; 
#ifdef PARTIALS 
      partials[i] = new int[vec_len]; 
#endif 
      for (int j = 0; j < vec_len; j++) vecs[i][j] = std::rand() * 10000/RAND_MAX; 
     } 
    } 
    ~fvec() 
    { 
     for (int64_t i = 0; i < nvec; i++) 
     { 
      delete[] vecs[i]; 
#ifdef PARTIALS 
      delete[] partials[i]; 
#endif 
     } 
     delete[] vecs; 
#ifdef PARTIALS 
     delete[] partials; 
#endif 
    } 
}; 

struct cvec 
{ 
    int nvec; 
    short int **vecs; 
#ifdef PARTIALS 
    int **partials; 
#endif 
    cvec(int size) 
    { 
     nvec = size; 
     vecs = new short int *[nvec]; 
#ifdef PARTIALS 
     partials = new int *[nvec]; 
#endif 
     for (int nv = 0; nv < nvec; nv++) 
     { 
      vecs[nv] = new short int[vec_len]; 
#ifdef PARTIALS 
      partials[nv] = new int[vec_len]; 
#endif 
      for (int i = 0; i < vec_len; i++) vecs[nv][i] = std::rand() * 10000/RAND_MAX; 
     } 
    } 
    ~cvec() 
    { 
     for (int i = 0; i < nvec; i++) 
     { 
      delete[] vecs[i]; 
#ifdef PARTIALS 
      delete[] partials[i]; 
#endif 
     } 
     delete[] vecs; 
#ifdef PARTIALS 
     delete[] partials; 
#endif 
    } 
}; 

float sim(short int *a, short int *b) 
{ 
    int ret = 0; 
    for (int i = 0; i < vec_len; i++) ret += a[i] * b[i]; 
    return ret; 
} 

void iterative_nn(const cvec &c, const fvec &f, int *results) 
{ 
    for (int64_t i = 0; i < f.nvec; i++) 
    { 
     results[i] = 0; 
     for (int j = 0; j < c.nvec; j++) 
     { 
      float tmpsim = sim(f.vecs[i], c.vecs[j]); 
      if (tmpsim > results[i]) results[i] = tmpsim; 
     } 
     if (i % 100 == 0) std::cout << "\r" << i << std::flush; 
    } 
} 

int main(int argc, char **argv) 
{ 
    int res[5000]; 
    iterative_nn(cvec{100000}, fvec{5000}, res); 
    std::cout << "\n"; 
    return 0; 
}

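As you can see, I have two classes that each hold a set of arrays. I fill both sets of arrays with random values (for demonstration purposes) and then call a function that iterates over all of the arrays and computes their similarity.
When I add another set of arrays to each class by specifying -DPARTIALS on the command line, the program slows down to about half speed on my machine. Obviously, the only lines touched by that directive are the ones that allocate and free the additional arrays!
Furthermore, the extra time is not spent on allocation and deallocation, which take under a second in both cases. The extra time is spent in the iterative search, which is not affected by the directive (or so I thought). So my question is: why does merely allocating extra arrays cut my program's speed in half?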

The code above expects to be compiled with -std=c++11. With -O3 it runs in about 25 seconds without -DPARTIALS and in about 1 minute with it.
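One way to back up the claim that the extra time goes into the search rather than into allocation is to time each phase separately. The following is only a minimal sketch: `seconds_taken` is a hypothetical helper that does not appear in the original program, and it measures wall-clock time with `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of the original program): runs `work` once
// and returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds_taken(F &&work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    // steady_clock is monotonic, so the measured duration is never negative.
    assert(elapsed.count() >= 0.0);
    return elapsed.count();
}

// Possible use with the types from the question (assumed to be in scope):
//   cvec c(100000);
//   fvec f(5000);
//   int res[5000];
//   double search_time = seconds_taken([&] { iterative_nn(c, f, res); });
```

Wrapping the construction of `cvec` and `fvec` in the same helper would show how much of the total run time the allocations themselves account for.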

Source

2013-10-09 Matthew Schauer

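Dynamic allocation and deallocation are expensive. `std::vector` would help, but since the sizes are fixed, why not just use a plain array on the stack or `std::array`? –

So, you're confused because you expected dynamic memory allocation to be free in terms of performance? Is that the same as expecting `new` to be implemented in 0 lines of code? Otherwise, if you accept that it does some work on your behalf, why does it surprise you that the work takes time? I'm confused by your confusion. :) Avoiding dynamic memory allocation in "inner loop" code is a very basic optimization technique, which it seems you have just been taught. – unwind

@unwind Well, the OP claims the slowdown is *not* in the allocation but in the search (which doesn't actually use the allocated arrays). Still, it would certainly be interesting to see where that belief comes from. – Angew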
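The suggestion in the comments to avoid one `new` per row could be sketched roughly as follows. This is an illustration under assumptions, not the asker's code: `flat_vecs` and `dot` are hypothetical names, and all rows live in a single `std::vector<short>` so that consecutive rows are always adjacent in memory, whatever else the program allocates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int vec_len = 192;

// Hypothetical replacement for the per-row allocations in cvec/fvec:
// every row is stored in one contiguous block of nvec * vec_len values.
struct flat_vecs
{
    std::size_t nvec;
    std::vector<short> data; // zero-initialized, row i starts at i * vec_len

    explicit flat_vecs(std::size_t n)
        : nvec(n), data(n * static_cast<std::size_t>(vec_len)) {}

    short *row(std::size_t i)
    {
        assert(i < nvec);
        return data.data() + i * static_cast<std::size_t>(vec_len);
    }
};

// Dot product of two rows, accumulated in 64 bits so that sums of products
// of values up to 10000 cannot overflow.
long long dot(const short *a, const short *b)
{
    long long ret = 0;
    for (int i = 0; i < vec_len; i++)
        ret += static_cast<long long>(a[i]) * b[i];
    return ret;
}
```

With this layout the inner loop of the search walks memory sequentially from one row to the next, and allocating further arrays elsewhere cannot change the spacing between rows.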

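Two factors cause the performance drop:

1. More cache misses occur when the CPU loads data from memory in the computation loop.
2. The time spent in new and delete.

I moved the code below into a separate loop, and that significantly improved performance. I believe this is because of item #1.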

#ifdef PARTIALS 
      partials[nv] = new int[vec_len]; 
#endif

Original code without PARTIALS: 1m16s.
Original code with PARTIALS: 1m40s.
Separate loop without PARTIALS: 1m16s.
Separate loop with PARTIALS: 1m20s.

So in my case the new/delete overhead (#2) takes about 4 seconds, and the cache misses (#1) take about 20 seconds.

The changed code is as follows (I built with -O3 but without -std=c++11):

#include <cstdint>  // int64_t
#include <cstdlib>  // std::rand, RAND_MAX
#include <iostream>

const int vec_len = 192; 

struct fvec 
{ 
    int64_t nvec; 
    short int **vecs; 
#ifdef PARTIALS 
    int **partials; 
#endif 
    fvec(int size) 
    { 
     nvec = size; 
     vecs = new short int *[nvec]; 
#ifdef PARTIALS 
     partials = new int *[nvec]; 
#endif 
#ifdef PARTIALS // <<<<< allocate the partials here, in a separate loop
     for (int64_t i = 0; i < nvec; i++) 
     { 
      partials[i] = new int[vec_len]; 
     } 
#endif 
     for (int64_t i = 0; i < nvec; i++) 
     { 
      vecs[i] = new short int[vec_len]; 
      for (int j = 0; j < vec_len; j++) vecs[i][j] = std::rand() * 10000/RAND_MAX; 
     } 
    } 
    ~fvec() 
    { 
     for (int64_t i = 0; i < nvec; i++) 
     { 
      delete[] vecs[i]; 
#ifdef PARTIALS 
      delete[] partials[i]; 
#endif 
     } 
     delete[] vecs; 
#ifdef PARTIALS 
     delete[] partials; 
#endif 
    } 
}; 

struct cvec 
{ 
    int nvec; 
    short int **vecs; 
#ifdef PARTIALS 
    int **partials; 
#endif 
    cvec(int size) 
    { 
     nvec = size; 
     vecs = new short int *[nvec]; 
#ifdef PARTIALS 
     partials = new int *[nvec]; 
#endif 

#ifdef PARTIALS // <<<<< allocate the partials here, in a separate loop
     for (int nv = 0; nv < nvec; nv++) 
     { 
      partials[nv] = new int[vec_len]; 
     } 
#endif 

     for (int nv = 0; nv < nvec; nv++) 
     { 
      vecs[nv] = new short int[vec_len]; 
      for (int i = 0; i < vec_len; i++) vecs[nv][i] = std::rand() * 10000/RAND_MAX; 
     } 
    } 
    ~cvec() 
    { 
#ifdef PARTIALS 
     for (int i = 0; i < nvec; i++) 
     { 
      delete[] partials[i]; 
     } 
#endif 

     for (int i = 0; i < nvec; i++) 
     { 
      delete[] vecs[i]; 
     } 
     delete[] vecs; 
#ifdef PARTIALS 
     delete[] partials; 
#endif 
    } 
}; 

float sim(short int *a, short int *b) 
{ 
    int ret = 0; 
    for (int i = 0; i < vec_len; i++) ret += a[i] * b[i]; 
    return ret; 
} 

void iterative_nn(const cvec &c, const fvec &f, int *results) 
{ 
    for (int64_t i = 0; i < f.nvec; i++) 
    { 
     results[i] = 0; 
     for (int j = 0; j < c.nvec; j++) 
     { 
      float tmpsim = sim(f.vecs[i], c.vecs[j]); 
      if (tmpsim > results[i]) results[i] = tmpsim; 
     } 
     if (i % 100 == 0) std::cout << "\r" << i << std::flush; 
    } 
} 

int main(int argc, char **argv) 
{ 
    int res[5000]; 
    iterative_nn(cvec(100000), fvec(5000), res); 
    std::cout << "\n"; 
    return 0; 
}

Source

2013-10-09 10:14:00 ZijingWu

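Wow, you're right! Thanks! But I'm still not sure about the reason. I figured cache misses might be involved, but if the partials memory is never used, shouldn't it be evicted from the cache quickly? –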

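@MatthewSchauer The reason is that the memory allocated without `PARTIALS` has good contiguity. You can print out the address of each `vecs[i]` to check; on my machine the blocks are laid out contiguously without `PARTIALS`. I would also say that the memory contiguity in `cvec` is the main cause, because `cvec` is accessed in the inner loop of `iterative_nn`. Disabling only the PARTIALS-guarded code inside `cvec` makes the program run almost as fast as it does with no partials at all. – ZijingWu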

@MatthewSchauer Another thing that can confirm this: if you change `i++` to `i--` in the loop of `sim`, it runs in about 1m40s even without PARTIALS. – ZijingWu
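The address check suggested in the comments above can be sketched as follows. This is an assumption-laden illustration: `row_strides` and `print_strides` are hypothetical helpers, and the distances they report depend entirely on the allocator and platform, so no particular values should be expected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

const int vec_len = 192;

// Hypothetical helper: allocates `count` rows the way the question's
// constructors do. When `interleave` is true, an int[vec_len] block is
// allocated after every row, as happens with -DPARTIALS. Returns the byte
// distance between each pair of consecutive row addresses.
std::vector<std::intptr_t> row_strides(int count, bool interleave)
{
    assert(count >= 0);
    std::vector<short *> rows;
    std::vector<int *> partials;
    // Reserve up front so the bookkeeping vectors do not allocate
    // in between the rows being measured.
    rows.reserve(static_cast<std::size_t>(count));
    partials.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; i++)
    {
        rows.push_back(new short[vec_len]);
        if (interleave) partials.push_back(new int[vec_len]);
    }

    std::vector<std::intptr_t> strides;
    for (std::size_t i = 1; i < rows.size(); i++)
    {
        // Compare the addresses as integers, because subtracting pointers
        // into different arrays is not defined.
        strides.push_back(reinterpret_cast<std::intptr_t>(rows[i]) -
                          reinterpret_cast<std::intptr_t>(rows[i - 1]));
    }

    for (std::size_t i = 0; i < rows.size(); i++) delete[] rows[i];
    for (std::size_t i = 0; i < partials.size(); i++) delete[] partials[i];
    return strides;
}

// Prints one stride per line.
void print_strides(const std::vector<std::intptr_t> &strides)
{
    for (std::size_t i = 0; i < strides.size(); i++)
        std::cout << strides[i] << "\n";
}
```

Calling `print_strides(row_strides(10, false))` and then `print_strides(row_strides(10, true))` lets you compare how far apart the rows land in each case on your own machine.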
