如何解釋緩存未命中的cachegrind輸出？

出於好奇，我跑了幾個不同版本的矩陣乘法編碼，並運行cachegrind。在下面的結果中，我想知道哪些部分是L1，L2，L3缺失和參考，以及它們的真正含義？下面是我的矩陣乘法的代碼，以防萬一需要。如何解釋緩存未命中的cachegrind輸出？

#define SLOWEST 
==6933== Cachegrind, a cache and branch-prediction profiler 
==6933== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==6933== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==6933== Command: ./a.out 500 
==6933== 
--6933-- warning: L3 cache found, using its data for the LL simulation. 
--6933-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 60.7487 seconds. 
==6933== 
==6933== I refs:  6,039,791,314 
==6933== I1 misses:   1,611 
==6933== LLi misses:   1,519 
==6933== I1 miss rate:   0.00% 
==6933== LLi miss rate:   0.00% 
==6933== 
==6933== D refs:  2,892,704,678 (2,763,005,485 rd + 129,699,193 wr) 
==6933== D1 misses:  136,223,560 ( 136,174,705 rd +  48,855 wr) 
==6933== LLd misses:   53,675 (  5,247 rd +  48,428 wr) 
==6933== D1 miss rate:   4.7% (   4.9%  +   0.0% ) 
==6933== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==6933== 
==6933== LL refs:   136,225,171 ( 136,176,316 rd +  48,855 wr) 
==6933== LL misses:   55,194 (  6,766 rd +  48,428 wr) 
==6933== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define SLOWER 
==8463== Cachegrind, a cache and branch-prediction profiler 
==8463== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==8463== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==8463== Command: ./a.out 500 
==8463== 
--8463-- warning: L3 cache found, using its data for the LL simulation. 
--8463-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 49.7397 seconds. 
==8463== 
==8463== I refs:  4,537,213,120 
==8463== I1 misses:   1,571 
==8463== LLi misses:   1,487 
==8463== I1 miss rate:   0.00% 
==8463== LLi miss rate:   0.00% 
==8463== 
==8463== D refs:  2,891,485,608 (2,761,862,312 rd + 129,623,296 wr) 
==8463== D1 misses:  59,961,522 ( 59,913,256 rd +  48,266 wr) 
==8463== LLd misses:   53,113 (  5,246 rd +  47,867 wr) 
==8463== D1 miss rate:   2.0% (   2.1%  +   0.0% ) 
==8463== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==8463== 
==8463== LL refs:   59,963,093 ( 59,914,827 rd +  48,266 wr) 
==8463== LL misses:   54,600 (  6,733 rd +  47,867 wr) 
==8463== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define SLOW 
==9174== Cachegrind, a cache and branch-prediction profiler 
==9174== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==9174== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==9174== Command: ./a.out 500 
==9174== 
--9174-- warning: L3 cache found, using its data for the LL simulation. 
--9174-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 35.8901 seconds. 
==9174== 
==9174== I refs:  3,039,713,059 
==9174== I1 misses:   1,570 
==9174== LLi misses:   1,486 
==9174== I1 miss rate:   0.00% 
==9174== LLi miss rate:   0.00% 
==9174== 
==9174== D refs:  1,893,235,586 (1,763,112,301 rd + 130,123,285 wr) 
==9174== D1 misses:  63,285,950 ( 62,987,684 rd +  298,266 wr) 
==9174== LLd misses:   53,113 (  5,246 rd +  47,867 wr) 
==9174== D1 miss rate:   3.3% (   3.5%  +   0.2% ) 
==9174== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==9174== 
==9174== LL refs:   63,287,520 ( 62,989,254 rd +  298,266 wr) 
==9174== LL misses:   54,599 (  6,732 rd +  47,867 wr) 
==9174== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define MEDIUM 
==7838== Cachegrind, a cache and branch-prediction profiler 
==7838== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==7838== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==7838== Command: ./a.out 500 
==7838== 
--7838-- warning: L3 cache found, using its data for the LL simulation. 
--7838-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 23.4097 seconds. 
==7838== 
==7838== I refs:  2,548,967,151 
==7838== I1 misses:   1,610 
==7838== LLi misses:   1,522 
==7838== I1 miss rate:   0.00% 
==7838== LLi miss rate:   0.00% 
==7838== 
==7838== D refs:  1,399,237,303 (1,267,363,440 rd + 131,873,863 wr) 
==7838== D1 misses:   592,807 (  293,091 rd +  299,716 wr) 
==7838== LLd misses:   53,147 (  5,248 rd +  47,899 wr) 
==7838== D1 miss rate:   0.0% (   0.0%  +   0.2% ) 
==7838== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==7838== 
==7838== LL refs:    594,417 (  294,701 rd +  299,716 wr) 
==7838== LL misses:   54,669 (  6,770 rd +  47,899 wr) 
==7838== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define MEDIUMISH 
==8438== Cachegrind, a cache and branch-prediction profiler 
==8438== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==8438== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==8438== Command: ./a.out 500 
==8438== 
--8438-- warning: L3 cache found, using its data for the LL simulation. 
--8438-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 24.0327 seconds. 
==8438== 
==8438== I refs:  2,550,211,553 
==8438== I1 misses:   1,576 
==8438== LLi misses:   1,488 
==8438== I1 miss rate:   0.00% 
==8438== LLi miss rate:   0.00% 
==8438== 
==8438== D refs:  1,400,107,343 (1,267,610,303 rd + 132,497,040 wr) 
==8438== D1 misses:   339,977 (  42,583 rd +  297,394 wr) 
==8438== LLd misses:   53,114 (  5,248 rd +  47,866 wr) 
==8438== D1 miss rate:   0.0% (   0.0%  +   0.2% ) 
==8438== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==8438== 
==8438== LL refs:    341,553 (  44,159 rd +  297,394 wr) 
==8438== LL misses:   54,602 (  6,736 rd +  47,866 wr) 
==8438== LL miss rate:   0.0% (   0.0%  +   0.0% )

矩陣乘法碼。

#if defined(SLOWEST) 
    void multiply (float **A, float **B, float **out, int size) { 
     for (int row=0;row<size;row++) 
      for (int col=0;col<size;col++) 
       for (int in=0;in<size;in++) 
        out[row][col] += A[row][in] * B[in][col]; 
    } 
// Takes in 1-D arrays, same as before. 
#elif defined(SLOWER) 
    void multiply (float *A, float *B, float *out, int size) { 
     for (int row=0;row<size;row++) 
      for (int col=0;col<size;col++) 
       for (int in=0;in<size;in++) 
        out[row * size + col] += A[row * size + in] * B[in * size + col]; 
    } 
// Flips first and second loops 
#elif defined(SLOW) 
    void multiply (float *A, float *B, float *out, int size) { 
     for (int col=0;col<size;col++) 
      for (int row=0;row<size;row++) { 
       float curr = 0; // prevents from calculating position each time through 
       for (int in=0;in<size;in++) 
        curr += A[row * size + in] * B[in *size + col]; 
       out[row * size + col] = curr; 
      } 
    } 
#elif defined(MEDIUM) 
    // Keeps it organized for future codes. 
    float dotProduct(float *A, float *B, int size) { 
     float curr = 0; 

     for (int i=0;i<size;i++) 
      curr += A[i] * B[i]; 

     return curr; 
    } 
    void multiply (float *A, float *B, float *out, int size) { 
     float *temp = new float[size]; 

     for (int col=0;col<size;col++) { 
      for (int i=0;i<size;i++) // stores column into sequential array 
       temp[i] = B[i * size + col]; 
      for (int row=0;row<size;row++) 
       out[row * size + col] = dotProduct(&A[row], temp, size); // uses function above for dot product. 
     } 

     delete[] temp; 
    } 
#elif defined(MEDIUMISH) 
    float dotProduct(float *A, float *B, int size) { 
     float curr = 0; 

     for (int i=0;i<size;i++) 
      curr += A[i] * B[i]; 

     return curr; 
    } 
    void multiply (float *A, float *B, float *out, int size) { 
     for (int i=0;i<size-1;i++) 
      for (int j=i+1;j<size;j++) 
       std::swap(B[i * size + j], B[j * size + i]); 

     for (int col=0;col<size;col++) 
      for (int row=0;row<size;row++) 
       out[row * size + col] = dotProduct(&A[row], &B[row], size); // uses function above for dot product. 
    } 
#elif defined(FAST) 

#elif defined(FASTER) 

#endif

來源

2013-11-24 Kevin Melkowski

按照documentation cachegrind只有模擬第一和最後一級緩存：

Cachegrind模擬程序如何與一臺機器的緩存層次和（可選）分支預測器進行交互。它模擬一臺機器，它具有獨立的第一級指令和數據高速緩存（I1和D1），，由統一的二級高速緩存（L2）支持。這與許多現代機器的配置完全匹配。

但是，一些現代機器有三級或四級緩存。對於這些機器（Cachegrind可以自動檢測緩存配置的）Cachegrind模擬一級緩存和最後一級緩存。這種選擇的原因在於最後一級緩存對運行時影響最大，因爲它掩蓋了對主存儲器的訪問。此外，L1高速緩存通常具有低關聯性，因此模擬它們可以檢測代碼與該高速緩存非常不利地交互的情況（例如，以行長度爲列的方式遍歷矩陣爲2的冪）。

這意味着你不能獲得L2信息，但只有L1和L3在你的情況。

cachegrind輸出的第一部分報告有關L1指令緩存的信息。在你所有的例子中，L1指令緩存未命中的數量是微不足道的，錯過率總是0％。這意味着您的所有程序都適合您的L1指令緩存。

輸出的第二部分報告有關L1和LL（最後一級緩存，您的案例中的L3）數據緩存的信息。使用D1命中率：信息，您應該看到您的矩陣乘法算法的版本是「最高速高效」

cachegrind輸出summs上漲約LL信息的最後一部分（最後一級緩存，L3在你的情況）用於指令和數據。因此它提供了內存訪問的數量和緩存服務的內存請求的百分比。

來源

2014-01-13 07:57:54

如何解釋緩存未命中的cachegrind輸出？

回答

相關問題