2013-11-24 28 views
12

出於好奇,我跑了幾個不同版本的矩陣乘法編碼,並運行cachegrind。在下面的結果中,我想知道哪些部分是L1,L2,L3缺失和參考,以及它們的真正含義?下面是我的矩陣乘法的代碼,以防萬一需要。如何解釋緩存未命中的cachegrind輸出?

#define SLOWEST 
==6933== Cachegrind, a cache and branch-prediction profiler 
==6933== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==6933== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==6933== Command: ./a.out 500 
==6933== 
--6933-- warning: L3 cache found, using its data for the LL simulation. 
--6933-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 60.7487 seconds. 
==6933== 
==6933== I refs:  6,039,791,314 
==6933== I1 misses:   1,611 
==6933== LLi misses:   1,519 
==6933== I1 miss rate:   0.00% 
==6933== LLi miss rate:   0.00% 
==6933== 
==6933== D refs:  2,892,704,678 (2,763,005,485 rd + 129,699,193 wr) 
==6933== D1 misses:  136,223,560 ( 136,174,705 rd +  48,855 wr) 
==6933== LLd misses:   53,675 (  5,247 rd +  48,428 wr) 
==6933== D1 miss rate:   4.7% (   4.9%  +   0.0% ) 
==6933== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==6933== 
==6933== LL refs:   136,225,171 ( 136,176,316 rd +  48,855 wr) 
==6933== LL misses:   55,194 (  6,766 rd +  48,428 wr) 
==6933== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define SLOWER 
==8463== Cachegrind, a cache and branch-prediction profiler 
==8463== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==8463== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==8463== Command: ./a.out 500 
==8463== 
--8463-- warning: L3 cache found, using its data for the LL simulation. 
--8463-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 49.7397 seconds. 
==8463== 
==8463== I refs:  4,537,213,120 
==8463== I1 misses:   1,571 
==8463== LLi misses:   1,487 
==8463== I1 miss rate:   0.00% 
==8463== LLi miss rate:   0.00% 
==8463== 
==8463== D refs:  2,891,485,608 (2,761,862,312 rd + 129,623,296 wr) 
==8463== D1 misses:  59,961,522 ( 59,913,256 rd +  48,266 wr) 
==8463== LLd misses:   53,113 (  5,246 rd +  47,867 wr) 
==8463== D1 miss rate:   2.0% (   2.1%  +   0.0% ) 
==8463== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==8463== 
==8463== LL refs:   59,963,093 ( 59,914,827 rd +  48,266 wr) 
==8463== LL misses:   54,600 (  6,733 rd +  47,867 wr) 
==8463== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define SLOW 
==9174== Cachegrind, a cache and branch-prediction profiler 
==9174== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==9174== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==9174== Command: ./a.out 500 
==9174== 
--9174-- warning: L3 cache found, using its data for the LL simulation. 
--9174-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 35.8901 seconds. 
==9174== 
==9174== I refs:  3,039,713,059 
==9174== I1 misses:   1,570 
==9174== LLi misses:   1,486 
==9174== I1 miss rate:   0.00% 
==9174== LLi miss rate:   0.00% 
==9174== 
==9174== D refs:  1,893,235,586 (1,763,112,301 rd + 130,123,285 wr) 
==9174== D1 misses:  63,285,950 ( 62,987,684 rd +  298,266 wr) 
==9174== LLd misses:   53,113 (  5,246 rd +  47,867 wr) 
==9174== D1 miss rate:   3.3% (   3.5%  +   0.2% ) 
==9174== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==9174== 
==9174== LL refs:   63,287,520 ( 62,989,254 rd +  298,266 wr) 
==9174== LL misses:   54,599 (  6,732 rd +  47,867 wr) 
==9174== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define MEDIUM 
==7838== Cachegrind, a cache and branch-prediction profiler 
==7838== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==7838== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==7838== Command: ./a.out 500 
==7838== 
--7838-- warning: L3 cache found, using its data for the LL simulation. 
--7838-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 23.4097 seconds. 
==7838== 
==7838== I refs:  2,548,967,151 
==7838== I1 misses:   1,610 
==7838== LLi misses:   1,522 
==7838== I1 miss rate:   0.00% 
==7838== LLi miss rate:   0.00% 
==7838== 
==7838== D refs:  1,399,237,303 (1,267,363,440 rd + 131,873,863 wr) 
==7838== D1 misses:   592,807 (  293,091 rd +  299,716 wr) 
==7838== LLd misses:   53,147 (  5,248 rd +  47,899 wr) 
==7838== D1 miss rate:   0.0% (   0.0%  +   0.2% ) 
==7838== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==7838== 
==7838== LL refs:    594,417 (  294,701 rd +  299,716 wr) 
==7838== LL misses:   54,669 (  6,770 rd +  47,899 wr) 
==7838== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define MEDIUMISH 
==8438== Cachegrind, a cache and branch-prediction profiler 
==8438== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==8438== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==8438== Command: ./a.out 500 
==8438== 
--8438-- warning: L3 cache found, using its data for the LL simulation. 
--8438-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 24.0327 seconds. 
==8438== 
==8438== I refs:  2,550,211,553 
==8438== I1 misses:   1,576 
==8438== LLi misses:   1,488 
==8438== I1 miss rate:   0.00% 
==8438== LLi miss rate:   0.00% 
==8438== 
==8438== D refs:  1,400,107,343 (1,267,610,303 rd + 132,497,040 wr) 
==8438== D1 misses:   339,977 (  42,583 rd +  297,394 wr) 
==8438== LLd misses:   53,114 (  5,248 rd +  47,866 wr) 
==8438== D1 miss rate:   0.0% (   0.0%  +   0.2% ) 
==8438== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==8438== 
==8438== LL refs:    341,553 (  44,159 rd +  297,394 wr) 
==8438== LL misses:   54,602 (  6,736 rd +  47,866 wr) 
==8438== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

矩陣乘法碼。

#if defined(SLOWEST) 
    void multiply (float **A, float **B, float **out, int size) { 
     for (int row=0;row<size;row++) 
      for (int col=0;col<size;col++) 
       for (int in=0;in<size;in++) 
        out[row][col] += A[row][in] * B[in][col]; 
    } 
// Takes in 1-D arrays, same as before. 
#elif defined(SLOWER) 
    void multiply (float *A, float *B, float *out, int size) { 
     for (int row=0;row<size;row++) 
      for (int col=0;col<size;col++) 
       for (int in=0;in<size;in++) 
        out[row * size + col] += A[row * size + in] * B[in * size + col]; 
    } 
// Flips first and second loops 
#elif defined(SLOW) 
    void multiply (float *A, float *B, float *out, int size) { 
     for (int col=0;col<size;col++) 
      for (int row=0;row<size;row++) { 
       float curr = 0; // prevents from calculating position each time through 
       for (int in=0;in<size;in++) 
        curr += A[row * size + in] * B[in *size + col]; 
       out[row * size + col] = curr; 
      } 
    } 
#elif defined(MEDIUM) 
    // Keeps it organized for future codes. 
    float dotProduct(float *A, float *B, int size) { 
     float curr = 0; 

     for (int i=0;i<size;i++) 
      curr += A[i] * B[i]; 

     return curr; 
    } 
    void multiply (float *A, float *B, float *out, int size) { 
     float *temp = new float[size]; 

     for (int col=0;col<size;col++) { 
      for (int i=0;i<size;i++) // stores column into sequential array 
       temp[i] = B[i * size + col]; 
      for (int row=0;row<size;row++) 
       out[row * size + col] = dotProduct(&A[row], temp, size); // uses function above for dot product. 
     } 

     delete[] temp; 
    } 
#elif defined(MEDIUMISH) 
    float dotProduct(float *A, float *B, int size) { 
     float curr = 0; 

     for (int i=0;i<size;i++) 
      curr += A[i] * B[i]; 

     return curr; 
    } 
    void multiply (float *A, float *B, float *out, int size) { 
     for (int i=0;i<size-1;i++) 
      for (int j=i+1;j<size;j++) 
       std::swap(B[i * size + j], B[j * size + i]); 

     for (int col=0;col<size;col++) 
      for (int row=0;row<size;row++) 
       out[row * size + col] = dotProduct(&A[row], &B[row], size); // uses function above for dot product. 
    } 
#elif defined(FAST) 

#elif defined(FASTER) 

#endif 

回答

16

按照documentation cachegrind只有模擬第一和最後一級緩存:

Cachegrind模擬程序如何與一臺機器的緩存 層次和(可選)分支預測器進行交互。它模擬一臺機器 ,它具有獨立的第一級指令和數據高速緩存(I1和D1), ,由統一的二級高速緩存(L2)支持。這與許多現代機器的 配置完全匹配。

但是,一些現代機器有三級或四級緩存。對於 這些機器(Cachegrind可以自動檢測緩存配置的 )Cachegrind模擬一級緩存和 最後一級緩存。這種選擇的原因在於最後一級緩存對運行時影響最大,因爲它掩蓋了對主存儲器的訪問。此外,L1高速緩存通常具有低關聯性,因此模擬它們可以檢測代碼與該高速緩存非常不利地交互的情況(例如,以行長度 爲列的方式遍歷矩陣爲2的冪)。

這意味着你不能獲得L2信息,但只有L1和L3在你的情況。

cachegrind輸出的第一部分報告有關L1指令緩存的信息。在你所有的例子中,L1指令緩存未命中的數量是微不足道的,錯過率總是0%。這意味着您的所有程序都適合您的L1指令緩存。

輸出的第二部分報告有關L1和LL(最後一級緩存,您的案例中的L3)數據緩存的信息。使用D1命中率:信息,您應該看到您的矩陣乘法算法的版本是「最高速高效」

cachegrind輸出summs上漲約LL信息的最後一部分(最後一級緩存,L3在你的情況)用於指令和數據。因此它提供了內存訪問的數量和緩存服務的內存請求的百分比。