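I am studying cache effects using a simple micro-benchmark.

I think that if N is larger than the cache size, the cache takes one miss on the first read of each cache line.

On my machine the cache line size is 64 bytes and a double is 8 bytes, so I expect N/8 cache misses in total, and cachegrind shows exactly that.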

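However, the perf tool shows a different result: only 34,256 L1 data cache load misses occur.

I suspected hardware prefetching, so I turned that feature off in the BIOS. The result is the same anyway.

I really don't know why perf reports so many fewer cache misses than cachegrind. Can someone give me a reasonable explanation? I don't understand the difference in cache miss counts between cachegrind and the perf tool.

1. Below is the simple micro-benchmark program.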

    #include <stdio.h>
    #define N 10000000

    double A[N];

    int main(){

        int i;
        double temp=0.0;

        for (i=0 ; i<N ; i++){
            temp = A[i]*A[i];
        }

        return 0;
    }

2. Below is the cachegrind output:

#> sudo perf stat -r 10 -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches ./test 

    ==27612== Cachegrind, a cache and branch-prediction profiler 
    ==27612== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al. 
    ==27612== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info 
    ==27612== Command: ./test 
    ==27612== 
    --27612-- warning: L3 cache found, using its data for the LL simulation. 
    ==27612== 
    ==27612== I refs:  110,102,998 
    ==27612== I1 misses:   728 
    ==27612== LLi misses:   720 
    ==27612== I1 miss rate:  0.00% 
    ==27612== LLi miss rate:  0.00% 
    ==27612== 
    ==27612== D refs:  70,038,455 (60,026,965 rd + 10,011,490 wr) 
    ==27612== D1 misses:  1,251,802 (1,251,288 rd +  514 wr) 
    ==27612== LLd misses:  1,251,624 (1,251,137 rd +  487 wr) 
    ==27612== D1 miss rate:   1.7% (  2.0%  +  0.0% ) 
    ==27612== LLd miss rate:   1.7% (  2.0%  +  0.0% ) 
    ==27612== 
    ==27612== LL refs:   1,252,530 (1,252,016 rd +  514 wr) 
    ==27612== LL misses:  1,252,344 (1,251,857 rd +  487 wr) 
    ==27612== LL miss rate:   0.6% (  0.7%  +  0.0% ) 

    Generate a report File 
    -------------------------------------------------------------------------------- 
    I1 cache:   32768 B, 64 B, 4-way associative 
    D1 cache:   32768 B, 64 B, 8-way associative 
    LL cache:   8388608 B, 64 B, 16-way associative 
    Command:   ./test 
    Data file:  cache_block 
    Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw 
    Events shown:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw 
    Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw 
    Thresholds:  0.1 100 100 100 100 100 100 100 100 
    Include dirs:  
    User annotated: /home/jin/1_dev/99_test/OI/test.s 
    Auto-annotation: off 

-------------------------------------------------------------------------------- 
     Ir I1mr ILmr   Dr  D1mr  DLmr   Dw D1mw DLmw 
-------------------------------------------------------------------------------- 
110,102,998 728 720 60,026,965 1,251,288 1,251,137 10,011,490 514 487 PROGRAM TOTALS 

-------------------------------------------------------------------------------- 
     Ir I1mr ILmr   Dr  D1mr  DLmr   Dw D1mw DLmw   file:function 
-------------------------------------------------------------------------------- 
110,000,011 1 1 60,000,003 1,250,000 1,250,000 10,000,003 0 0 /home/jin/1_dev/99_test/OI/test.s:main 

-------------------------------------------------------------------------------- 
-- User-annotated source: /home/jin/1_dev/99_test/OI/test.s 
-------------------------------------------------------------------------------- 
     Ir I1mr ILmr   Dr  D1mr  DLmr   Dw D1mw DLmw 

-- line 2 ---------------------------------------- 
     . . .   .   .   .   . . .   .comm A,80000000,32 
     . . .   .   .   .   . . . .comm B,80000000,32 
     . . .   .   .   .   . . . .text 
     . . .   .   .   .   . . . .globl main 
     . . .   .   .   .   . . . .type main, @function 
     . . .   .   .   .   . . . main: 
     . . .   .   .   .   . . . .LFB0: 
     . . .   .   .   .   . . . .cfi_startproc 
     1 0 0   0   0   0   1 0 0 pushq %rbp 
     . . .   .   .   .   . . . .cfi_def_cfa_offset 16 
     . . .   .   .   .   . . . .cfi_offset 6, -16 
     1 0 0   0   0   0   0 0 0 movq %rsp, %rbp 
     . . .   .   .   .   . . . .cfi_def_cfa_register 6 
     1 0 0   0   0   0   0 0 0 movl $0, %eax 
     1 1 1   0   0   0   1 0 0 movq %rax, -16(%rbp) 
     1 0 0   0   0   0   1 0 0 movl $0, -4(%rbp) 
     1 0 0   0   0   0   0 0 0 jmp .L2 
     . . .   .   .   .   . . . .L3: 
10,000,000 0 0 10,000,000   0   0   0 0 0 movl -4(%rbp), %eax 
10,000,000 0 0   0   0   0   0 0 0 cltq 
10,000,000 0 0 10,000,000 1,250,000 1,250,000   0 0 0 movsd A(,%rax,8), %xmm1 
10,000,000 0 0 10,000,000   0   0   0 0 0 movl -4(%rbp), %eax 
10,000,000 0 0   0   0   0   0 0 0 cltq 
10,000,000 0 0 10,000,000   0   0   0 0 0 movsd A(,%rax,8), %xmm0 
10,000,000 0 0   0   0   0   0 0 0 mulsd %xmm1, %xmm0 
10,000,000 0 0   0   0   0 10,000,000 0 0 movsd %xmm0, -16(%rbp) 
10,000,000 0 0 10,000,000   0   0   0 0 0 addl $1, -4(%rbp) 
     . . .   .   .   .   . . . .L2: 
10,000,001 0 0 10,000,001   0   0   0 0 0 cmpl $9999999, -4(%rbp) 
10,000,001 0 0   0   0   0   0 0 0 jle .L3 
     1 0 0   0   0   0   0 0 0 movl $0, %eax 
     1 0 0   1   0   0   0 0 0 popq %rbp 
     . . .   .   .   .   . . . .cfi_def_cfa 7, 8 
     1 0 0   1   0   0   0 0 0 ret 
     . . .   .   .   .   . . . .cfi_endproc 
     . . .   .   .   .   . . . .LFE0: 
     . . .   .   .   .   . . . .size main, .-main 
     . . .   .   .   .   . . . .ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3" 
     . . .   .   .   .   . . . .section .note.GNU-stack,"",@progbits 

-------------------------------------------------------------------------------- 
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw 
-------------------------------------------------------------------------------- 
100 0 0 100 100 100 100 0 0 percentage of events annotated

3. Below is the perf output:

Performance counter stats for './test' (10 runs):

113,898,951 instructions    # 0.00 insns per cycle   (+- 12.73%) [17.36%] 
     53,607 cache-references            (+- 12.92%) [29.23%] 
     1,483 cache-misses    # 2.767 % of all cache refs  (+- 26.66%) [39.84%] 
    48,612,823 L1-dcache-loads            (+- 4.58%) [50.45%] 
     34,256 L1-dcache-load-misses  # 0.07% of all L1-dcache hits (+- 18.94%) [54.38%] 
    14,992,686 L1-dcache-stores            (+- 4.90%) [52.58%] 
     1,980 L1-dcache-store-misses          (+- 6.36%) [61.83%] 
     1,154 LLC-loads              (+- 61.14%) [53.22%] 
      18 LLC-load-misses   # 1.60% of all LL-cache hits  (+- 16.26%) [10.87%] 
      0 LLC-prefetches            [ 0.00%] 

    0.037949840 seconds time elapsed           (+- 3.57%)

More experimental results (May 13, 2014):

[email protected]:~/1_dev/99_test/OI$ sudo perf stat -r 10 -e instructions -e r53024e -e r53014e -e L1-dcache-loads -e L1-dcache-load-misses -e r500f0a -e r500109 ./test 

Performance counter stats for './test' (10 runs): 

    116,464,390 instructions    # 0.00 insns per cycle   (+- 2.67%) [67.43%] 
     5,994 r53024e <-- L1D hardware prefetch misses      (+- 21.74%) [70.92%] 
    1,387,214 r53014e <-- L1D hardware prefetch requests     (+- 2.37%) [75.61%] 
    61,667,802 L1-dcache-loads            (+- 1.27%) [78.12%] 
     26,297 L1-dcache-load-misses  # 0.04% of all L1-dcache hits (+- 48.92%) [43.24%] 
      0 r500f0a <-- LLC lines allocated         [56.71%] 
     41,545 r500109 <-- Number of LLC read misses      (+- 6.16%) [50.08%] 

    0.037080925 seconds time elapsed

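In the results above, the number of "L1D hardware prefetch requests" (1,387,214) appears to be roughly the number of D1 misses (1,250,000) reported by cachegrind.

My conclusion is that the L1D prefetcher kicks in when memory is accessed in a streaming pattern. Also, because of the LLC miss numbers, I cannot determine how many bytes are actually loaded from memory.

Is my conclusion correct?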


2014-05-12 libertyjin

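Could you add the command lines you used? GCC with its options, perf stat... – amigadev

Sorry for the late reply. The command line is: #> sudo perf stat -r 10 -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches ./test – libertyjin

Are you sure your original program is not being optimized by the compiler? In particular, the main loop can be bypassed because A[i]*A[i] is not saved between iterations (using a double array for temp should fix that). I suspect the compiler is optimizing your micro-benchmark away. –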

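Bottom line: your assumption about prefetching is correct, but your workaround is not. First, as Carlo pointed out, this loop would normally be optimized away by any compiler. Since both perf and cachegrind show about 100M instructions retired, I guess you did not compile with optimization, which means the behavior is not very realistic. For example, your loop variable may be stored in memory rather than in a register, adding pointless memory accesses and skewing the cache counters.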

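Now, the difference between your runs is that cachegrind is just a cache simulator: it does not simulate prefetching, so every first access to a line misses, as expected. On the other hand, as you can see, the real CPU does have hardware prefetchers, so the first time each line is brought in from memory it is done by a prefetch (thanks to the simple streaming pattern) rather than by an actual demand load. This is why perf fails to count these accesses with the normal counters.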

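You can see that when you enable the prefetch counters, roughly the same N/8 prefetches show up (along with possibly some other types of accesses).

Disabling the prefetchers would seem to be the right thing to do, but most CPUs do not offer much control over them. You did not specify which processor you are using, but if it is an Intel one, for example, you can see here that only the L2 prefetchers are controlled by the BIOS, while your output shows L1 prefetches: https://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

Look up the manual for your CPU type to find out which L1 prefetchers exist and how to work around them. Usually a simple stride (larger than a single cache line) should be enough to fool them, but if that does not work you will need to change the access pattern to something more random. You could randomize some permutation of the indices.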


2015-05-01 07:20:49 Leeor
