2016-03-28

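Determining how many clock cycles a block of code takes

Is there a tool or method that tells me how many clock cycles a block of code uses? Debugging and counting by hand is a pain for larger blocks of code.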


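On modern processors (e.g. modern x86), this is usually not a meaningful or useful statistic, because of out-of-order execution, memory stalls, instruction caching, branch prediction, etc.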

Answer


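On x86, Intel's IACA (Intel Architecture Code Analyzer) is the only static analyzer I know of. It assumes zero cache misses and makes various other simplifications, but it is somewhat useful.

I think it also assumes that every branch except the last one is not taken, so it is probably not useful for loop bodies that contain branches.

IACA also has some errors in its data; for example, it thinks shld is slow on Sandy Bridge. It does know about some non-obvious things, such as the fact that SnB-family CPUs can't micro-fuse 2-register addressing modes.

It has been basically abandoned since the Haswell update. Skylake can run a few instructions on more execution ports than Haswell (see Agner Fog's instruction tables), but the pipeline is similar enough that the results should still be fairly useful. See also the other links in the tag wiki, including Intel's optimization manual, to help you make sense of the output.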


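I like to use this iaca.sh wrapper script to make -64 the default (I can override it with -32). I forget how much of it I wrote (maybe just the if (($# >= 1)) bit at the end) and where the LD_LIBRARY_PATH part came from.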

iaca.sh

#!/bin/bash 
myname=$(realpath "$0") 
mypath=$(dirname "$myname") 
ld_lib="$LD_LIBRARY_PATH" 
app_loc="../lib" 

if [ "$LD_LIBRARY_PATH" = "" ] 
then 
export LD_LIBRARY_PATH="$mypath/$app_loc" 
else 
export LD_LIBRARY_PATH="$mypath/$app_loc:$LD_LIBRARY_PATH" 
fi 

if (($# >= 1));then 
    exec "$mypath/iaca" -64 "$@" 
else 
    exec "$mypath/iaca" # there is no -help, just run with no args for help output 
fi 

Example: an in-place prefix sum, from "SIMD prefix sum on Intel cpu":

#include <immintrin.h> 

#ifdef IACA_MARKS_OFF 
    #define IACA_START 
    #define IACA_END 
#else 
    #include <iacaMarks.h> 
#endif 

// In-place rewrite an array of values into an array of prefix sums. 
// This makes the code simpler, and minimizes cache effects. 
int prefix_sum_sse(int data[], int n) 
{ 

// const int elemsz = sizeof(data[0]); 
#define elemsz sizeof(data[0]) // clang-3.5 doesn't allow const int foo = ... as an imm8 arg to intrinsics 

    __m128i *datavec = (__m128i*)data; 
    const int vec_elems = sizeof(*datavec)/elemsz; 
    // to use this for int8/16_t, you still need to change the add_epi32, and the shuffle 

    const __m128i *endp = (__m128i*) (data + n - 2*vec_elems); // pointer to last full vector we can load 
    __m128i carry = _mm_setzero_si128(); 
    for(; datavec <= endp ; datavec += 2) { 
     IACA_START 
     __m128i x0 = _mm_load_si128(datavec + 0); 
     __m128i x1 = _mm_load_si128(datavec + 1); // unroll/pipeline by 1 
//  __m128i x2 = _mm_load_si128(datavec + 2); 
//  __m128i x3; 

     x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, elemsz)); 
     x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, elemsz)); 

     x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, 2*elemsz)); 
     x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, 2*elemsz)); 

     // more shifting if vec_elems is larger 

     x0 = _mm_add_epi32(x0, carry); // this has to go after the byte-shifts, to avoid double-counting the carry. 
     _mm_store_si128(datavec +0, x0); // store first to allow destructive shuffle (e.g. non-avx shufps for FP or pshufb for narrow integers) 

     x1 = _mm_add_epi32(_mm_shuffle_epi32(x0, _MM_SHUFFLE(3,3,3,3)), x1); 
     _mm_store_si128(datavec +1, x1); 

     carry = _mm_shuffle_epi32(x1, _MM_SHUFFLE(3,3,3,3)); // broadcast the high element for next vector 
    } 
    // FIXME: scalar loop to handle the last few elements 
    // IACA_END goes after the loop so the analyzed block includes the loop branch 
    IACA_END 
    return data[n-1]; 
    #undef elemsz 
} 
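
The FIXME in the function above means the last n % 8 elements are never processed, because the loop consumes two 4-element vectors per iteration. Below is a minimal sketch of one way to finish them. The helper name prefix_sum_scalar_tail and its parameters are hypothetical, not part of the original answer. It also works as a plain scalar reference for checking the SIMD output when called with start = 0 and carry = 0.

```c
#include <stddef.h>

// Hypothetical helper (not from the original answer): in-place scalar
// prefix sum over data[start .. n-1]. `carry` is the running total of
// every element before `start`. Returns the total after the last element.
static int prefix_sum_scalar_tail(int data[], size_t start, size_t n, int carry)
{
    for (size_t i = start; i < n; i++) {
        carry += data[i];   // extend the running total
        data[i] = carry;    // overwrite the element with its prefix sum
    }
    return carry;
}
```

Inside prefix_sum_sse this could presumably be called after the loop as prefix_sum_scalar_tail(data, (int*)datavec - data, n, _mm_cvtsi128_si32(carry)), since every lane of carry holds the running total at that point, and its result returned in place of data[n-1].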

$ gcc -I/opt/iaca-2.1/include -Wall -O3 -c prefix-sum.c -march=nehalem -mtune=haswell 
$ iaca.sh prefix-sum.o 
Intel(R) Architecture Code Analyzer Version - 2.1 
Analyzed File - prefix-sum.o 
Binary Format - 64Bit 
Architecture - HSW 
Analysis Type - Throughput 

Throughput Analysis Report 
-------------------------- 
Block Throughput: 6.40 Cycles  Throughput Bottleneck: Port5 

Port Binding In Cycles Per Iteration: 
--------------------------------------------------------------------------------------- 
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | 
--------------------------------------------------------------------------------------- 
| Cycles | 1.0 0.0 | 5.7 | 1.4 1.0 | 1.4 1.0 | 2.0 | 6.3 | 1.0 | 1.3 | 
--------------------------------------------------------------------------------------- 

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) 
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path 
F - Macro Fusion with the previous instruction occurred 
* - instruction micro-ops not bound to a port 
^ - Micro Fusion happened 
# - ESP Tracking sync uop was issued 
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected 
! - instruction not supported, was not accounted in Analysis 

| Num Of |     Ports pressure in cycles      | | 
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | | 
--------------------------------------------------------------------------------- 
| 1 |   |  | 1.0 1.0 |   |  |  |  |  | | movdqa xmm3, xmmword ptr [rax] 
| 1 | 1.0  |  |   |   |  |  |  |  | | add rax, 0x20 
| 1 |   |  |   | 1.0 1.0 |  |  |  |  | | movdqa xmm0, xmmword ptr [rax-0x10] 
| 0* |   |  |   |   |  |  |  |  | | movdqa xmm1, xmm3 
| 1 |   |  |   |   |  | 1.0 |  |  | CP | pslldq xmm1, 0x4 
| 1 |   | 1.0 |   |   |  |  |  |  | | paddd xmm1, xmm3 
| 0* |   |  |   |   |  |  |  |  | | movdqa xmm3, xmm0 
| 1 |   |  |   |   |  | 1.0 |  |  | CP | pslldq xmm3, 0x4 
| 0* |   |  |   |   |  |  |  |  | | movdqa xmm4, xmm1 
| 1 |   | 1.0 |   |   |  |  |  |  | | paddd xmm3, xmm0 
| 1 |   |  |   |   |  | 1.0 |  |  | CP | pslldq xmm4, 0x8 
| 0* |   |  |   |   |  |  |  |  | | movdqa xmm0, xmm3 
| 1 |   | 1.0 |   |   |  |  |  |  | | paddd xmm1, xmm4 
| 1 |   |  |   |   |  | 1.0 |  |  | CP | pslldq xmm0, 0x8 
| 1 |   | 1.0 |   |   |  |  |  |  | | paddd xmm1, xmm2 
| 1 |   | 0.8 |   |   |  | 0.2 |  |  | CP | paddd xmm0, xmm3 
| 2^ |   |  |   |   | 1.0 |  |  | 1.0 | | movaps xmmword ptr [rax-0x20], xmm1 
| 1 |   |  |   |   |  | 1.0 |  |  | CP | pshufd xmm1, xmm1, 0xff 
| 1 |   | 0.9 |   |   |  | 0.1 |  |  | CP | paddd xmm0, xmm1 
| 2^ |   |  | 0.3  | 0.3  | 1.0 |  |  | 0.3 | | movaps xmmword ptr [rax-0x10], xmm0 
| 1 |   |  |   |   |  | 1.0 |  |  | CP | pshufd xmm1, xmm0, 0xff 
| 0* |   |  |   |   |  |  |  |  | | movdqa xmm2, xmm1 
| 1 |   |  |   |   |  |  | 1.0 |  | | cmp rdx, rax 
| 0F |   |  |   |   |  |  |  |  | | jnb 0xffffffffffffff94 
Total Num Of Uops: 20 

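Note that the total uop count is not the fused-domain uop count that matters for the front end, the ROB, and the 4-wide issue/retire width. It counts unfused-domain uops, which matter for the execution units (and the scheduler). This is kind of silly, because in the unfused domain what mostly matters is which port a uop needs, not how many uops there are.

This is not the best example, because it trivially bottlenecks on the shuffle port on Haswell. It does show how IACA displays mov-elimination, micro-fused stores, and macro-fused compare-and-branch, though.

The distribution of uops between ports, when there is a choice, is pretty arbitrary. Don't expect it to match real hardware. I don't think IACA really models the ROB / scheduler. Previous SO questions have discussed this and other limitations. Try searching for IACA, since it is a fairly unique string.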
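
To make the fused versus unfused distinction concrete, here is a small tally over the 24 instructions in the IACA listing above. The unfused column is copied from IACA's "Num Of Uops" column. The fused column is my own reading, not IACA output: it assumes each micro-fused store and each eliminated movdqa costs one fused-domain uop, and that the macro-fused cmp/jnb pair costs one in total. The struct and function names are made up for this sketch.

```c
#include <stddef.h>

// One entry per instruction in the listing, in program order.
struct insn_uops {
    int unfused;  // "Num Of Uops" as printed by IACA
    int fused;    // assumed fused-domain cost (not reported by IACA)
};

static const struct insn_uops loop_body[] = {
    {1, 1},  // movdqa xmm3, [rax]        load
    {1, 1},  // add rax, 0x20
    {1, 1},  // movdqa xmm0, [rax-0x10]   load
    {0, 1},  // movdqa xmm1, xmm3         eliminated, still issues
    {1, 1},  // pslldq xmm1, 0x4
    {1, 1},  // paddd xmm1, xmm3
    {0, 1},  // movdqa xmm3, xmm0         eliminated
    {1, 1},  // pslldq xmm3, 0x4
    {0, 1},  // movdqa xmm4, xmm1         eliminated
    {1, 1},  // paddd xmm3, xmm0
    {1, 1},  // pslldq xmm4, 0x8
    {0, 1},  // movdqa xmm0, xmm3         eliminated
    {1, 1},  // paddd xmm1, xmm4
    {1, 1},  // pslldq xmm0, 0x8
    {1, 1},  // paddd xmm1, xmm2
    {1, 1},  // paddd xmm0, xmm3
    {2, 1},  // movaps [rax-0x20], xmm1   micro-fused store
    {1, 1},  // pshufd xmm1, xmm1, 0xff
    {1, 1},  // paddd xmm0, xmm1
    {2, 1},  // movaps [rax-0x10], xmm0   micro-fused store
    {1, 1},  // pshufd xmm1, xmm0, 0xff
    {0, 1},  // movdqa xmm2, xmm1         eliminated
    {1, 1},  // cmp rdx, rax
    {0, 0},  // jnb                       macro-fused into the cmp
};

#define LOOP_BODY_LEN (sizeof(loop_body) / sizeof(loop_body[0]))

// Sum of the unfused-domain column (what IACA prints as the total).
static int total_unfused(void)
{
    int sum = 0;
    for (size_t i = 0; i < LOOP_BODY_LEN; i++)
        sum += loop_body[i].unfused;
    return sum;
}

// Sum of the assumed fused-domain column.
static int total_fused(void)
{
    int sum = 0;
    for (size_t i = 0; i < LOOP_BODY_LEN; i++)
        sum += loop_body[i].fused;
    return sum;
}

// Lower bound on cycles per iteration from a 4-wide issue stage.
static double frontend_cycles(void)
{
    return total_fused() / 4.0;
}
```

Under these assumptions the unfused total is 20, matching IACA's "Total Num Of Uops", while the fused-domain total is 23, which a 4-wide front end needs at least 5.75 cycles to issue. That is below the 6.40 cycles IACA reports, which is consistent with port 5, not the front end, being the bottleneck for this loop.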