PC-relative jump in gcc inline assembly

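I have an asm loop that is guaranteed never to run more than 128 iterations, and I want to unroll it using a PC-relative jump. The idea is to unroll the iterations in reverse order and then jump as far into the unrolled sequence as needed. The code would look like this: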

#define __mul(i) \
    "movq -"#i"(%3,%5,8),%%rax;" \
    "mulq "#i"(%4,%6,8);" \
    "addq %%rax,%0;" \
    "adcq %%rdx,%1;" \
    "adcq $0,%2;"

asm("jmp (128-count)*size_of_one_iteration" // I need to figure this jump out 
    __mul(127) 
    __mul(126) 
    __mul(125) 
    ... 
    __mul(1) 
    __mul(0) 
    : "+r"(lo),"+r"(hi),"+r"(overflow) 
    : "r"(a.data),"r"(b.data),"r"(i-k),"r"(k) 
    : "%rax","%rdx");

Is something like this possible with gcc inline assembly?

Source

2011-02-05 Chris

In gcc inline assembly you can use labels and let the assembler work out the jump target for you. Something like this (contrived example):

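int max(int a, int b)
{
    int result;
    __asm__ __volatile__(
        "movl %1, %0\n"              /* result = a */
        "cmpl %2, %0\n"              /* compare result (a) with b */
        "jge .La_is_larger%=\n"      /* signed a >= b: keep a */
        "movl %2, %0\n"              /* otherwise result = b */
        ".La_is_larger%=:\n"         /* %= keeps the label unique per asm instance */
        : "=&r"(result)              /* written before the inputs are last read */
        : "r"(a), "r"(b)
        : "cc");
    return (result);
}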

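That's one thing. The other thing you can do, to avoid the multiplication, is to have the assembler align the blocks for you, say to a multiple of 32 bytes (I don't think the instruction sequence fits into 16 bytes), like this: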

#define __mul(i)                        \
    ".align 32\n"                       \
    ".Lmul" #i ":\n"                    \
    "movq -" #i "(%3,%5,8),%%rax\n"     \
    "mulq " #i "(%4,%6,8)\n"            \
    "addq %%rax,%0\n"                   \
    "adcq %%rdx,%1\n"                   \
    "adcq $0,%2\n"

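This simply pads the instruction stream with nops. If you choose not to align these blocks, you can still use the generated local labels in your main expression to work out the size of an assembled block: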

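#ifdef UNALIGNED
__asm__ ("imul $(.Lmul0-.Lmul1), %[label]\n"    /* scale by the size of one block */
#else
__asm__ ("shlq $5, %[label]\n"                  /* blocks are 32 bytes each */
#endif
    "leaq .Lmulblkstart(%%rip), %[dummy]\n"     /* RIP-relative in 64-bit mode */
    "addq %[label], %[dummy]\n"
    "jmp *%[dummy]\n"                           /* register-indirect jump */
    ".align 32\n"
    ".Lmulblkstart:\n"
    __mul(127)
    ...
    __mul(0)
    /* offset: a 64-bit variable preset to 128 - count; the asm modifies it, hence "+r" */
    : ... [dummy]"=&r"(dummy), [label]"+r"(offset))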

And if count is a compile-time constant, you can even do this (from within a macro, since # only stringifies macro arguments):

__asm__("jmp .Lmul" #count "\n" ...);

One final note:

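Aligning the blocks is a good idea if the auto-generated __mul() sequences can come out with different lengths. For the constants 0..127 that you use, that is not the case, because they all fit into one byte. If you scale them up, though, the assembler has to switch to a wider (32-bit) displacement and the instruction block grows with it. By padding the instruction stream, the jump technique can still be used.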
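The aligned-block jump can be tried out in isolation with a much smaller sketch. This is not the multiply code from the question: the function name run_last_blocks, the block count of 8, the 8-byte block size, and the single incq standing in for the __mul body are all made up for illustration. It assumes x86-64 with GCC or Clang and falls back to a plain loop anywhere else.

```c
#include <assert.h>
#include <stdint.h>

/* One unrolled block: a single instruction, padded with nops to 8 bytes. */
#define DEMO_BLOCK "incq %[acc]\n\t.p2align 3\n\t"

/* Hypothetical demo: executes only the last `count` of 8 blocks (count in 0..8). */
static uint64_t run_last_blocks(unsigned count)
{
    uint64_t acc = 0;
    assert(count <= 8);
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t target;                               /* scratch for the jump address */
    uint64_t offset = (uint64_t)(8 - count) << 3;  /* blocks to skip * 8 bytes */
    __asm__ __volatile__(
        "leaq 1f(%%rip), %[target]\n\t"            /* RIP-relative address of the first block */
        "addq %[offset], %[target]\n\t"            /* skip the blocks we do not want */
        "jmp *%[target]\n\t"                       /* register-indirect jump */
        ".p2align 3\n"
        "1:\n\t"
        DEMO_BLOCK DEMO_BLOCK DEMO_BLOCK DEMO_BLOCK
        DEMO_BLOCK DEMO_BLOCK DEMO_BLOCK DEMO_BLOCK
        : [acc] "+r"(acc), [target] "=&r"(target)
        : [offset] "r"(offset)
        : "cc");
#else
    for (unsigned i = 0; i < count; i++)           /* portable fallback, no asm */
        acc++;
#endif
    return acc;
}
```

Calling run_last_blocks(3) skips five blocks and runs three. The padding after the last block matters too: with count equal to 0 the jump lands exactly at the end of the eighth padded block.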

Source

2011-02-18 16:34:36

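Sorry, I can't give the answer in AT&T syntax; I hope you can easily do the translation.

If you have the count in RCX and you can place a label just after __mul(0), then you could do this: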

; rcx must be in [0..128] range.
    imul rcx, rcx, -size_of_one_iteration ; Notice the multiplier is negative (the 64-bit form keeps the negative offset sign-extended)
    lea rcx, [rcx + the_label] ; There is no memory read here
    jmp rcx

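Hope this helps. EDIT: Yesterday I made a mistake. I had assumed that referencing a label in [rcx + the_label] would be resolved as [rcx + rip + disp], but it is not, because no such addressing mode exists (only [rip + disp32] does).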

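This code should work; in addition, it leaves RCX untouched and clobbers RAX and RDX instead (but your code does not seem to read them before first writing to them):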

; rcx must be in [0..128] range.
    imul rdx, rcx, -size_of_one_iteration ; Notice the multiplier is negative (a 32-bit imul would zero-extend the negative result and break the 64-bit add below)
    lea rax, [the_label] ; PC-relative addressing (There is no memory read here)
    add rax, rdx
    jmp rax

Source

2011-02-05 23:34:12 LocoDelAssembly

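This isn't a direct answer, but have you considered using a variant of Duff's Device instead of inline assembly? That would take the form of a switch statement: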

switch(iterations) { 
    case 128: /* code for i=128 here */ 
    case 127: /* code for i=127 here */ 
    case 126: /* code for i=126 here */ 
    /* ... */ 
    case 1: /* code for i=1 here*/ 
    break; 
    default: die("too many cases"); 
}
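A minimal runnable sketch of that switch-based variant, shrunk to at most 4 iterations. The helper dot_upto4 and its signature are hypothetical, and it does a plain multiply-accumulate rather than the 128-bit carry chain from the question. Each case falls through to the next, so entering at case n runs exactly n steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums a[i] * b[i] over the first n elements, n in 0..4. */
static long dot_upto4(const long *a, const long *b, size_t n)
{
    long sum = 0;
    assert(n <= 4);              /* plays the role of the "too many cases" check */
    switch (n) {
    case 4: sum += a[3] * b[3];  /* fall through */
    case 3: sum += a[2] * b[2];  /* fall through */
    case 2: sum += a[1] * b[1];  /* fall through */
    case 1: sum += a[0] * b[0];  /* fall through */
    case 0: break;
    }
    return sum;
}
```

The compiler is free to lower the switch to a jump table or to a computed jump, which is the same effect the hand-written asm is after.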

Source

2011-02-06 01:36:02 nelhage

I am currently using a variant of Duff's Device, but I posted this because I want to switch to an asm-only approach – Chris 2011-02-07 19:59:35
