2017-08-02 119 views
12

我編寫了一個小程序,用於測量循環中的時間(通過內聯Sparc彙編代碼片段)。在Sparc 32位上處理值> 2^32的整數

一切都是正確的,直到我設置迭代次數大約在4.0 + 9(2^32以上)以上。

下面的代碼片段:

#include <stdio.h> 
#include <sys/time.h> 
#include <unistd.h> 
#include <math.h> 
#include <stdint.h> 

int main (int argc, char *argv[]) 
{ 
    // For indices 
    int i; 
    // Set the number of executions 
    int nRunning = atoi(argv[1]); 
    // Set the sums 
    double avgSum = 0.0; 
    double stdSum = 0.0; 
    // Average of execution time 
    double averageRuntime = 0.0; 
    // Standard deviation of execution time 
    double deviationRuntime = 0.0; 

    // Init sum 
    unsigned long long int sum = 0; 
    // Number of iterations 
    unsigned long long int nLoop = 4000000000ULL; 
    //uint64_t nLoop = 4000000000; 

    // DEBUG 
    printf("sizeof(unsigned long long int) = %zu\n",sizeof(unsigned long long int)); 
    printf("sizeof(unsigned long int) = %zu\n",sizeof(unsigned long int)); 

    // Time intervals 
    struct timeval tv1, tv2; 
    double diff; 

    // Loop for multiple executions 
    for (i=0; i<nRunning; i++) 
    { 
    // Start time 
    gettimeofday (&tv1, NULL); 

    // Loop with Sparc assembly into C source 
    asm volatile ("clr %%g1\n\t" 
       "clr %%g2\n\t" 
       "mov %1, %%g1\n" // %1 = input parameter 
       "loop:\n\t" 
       "add %%g2, 1, %%g2\n\t" 
       "subcc %%g1, 1, %%g1\n\t" 
       "bne loop\n\t" 
       "nop\n\t" 
       "mov %%g2, %0\n" // %0 = output parameter 
       : "=r" (sum)  // output 
       : "r" (nLoop) // input 
       : "g1", "g2"); // clobbers 

    // End time 
    gettimeofday (&tv2, NULL); 

    // Compute runtime for loop 
    diff = (tv2.tv_sec - tv1.tv_sec) * 1000000ULL + (tv2.tv_usec - tv1.tv_usec); 

    // Summing diff time 
    avgSum += diff; 
    stdSum += (diff*diff); 

    // DEBUG 
    printf("diff = %e\n", diff); 
    printf("avgSum = %e\n", avgSum); 

    } 
    // Compute final averageRuntime 
    averageRuntime = avgSum/nRunning; 

    // Compute standard deviation 
    deviationRuntime = sqrt(stdSum/nRunning-averageRuntime*averageRuntime); 

    // Print results 
    printf("(Average Elapsed time, Standard deviation) = %e usec %e usec\n", averageRuntime, deviationRuntime); 
    // Print sum from assembly loop 
    printf("Sum = %llu\n", sum); 

例如,nLoop < 2^32,我得到正確的值diffavgSumstdSum。事實上,printf,與nLoop = 4.0e+9,得出:

sizeof(unsigned long long int) = 8 
sizeof(unsigned long int) = 4 
diff = 9.617167e+06 
avgSum = 9.617167e+06 
diff = 9.499878e+06 
avgSum = 1.911704e+07 
(Average Elapsed time, Standard deviation) = 9.558522e+06 usec 5.864450e+04 usec 
Sum = 4000000000 

的代碼被編譯在Debian Sparc 32 bits Etchgcc 4.1.2

不幸的是,如果我拿例如nLoop = 5.0e+9,我會得到測量時間小而不正確的值;這裏是在這種情況下,printf的輸出:

sizeof(unsigned long long int) = 8 
sizeof(unsigned long int) = 4 
diff = 5.800000e+01 
avgSum = 5.800000e+01 
diff = 4.000000e+00 
avgSum = 6.200000e+01 
(Average Elapsed time, Standard deviation) = 3.100000e+01 usec 2.700000e+01 usec 
Sum = 5000000000 

我不知道在哪裏的問題可能來自使用uint64_t但沒有成功,我做其他檢查。

也許問題是我用32位操作系統處理large integers (> 2^32)或者它可能是不支持8字節整數的程序集內聯代碼。

如果有人能夠給我一些線索來解決這個錯誤,

問候

更新1

@Andrew Henle的建議,我採取了同樣的代碼,但不是行內的Sparc彙編代碼片段,我只是放了一個簡單的循環。

下面是用簡單的迴路,其已得到nLoop = 5.0e+9(見行「unsigned long long int nLoop = 5000000000ULL;」的節目,所以上面的limit 2^32-1

#include <stdio.h> 
#include <stdlib.h> 
#include <sys/time.h> 
#include <unistd.h> 
#include <math.h> 
#include <stdint.h> 

int main (int argc, char *argv[]) 
{ 
    // For indices of nRunning 
    int i; 
    // For indices of nRunning 
    unsigned long long int j; 
    // Set the number of executions 
    int nRunning = atoi(argv[1]); 
    // Set the sums 
    unsigned long long int avgSum = 0; 
    unsigned long long int stdSum = 0; 
    // Average of execution time 
    double averageRuntime = 0.0; 
    // Standard deviation of execution time 
    double deviationRuntime = 0.0; 

    // Init sum 
    unsigned long long int sum; 
    // Number of iterations 
    unsigned long long int nLoop = 5000000000ULL; 

    // DEBUG 
    printf("sizeof(unsigned long long int) = %zu\n",sizeof(unsigned long long int)); 
    printf("sizeof(unsigned long int) = %zu\n",sizeof(unsigned long int)); 

    // Time intervals 
    struct timeval tv1, tv2; 
    unsigned long long int diff; 

    // Loop for multiple executions 
    for (i=0; i<nRunning; i++) 
    { 
    // Reset sum 
    sum = 0; 

    // Start time 
    gettimeofday (&tv1, NULL); 

    // Loop with Sparc assembly into C source 
    /* asm volatile ("clr %%g1\n\t" 
       "clr %%g2\n\t" 
       "mov %1, %%g1\n" // %1 = input parameter 
     "loop:\n\t" 
     "add %%g2, 1, %%g2\n\t" 
     "subcc %%g1, 1, %%g1\n\t" 
     "bne loop\n\t" 
     "nop\n\t" 
     "mov %%g2, %0\n" // %0 = output parameter 
     : "=r" (sum)  // output 
     : "r" (nLoop) // input 
     : "g1", "g2"); // clobbers 
    */ 

    // Classic loop 
    for (j=0; j<nLoop; j++) 
     sum ++; 

    // End time 
    gettimeofday (&tv2, NULL); 

    // Compute runtime for loop 
    diff = (unsigned long long int) ((tv2.tv_sec - tv1.tv_sec) * 1000000 + (tv2.tv_usec - tv1.tv_usec)); 

    // Summing diff time 
    avgSum += diff; 
    stdSum += (diff*diff); 

    // DEBUG 
    printf("diff = %llu\n", diff); 
    printf("avgSum = %llu\n", avgSum); 
    printf("stdSum = %llu\n", stdSum); 
    // Print sum from assembly loop 
    printf("Sum = %llu\n", sum); 

    } 
    // Compute final averageRuntime 
    averageRuntime = avgSum/nRunning; 

    // Compute standard deviation 
    deviationRuntime = sqrt(stdSum/nRunning-averageRuntime*averageRuntime); 

    // Print results 
    printf("(Average Elapsed time, Standard deviation) = %e usec %e usec\n", averageRuntime, deviationRuntime); 

    return 0; 

} 

此代碼段工作正常,即,可變sum打印爲 (見 「printf("Sum = %llu\n", sum)」):

Sum = 5000000000 

所以,問題來自於與Sparc的會議樓的版本

我懷疑,在該彙編代碼,行"mov %1, %%g1\n" // %1 = input parameter要差些存儲nLoop%g1 register(我認爲%g1是一個32位寄存器,因此不能存儲值以上2^32-1)。

然而,在該行的輸出中的參數(變量sum):

"mov %%g2, %0\n" // %0 = output parameter 

高於極限,因爲它是等於50億。

附上與大會環路版本之間並沒有它的Vimdiff可以:

figure

在左邊,程序彙編,就沒事了,不大會(只是一個簡單的循環,而不是

我提醒你我的問題是,對於nLoop> 2^32-1並且使用匯編循環,我會在執行結束時獲得有效的sum參數,但無效(太短)averagestandard deviation次(花費在循環中);這裏是的輸出示例:

sizeof(unsigned long long int) = 8 
sizeof(unsigned long int) = 4 
diff = 17 
avgSum = 17 
stdSum = 289 
Sum = 5000000000 
diff = 4 
avgSum = 21 
stdSum = 305 
Sum = 5000000000 
(Average Elapsed time, Standard deviation) = 1.000000e+01 usec 7.211103e+00 usec 

隨着服用nLoop = 4.0e+9,即nLoop = 4000000000ULL,是沒有問題的,時間值是有效的。

更新2:

我通過生成彙編代碼搜索更深入。與nLoop = 4000000000 (4.0e+9)版本下面是:

.file "loop-WITH-asm-inline-4-Billions.c" 
    .section ".rodata" 
    .align 8 
.LLC1: 
    .asciz "sizeof(unsigned long long int) = %zu\n" 
    .align 8 
.LLC2: 
    .asciz "sizeof(unsigned long int) = %zu\n" 
    .align 8 
.LLC3: 
    .asciz "diff = %llu\n" 
    .align 8 
.LLC4: 
    .asciz "avgSum = %llu\n" 
    .align 8 
.LLC5: 
    .asciz "stdSum = %llu\n" 
    .align 8 
.LLC6: 
    .asciz "Sum = %llu\n" 
    .global __udivdi3 
    .global __cmpdi2 
    .global __floatdidf 
    .align 8 
.LLC7: 
    .asciz "(Average Elapsed time, Standard deviation) = %e usec %e usec\n" 
    .align 8 
.LLC0: 
    .long 0 
    .long 0 
    .section ".text" 
    .align 4 
    .global main 
    .type main, #function 
    .proc 04 
main: 
    save %sp, -248, %sp 
    st %i0, [%fp+68] 
    st %i1, [%fp+72] 
    ld [%fp+72], %g1 
    add %g1, 4, %g1 
    ld [%g1], %g1 
    mov %g1, %o0 
    call atoi, 0 
    nop 
    mov %o0, %g1 
    st %g1, [%fp-68] 
    st %g0, [%fp-64] 
    st %g0, [%fp-60] 
    st %g0, [%fp-56] 
    st %g0, [%fp-52] 
    sethi %hi(.LLC0), %g1 
    or %g1, %lo(.LLC0), %g1 
    ldd [%g1], %f8 
    std %f8, [%fp-48] 
    sethi %hi(.LLC0), %g1 
    or %g1, %lo(.LLC0), %g1 
    ldd [%g1], %f8 
    std %f8, [%fp-40] 
    mov 0, %g2 
    sethi %hi(4000000000), %g3 
    std %g2, [%fp-24] 
    sethi %hi(.LLC1), %g1 
    or %g1, %lo(.LLC1), %o0 
    mov 8, %o1 
    call printf, 0 
    nop 
    sethi %hi(.LLC2), %g1 
    or %g1, %lo(.LLC2), %o0 
    mov 4, %o1 
    call printf, 0 
    nop 
    st %g0, [%fp-84] 
    b .LL2 
    nop 
.LL3: 
    st %g0, [%fp-32] 
    st %g0, [%fp-28] 
    add %fp, -92, %g1 
    mov %g1, %o0 
    mov 0, %o1 
    call gettimeofday, 0 
    nop 
    ldd [%fp-24], %o4 
    clr %g1 
    clr %g2 
    mov %o4, %g1 
loop: 
    add %g2, 1, %g2 
    subcc %g1, 1, %g1 
    bne loop 
    nop 
    mov %g2, %o4 

    std %o4, [%fp-32] 
    add %fp, -100, %g1 
    mov %g1, %o0 
    mov 0, %o1 
    call gettimeofday, 0 
    nop 
    ld [%fp-100], %g2 
    ld [%fp-92], %g1 
    sub %g2, %g1, %g2 
    sethi %hi(999424), %g1 
    or %g1, 576, %g1 
    smul %g2, %g1, %g3 
    ld [%fp-96], %g2 
    ld [%fp-88], %g1 
    sub %g2, %g1, %g1 
    add %g3, %g1, %g1 
    st %g1, [%fp-12] 
    sra %g1, 31, %g1 
    st %g1, [%fp-16] 
    ldd [%fp-64], %o4 
    ldd [%fp-16], %g2 
    addcc %o5, %g3, %g3 
    addx %o4, %g2, %g2 
    std %g2, [%fp-64] 
    ld [%fp-16], %g2 
    ld [%fp-12], %g1 
    smul %g2, %g1, %g4 
    ld [%fp-16], %g2 
    ld [%fp-12], %g1 
    smul %g2, %g1, %g1 
    add %g4, %g1, %g4 
    ld [%fp-12], %g2 
    ld [%fp-12], %g1 
    umul %g2, %g1, %g3 
    rd %y, %g2 
    add %g4, %g2, %g4 
    mov %g4, %g2 
    ldd [%fp-56], %o4 
    addcc %o5, %g3, %g3 
    addx %o4, %g2, %g2 
    std %g2, [%fp-56] 
    sethi %hi(.LLC3), %g1 
    or %g1, %lo(.LLC3), %o0 
    ld [%fp-16], %o1 
    ld [%fp-12], %o2 
    call printf, 0 
    nop 
    sethi %hi(.LLC4), %g1 
    or %g1, %lo(.LLC4), %o0 
    ld [%fp-64], %o1 
    ld [%fp-60], %o2 
    call printf, 0 
    nop 
    sethi %hi(.LLC5), %g1 
    or %g1, %lo(.LLC5), %o0 
    ld [%fp-56], %o1 
    ld [%fp-52], %o2 
    call printf, 0 
    nop 
    sethi %hi(.LLC6), %g1 
    or %g1, %lo(.LLC6), %o0 
    ld [%fp-32], %o1 
    ld [%fp-28], %o2 
    call printf, 0 
    nop 
    ld [%fp-84], %g1 
    add %g1, 1, %g1 
    st %g1, [%fp-84] 
.LL2: 
    ld [%fp-84], %g2 
    ld [%fp-68], %g1 
    cmp %g2, %g1 
    bl .LL3 
    nop 
    ld [%fp-68], %g1 
    sra %g1, 31, %g1 
    ld [%fp-68], %g3 
    mov %g1, %g2 
    ldd [%fp-64], %o0 
    mov %g2, %o2 
    mov %g3, %o3 
    call __udivdi3, 0 
    nop 
    mov %o0, %g2 
    mov %o1, %g3 
    std %g2, [%fp-136] 
    ldd [%fp-136], %o0 
    mov 0, %o2 
    mov 0, %o3 
    call __cmpdi2, 0 
    nop 
    mov %o0, %g1 
    cmp %g1, 1 
    bl .LL6 
    nop 
    ldd [%fp-136], %o0 
    call __floatdidf, 0 
    nop 
    std %f0, [%fp-144] 
    b .LL5 
    nop 
.LL6: 
    ldd [%fp-136], %o4 
    and %o4, 0, %g2 
    and %o5, 1, %g3 
    ld [%fp-136], %o5 
    sll %o5, 31, %g1 
    ld [%fp-132], %g4 
    srl %g4, 1, %o5 
    or %o5, %g1, %o5 
    ld [%fp-136], %g1 
    srl %g1, 1, %o4 
    or %g2, %o4, %g2 
    or %g3, %o5, %g3 
    mov %g2, %o0 
    mov %g3, %o1 
    call __floatdidf, 0 
    nop 
    std %f0, [%fp-144] 
    ldd [%fp-144], %f8 
    ldd [%fp-144], %f10 
    faddd %f8, %f10, %f8 
    std %f8, [%fp-144] 
.LL5: 
    ldd [%fp-144], %f8 
    std %f8, [%fp-48] 
    ld [%fp-68], %g1 
    sra %g1, 31, %g1 
    ld [%fp-68], %g3 
    mov %g1, %g2 
    ldd [%fp-56], %o0 
    mov %g2, %o2 
    mov %g3, %o3 
    call __udivdi3, 0 
    nop 
    mov %o0, %g2 
    mov %o1, %g3 
    std %g2, [%fp-128] 
    ldd [%fp-128], %o0 
    mov 0, %o2 
    mov 0, %o3 
    call __cmpdi2, 0 
    nop 
    mov %o0, %g1 
    cmp %g1, 1 
    bl .LL8 
    nop 
    ldd [%fp-128], %o0 
    call __floatdidf, 0 
    nop 
    std %f0, [%fp-120] 
    b .LL7 
    nop 
.LL8: 
    ldd [%fp-128], %o4 
    and %o4, 0, %g2 
    and %o5, 1, %g3 
    ld [%fp-128], %o5 
    sll %o5, 31, %g1 
    ld [%fp-124], %g4 
    srl %g4, 1, %o5 
    or %o5, %g1, %o5 
    ld [%fp-128], %g1 
    srl %g1, 1, %o4 
    or %g2, %o4, %g2 
    or %g3, %o5, %g3 
    mov %g2, %o0 
    mov %g3, %o1 
    call __floatdidf, 0 
    nop 
    std %f0, [%fp-120] 
    ldd [%fp-120], %f8 
    ldd [%fp-120], %f10 
    faddd %f8, %f10, %f8 
    std %f8, [%fp-120] 
.LL7: 
    ldd [%fp-48], %f8 
    ldd [%fp-48], %f10 
    fmuld %f8, %f10, %f8 
    ldd [%fp-120], %f10 
    fsubd %f10, %f8, %f8 
    std %f8, [%fp-112] 
    ldd [%fp-112], %f8 
    fsqrtd %f8, %f8 
    std %f8, [%fp-152] 
    ldd [%fp-152], %f10 
    ldd [%fp-152], %f8 
    fcmpd %f10, %f8 
    nop 
    fbe .LL9 
    nop 
    ldd [%fp-112], %o0 
    call sqrt, 0 
    nop 
    std %f0, [%fp-152] 
.LL9: 
    ldd [%fp-152], %f8 
    std %f8, [%fp-40] 
    sethi %hi(.LLC7), %g1 
    or %g1, %lo(.LLC7), %o0 
    ld [%fp-48], %o1 
    ld [%fp-44], %o2 
    ld [%fp-40], %o3 
    ld [%fp-36], %o4 
    call printf, 0 
    nop 
    mov 0, %g1 
    mov %g1, %i0 
    restore 
    jmp %o7+8 
    nop 
    .size main, .-main 
    .ident "GCC: (GNU) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)" 
    .section ".note.GNU-stack" 

當我生成彙編語言代碼版本nLoop = 5000000000 (5.0e+9),差異是在如下圖所示(與vimdiff):

vimdiff differences

的「塊4十億「版本:

mov  0, %g2                               
sethi %hi(4000000000), %g3 

被替換爲」5十億「版本由:

mov  1, %g2 
sethi %hi(705032192), %g3             
or  %g3, 512, %g3               

我可以看到,5.0+e9不能在32位進行編碼,由於指令

sethi %hi(705032192), %g3 

矛盾的是,當我編譯版本「5個十億」彙編代碼,所述輸出中參數sum計算得很好,即等於5 Billions,我無法解釋它。

歡迎任何幫助或評論,謝謝。

+3

您似乎在彙編代碼中訪問'sum',它是'unsigned long long'。當然,你必須調整你的asm代碼來匹配參數的大小和類型。您是否嘗試使用C代碼並讓編譯器工作?如果編譯器支持8個字節的整數值,它可以創建代碼來操作它們。 – Gerhardh

+0

@ Gerhardh-如果你看看printf的輸出結果,你可以看到'sum'的計算結果很好(第一個例子爲4.0e + 9,第二個爲5.0e + 9)。在這兩種情況下,'sum'被聲明爲'unsigned long long int'。我不明白爲什麼在裝配輸入參數中使用'nLoop> 2^32'的情況並非如此? – youpilat13

+1

您是否使用匯編代碼進行64位計算?您可能不需要 –

回答

0

很大程度上取決於什麼版本的sparc和你正在使用的ABI。如果您使用的是sparc v8或更早版本,則只有32位寄存器的32位模式。在這種情況下,當您嘗試將5000000000加載到32位寄存器中時,它會失敗並代之以加載5000000000 mod 2 (即705032704)。這似乎正在發生。

另一方面,如果你有一個以32位模式運行的64位sparc處理器(通常稱爲v8plus),那麼你可以使用64位寄存器,所以這是可行的。

+0

- @ Chris Dodd我從QEMU的映像中使用Debian Sparc 32位蝕刻(使用gcc 4.1.2)。 (該圖片可在https://people.debian.org/~aurel32/qemu/sparc/上找到)。根據你的說法,我必須創建一個Debian Sparc 64位而不是32位的QEMU映像。我在https://people.debian.org/~glaubitz/debian-cd/2017-03-24/上找到了Debian-9 Sparc 64位的iso映像。如何在MacOS 10.9.5下使用此ISO創建QEMU映像:我嘗試過:qemu-system-sparc -hda debian_stretch_sparc64.img debian-9.0-sparc64-NETINST-1.iso -boot d但是出現錯誤不生成「.img」文件 – youpilat13

+0

我也試過這樣做:qemu-system-sparc -hda debian_stretch_sparc64.img -cdrom debian-9.0-sparc64-NETINST-1.iso -boot d。當沒有光驅時如何避免「-cdrom」標誌? – youpilat13

0

你似乎對64位值

的一半是做32位操作從生成的代碼,這裏的地方nLoop雙負載到兩個%o4%o5(因爲它是64位long long值):

ldd [%fp-24], %o4 
    clr %g1 
    clr %g2 

然後你只是%o4工作:

mov %o4, %g1    ; <---- what about %o5???? 
loop: 
    add %g2, 1, %g2 
    subcc %g1, 1, %g1 
    bne loop 
    nop 
    mov %g2, %o4 

要使這項工作重新編寫彙編代碼,將%o4 + %o5作爲一個64位值處理。