2013-10-26 52 views
1

Fastest way to pad an image array

I have an image stored as a 1D array:

a = {1,2,3,4,5,6,7,8,9} 

What is the fastest way to pad this array so that it is surrounded by zeros, like this:

0 0 0 0 0 
0 1 2 3 0 
0 4 5 6 0 
0 7 8 9 0 
0 0 0 0 0 

I have already declared b (the padded array):

float *b = calloc(((data_size_X + 2)*(data_size_Y +2)), sizeof(float)); 
+4

Do you have reason to believe that a simple set of for loops won't be fast enough here? – templatetypedef

+0

But the image array can be as large as a megapixel... – Kiddo

+1

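Have you tried the naive version, profiled it, and found it to be too slow? Given today's processor speeds and memory sizes, a megapixel is not very large. Unless you are doing this in a tight loop, I would be surprised if it were prohibitively slow. – templatetypedef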
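
For reference, the naive version discussed in these comments could be sketched as below. The function name `pad_naive` and its parameters are hypothetical, not from the thread. The sketch assumes `dst` already holds `(width + 2) * (height + 2)` zeroed floats, as the `calloc` call in the question would provide.

```c
#include <stddef.h>

/* Hypothetical helper: copies src (width x height) into the interior of dst.
 * dst must already be zeroed and hold (width + 2) * (height + 2) floats. */
void pad_naive(float *dst, const float *src, size_t width, size_t height)
{
    size_t padded_width = width + 2;
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            /* Shift by one row and one column to leave the zero border. */
            dst[(y + 1) * padded_width + (x + 1)] = src[y * width + x];
        }
    }
}
```

For the 3x3 example in the question, element 1 lands at index 6 of the 5x5 buffer and element 9 at index 18.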

Answers

2

Here are some benchmarks. My hunch was right: using memcpy is significantly faster than the alternatives:

#include <stdio.h> 
#include <string.h> 
#include <stdlib.h> 
#include <time.h> 

int main(void) { 
    char* original; 
    char* padded; 
    long int n, m, ii, jj, kk; 
    clock_t startT, stopT; // clock() returns clock_t, not time_t

    char *p1, *o1; // point to first element in row for padded, original 

    // pick a reasonably sized image: 
    n = 3000; 
    m = 2000; 

    // allocate memory: 
    original = malloc(m * n * sizeof(char)); 
    padded = calloc((m+2)*(n+2), sizeof(char)); 

    // put some random values in it: 
    for(ii = 0; ii < n*m; ii++) {
        original[ii] = rand()%256;
    }

    // first attempt: completely naive loop 
    startT = clock(); 
    for(kk = 0; kk < 100; kk++) {
        for(ii = 0; ii < m; ii++) {
            for(jj = 0; jj < n; jj++) {
                padded[(ii + 1) * (n + 2) + jj + 1] = original[ii * n + jj];
            }
        }
    }
    stopT = clock(); 
    printf("100 loops of 'really slow' took %.3f ms\n", (stopT - startT) * 1000.0/CLOCKS_PER_SEC); 

    // second attempt - pre-compute the index offset 
    startT = clock(); 
    for(kk = 0; kk < 100; kk++) {
        for(ii = 0; ii < m; ii++) {
            p1 = padded + (ii + 1) * (n + 2) + 1;
            o1 = original + ii * n;
            for(jj = 0; jj < n; jj++) {
                p1[jj] = o1[jj];
            }
        }
    }
    stopT = clock(); 
    printf("100 loops of 'not so fast' took %.3f ms\n", (stopT - startT) * 1000.0/CLOCKS_PER_SEC); 

    // third attempt: use memcpy to speed up the process  
    startT = clock(); 
    for(kk = 0; kk < 100; kk++) {
        for(ii = 0; ii < m; ii++) {
            p1 = padded + (ii + 1) * (n + 2) + 1;
            o1 = original + ii * n;
            memcpy(p1, o1, n);
        }
    }
    stopT = clock(); 
    printf("100 loops of 'fast' took %.3f ms\n", (stopT - startT) * 1000.0/CLOCKS_PER_SEC); 

    free(original); 
    free(padded); 
    return 0; 
} 

Here is the output:

100 loops of 'really slow' took 3020.585 ms 
100 loops of 'not so fast' took 3725.056 ms 
100 loops of 'fast' took 332.298 ms 

When I turn on compiler optimization with -O3, the timings change as follows:

100 loops of 'really slow' took 2727.442 ms 
100 loops of 'not so fast' took 488.244 ms 
100 loops of 'fast' took 326.998 ms 

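Evidently the compiler "noticed" the cleaner copy loop in the second attempt and optimized it, but it still does not do as well as memcpy. The memcpy version barely changes, since there is very little left in it to optimize.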
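
Applied to the float array from the question, the row-by-row memcpy approach might look like the sketch below. The function name `pad_with_zeros` and its parameters are made up for illustration. It relies on `calloc` zero-filling the buffer, which reads as 0.0f on the usual IEEE 754 platforms. Unlike the `char` benchmark above, the byte count passed to `memcpy` has to be scaled by the element size.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the thread): returns a newly allocated
 * (width + 2) x (height + 2) image with a one-pixel zero border around src,
 * or NULL if allocation fails. The caller frees the result. */
float *pad_with_zeros(const float *src, size_t width, size_t height)
{
    size_t padded_width = width + 2;
    /* calloc zero-fills the buffer, so the border needs no further work. */
    float *dst = calloc(padded_width * (height + 2), sizeof *dst);
    if (dst == NULL) {
        return NULL;
    }
    for (size_t row = 0; row < height; row++) {
        /* Skip one border row and one border column in the destination. */
        memcpy(dst + (row + 1) * padded_width + 1,
               src + row * width,
               width * sizeof *src);
    }
    return dst;
}
```

Calling `pad_with_zeros(a, 3, 3)` on the 3x3 example should give the 5x5 layout shown in the question. Whether it beats a plain loop on a given compiler is something to measure rather than assume.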

+0

Thanks, memcpy really does work well :) – Kiddo

+0

When copying the array I also implemented blocking (copying in block-sized chunks), which was a bit faster – Kiddo

0

If you have already allocated b as you describe, the following will most likely be faster than nested for loops:

int aIndex;
int maxA = data_size_X * data_size_Y;
float *pb = b + data_size_X + 3; /* first interior element: row 1, column 1 */
memset(b, 0, (data_size_X + 2) * (data_size_Y + 2) * sizeof(float));
for (aIndex = 0; aIndex < maxA; aIndex += data_size_X) {
    /* memcpy counts bytes, so scale the row length by the element size */
    memcpy(pb, a + aIndex, data_size_X * sizeof(float));
    pb += (data_size_X + 2);
}
+0

Yes, it is faster. See my benchmark. – Floris

+0

@Floris Wow. And you are a faster coder than I am! Nice answer, upvoted. – Turix

+0

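Just to clarify, doing the memset is not as fast as just using calloc, right? Because when we call calloc, we also zero the whole array, and the array is placed locally. – Kiddo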
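
On the calloc question: `calloc` returns memory with all bits set to zero, so a memset right after a fresh `calloc` repeats work that has already been done. The memset only earns its keep when the same buffer is reused for another image. A minimal sketch of that split follows, with the hypothetical names `alloc_padded` and `reset_padded`. Whether `calloc` is actually faster than allocating and then calling memset depends on the allocator and the OS, so it would need measuring.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocates a zeroed (width + 2) x (height + 2) buffer.
 * Returns NULL on failure. No memset is needed after this call. */
float *alloc_padded(size_t width, size_t height)
{
    return calloc((width + 2) * (height + 2), sizeof(float));
}

/* Hypothetical helper: clears a buffer that is being reused, so stale
 * pixels from a previous image do not leak into the border or interior. */
void reset_padded(float *b, size_t width, size_t height)
{
    memset(b, 0, (width + 2) * (height + 2) * sizeof(float));
}
```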