I have an image stored in a 1D array and want to pad it as fast as possible. Say the image array is:
a = {1,2,3,4,5,6,7,8,9}
What is the fastest way to pad the array so that it is surrounded by zeros, like this:
0 0 0 0 0
0 1 2 3 0
0 4 5 6 0
0 7 8 9 0
0 0 0 0 0
I have already declared the array b (this is the padded array):
float *b = calloc(((data_size_X + 2)*(data_size_Y +2)), sizeof(float));
Here are some benchmarks. My hunch was correct: using memcpy is significantly faster than the alternatives:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    char* original;
    char* padded;
    long int n, m, ii, jj, kk;
    clock_t startT, stopT;  // clock() returns clock_t, not time_t
    char *p1, *o1; // point to first element in row for padded, original

    // pick a reasonably sized image:
    n = 3000;
    m = 2000;

    // allocate memory (calloc zero-fills, so the border is already zero):
    original = malloc(m * n * sizeof(char));
    padded = calloc((m+2)*(n+2), sizeof(char));
    if (original == NULL || padded == NULL) {
        return 1;
    }

    // put some random values in it:
    for(ii = 0; ii < n*m; ii++) {
        original[ii] = rand()%256;
    }

    // first attempt: completely naive loop
    startT = clock();
    for(kk = 0; kk < 100; kk++) {
        for(ii = 0; ii < m; ii++) {
            for(jj = 0; jj < n; jj++) {
                padded[(ii + 1) * (n + 2) + jj + 1] = original[ ii * n + jj];
            }
        }
    }
    stopT = clock();
    printf("100 loops of 'really slow' took %.3f ms\n", (stopT - startT) * 1000.0/CLOCKS_PER_SEC);

    // second attempt - pre-compute the index offset
    startT = clock();
    for(kk = 0; kk < 100; kk++) {
        for(ii = 0; ii < m; ii++) {
            p1 = padded + (ii + 1) * (n + 2) + 1;
            o1 = original + ii * n;
            for(jj = 0; jj < n; jj++) {
                p1[jj] = o1[jj];
            }
        }
    }
    stopT = clock();
    printf("100 loops of 'not so fast' took %.3f ms\n", (stopT - startT) * 1000.0/CLOCKS_PER_SEC);

    // third attempt: use memcpy to speed up the process
    startT = clock();
    for(kk = 0; kk < 100; kk++) {
        for(ii = 0; ii < m; ii++) {
            p1 = padded + (ii + 1) * (n + 2) + 1;
            o1 = original + ii * n;
            memcpy(p1, o1, n);  // elements are char, so n elements == n bytes
        }
    }
    stopT = clock();
    printf("100 loops of 'fast' took %.3f ms\n", (stopT - startT) * 1000.0/CLOCKS_PER_SEC);

    free(original);
    free(padded);
    return 0;
}
Here is the output:
100 loops of 'really slow' took 3020.585 ms
100 loops of 'not so fast' took 3725.056 ms
100 loops of 'fast' took 332.298 ms
When I turn on compiler optimization with -O3, the timings change as follows:
100 loops of 'really slow' took 2727.442 ms
100 loops of 'not so fast' took 488.244 ms
100 loops of 'fast' took 326.998 ms
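Clearly, the compiler "noticed" the cleaner copy loop and tried to optimize it, but it still does not do as well as memcpy. And there is hardly anything left to optimize in the memcpy version.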
If you have already allocated b as you describe, the following will very likely be faster than nested for loops:
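int aIndex;
int maxA = data_size_X * data_size_Y;
float *pb = b + data_size_X + 3;  // first interior element: row 1, column 1
// redundant right after calloc, but needed if b is being reused
memset(b, 0, (data_size_X + 2) * (data_size_Y + 2) * sizeof(float));
for (aIndex = 0; aIndex < maxA; aIndex += data_size_X) {
    // memcpy counts bytes, so scale the row length by sizeof(float)
    memcpy(pb, a + aIndex, data_size_X * sizeof(float));
    pb += (data_size_X + 2);
}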
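For reference, the row-by-row memcpy idea can be wrapped in a small standalone function. This is only a sketch: the name pad_with_zeros and its parameters are hypothetical (they do not appear in the question), and it assumes a platform where the all-zero bytes written by calloc represent 0.0f, which holds for IEEE 754 floats.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the original post): returns a newly
   allocated (width + 2) x (height + 2) image holding src surrounded by a
   one-pixel border of zeros, or NULL if the allocation fails.
   The caller is responsible for calling free() on the result. */
float *pad_with_zeros(const float *src, size_t width, size_t height)
{
    size_t padded_width = width + 2;

    /* calloc zero-fills the buffer, so the border needs no extra work
       (assumes all-zero bytes mean 0.0f, as on IEEE 754 platforms). */
    float *dst = calloc(padded_width * (height + 2), sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Start at row 1, column 1 of the padded image. */
    float *row = dst + padded_width + 1;
    for (size_t y = 0; y < height; y++) {
        /* memcpy takes a byte count, hence the sizeof factor. */
        memcpy(row, src + y * width, width * sizeof *src);
        row += padded_width;
    }
    return dst;
}
```

Calling pad_with_zeros(a, 3, 3) on the 3x3 example from the question should produce the 5x5 layout shown there, with 1 landing at index 6 and 9 at index 18 of the padded array.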
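Do you have reason to believe that a simple set of for loops won't be fast enough here? – templatetypedef
But the image array can be as large as megapixels... – Kiddo
Have you tried the naive version, profiled it, and found that it is too slow? Given today's processor speeds and memory, a megapixel is not very large. Unless you are doing this in a tight loop, I would be surprised if it were unbearably slow. – templatetypedef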
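A naive baseline to profile first, as suggested in the comment above, might look like the following. It is a sketch under assumptions: the function name pad_naive and its parameters are hypothetical, and dst is assumed to be a zero-initialized buffer of (width + 2) * (height + 2) floats, such as the b allocated with calloc in the question.

```c
#include <stddef.h>

/* Hypothetical naive version (not from the original post): copies src
   into the interior of dst one element at a time. dst must already be
   zeroed and hold (width + 2) * (height + 2) floats. */
void pad_naive(float *dst, const float *src, size_t width, size_t height)
{
    size_t padded_width = width + 2;
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            /* Shift by one row and one column to skip the zero border. */
            dst[(y + 1) * padded_width + (x + 1)] = src[y * width + x];
        }
    }
}
```

Timing this on a representative image size with clock() before reaching for memcpy shows whether the padding step matters at all in the overall run time.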