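Write a C macro for use in a CUDA kernel

I have the following structure in my code and have already used it many times. To improve readability and reduce the line count, I would really like to replace it with a macro. The part I want to write a macro for is as follows:

#define _UNROLL_FACTOR_volIntGrad 32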
int jj = 0;
for (; jj < (ngbSize - 32); jj += 32) {
    int j = offset + jj;
    #pragma unroll
    for (int k = 0; k < 32; k++) {
        ...
        arbitrary calculation 1 (depends on k)
        ...
    }
    ...
    arbitrary calculation 2
    ...
}
for (; jj < (ngbSize - (_UNROLL_FACTOR_volIntGrad/2)); jj += (_UNROLL_FACTOR_volIntGrad/2)) {
    int j = offset + jj;
    #pragma unroll
    for (int k = 0; k < 16; k++) {
        ...
        arbitrary calculation 1 (depends on k)
        ...
    }
    ...
    arbitrary calculation 2
    ...
}
for (; jj < (ngbSize - (_UNROLL_FACTOR_volIntGrad/4)); jj += (_UNROLL_FACTOR_volIntGrad/4)) {
    int j = offset + jj;
    #pragma unroll
    for (int k = 0; k < 8; k++) {
        ...
        arbitrary calculation 1 (depends on k)
        ...
    }
    ...
    arbitrary calculation 2
    ...
}
for (; jj < (ngbSize - (_UNROLL_FACTOR_volIntGrad/8)); jj += (_UNROLL_FACTOR_volIntGrad/8)) {
    int j = offset + jj;
    #pragma unroll
    for (int k = 0; k < 4; k++) {
        ...
        arbitrary calculation 1 (depends on k)
        ...
    }
    ...
    arbitrary calculation 2
    ...
}
for (; jj < (ngbSize - (_UNROLL_FACTOR_volIntGrad/16)); jj += (_UNROLL_FACTOR_volIntGrad/16)) {
    int j = offset + jj;
    #pragma unroll
    for (int k = 0; k < 2; k++) {
        ...
        arbitrary calculation 1 (depends on k)
        ...
    }
    ...
    arbitrary calculation 2
    ...
}
for (; jj < ngbSize; jj++) {
    int j = offset + jj;
    ...
    arbitrary calculation 3
    ...
}
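
By "arbitrary calculation X" I mean a set of calculations that is independent of the macro and differs from function to function. Does anyone know how to write this macro so as to reduce the size of the structure above? For example, something like the following:

__MACRO
    arbitrary calculation 1
    arbitrary calculation 2
    arbitrary calculation 3
__END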
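
For illustration only, here is a minimal sketch of one way such a macro could be shaped. It is not a tested CUDA solution. The names `UNROLL_HINT`, `UNROLL_STAGE`, `UNROLLED_NGB_LOOP` and `sum_neighbours` are hypothetical, and the three "arbitrary calculations" are replaced by a simple array sum so that the sketch can be compiled as plain host C++. It assumes that nvcc accepts `_Pragma("unroll")` inside a macro expansion, and that the calculations passed in contain no top-level commas.

```cpp
#include <cassert>

// Hypothetical helper: emit the unroll hint only under nvcc so the sketch
// also builds as ordinary host C++ (assumption: _Pragma("unroll") is accepted).
#ifdef __CUDACC__
#define UNROLL_HINT _Pragma("unroll")
#else
#define UNROLL_HINT
#endif

// One stage of the cascade: handle blocks of N neighbours while jj < ngbSize - N.
// CALC1 runs once per k in the fixed-size inner loop, CALC2 once per block.
// Both may refer to the local variables j and k.
#define UNROLL_STAGE(N, jj, ngbSize, offset, CALC1, CALC2)   \
    for (; (jj) < (ngbSize) - (N); (jj) += (N)) {            \
        int j = (offset) + (jj);                             \
        UNROLL_HINT                                          \
        for (int k = 0; k < (N); k++) { CALC1; }             \
        CALC2;                                               \
    }

// Full cascade F, F/2, F/4, F/8, F/16, then a scalar tail loop running CALC3.
#define UNROLLED_NGB_LOOP(F, ngbSize, offset, CALC1, CALC2, CALC3)    \
    do {                                                              \
        int jj = 0;                                                   \
        UNROLL_STAGE((F),      jj, ngbSize, offset, CALC1, CALC2)     \
        UNROLL_STAGE((F) / 2,  jj, ngbSize, offset, CALC1, CALC2)     \
        UNROLL_STAGE((F) / 4,  jj, ngbSize, offset, CALC1, CALC2)     \
        UNROLL_STAGE((F) / 8,  jj, ngbSize, offset, CALC1, CALC2)     \
        UNROLL_STAGE((F) / 16, jj, ngbSize, offset, CALC1, CALC2)     \
        for (; jj < (ngbSize); jj++) {                                \
            int j = (offset) + jj;                                    \
            CALC3;                                                    \
        }                                                             \
    } while (0)

// Stand-in usage: sums data[offset .. offset + ngbSize - 1].
int sum_neighbours(const int *data, int offset, int ngbSize) {
    assert(ngbSize >= 0);
    int acc = 0;
    UNROLLED_NGB_LOOP(32, ngbSize, offset,
                      acc += data[j + k],   /* calculation 1, depends on k */
                      (void)0,              /* calculation 2, nothing here */
                      acc += data[j]);      /* calculation 3, scalar tail  */
    return acc;
}
```

Because the stage size reaches the inner loop as a literal constant, the trip count stays visible to the compiler at each stage. The cost is the usual macro fragility: a calculation containing an unparenthesised comma would be split into several macro arguments.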

Consider other options before macros – wasthishelpful
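I trust you have done a fair bit of testing to establish that this manual unrolling (with the corresponding increase in complexity and loss of readability compared with writing a function) yields a statistically significant and worthwhile performance benefit? – EOF
@EOF Exactly right! This part of the GPU kernel really does need the compiler to know the loop size so that it can unroll the loop and improve performance. But readability suffers :-(. – Siamak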
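
As a rough sketch of a function-based route that still gives the compiler a known trip count, the stage size can be passed as a template parameter. Everything named here (`HOST_DEVICE`, `UNROLL_HINT`, `unrolled_stage`, `unrolled_ngb_loop`, `sum_neighbours_tmpl`) is hypothetical, the calculations are stand-ins, and the code has only been reasoned through as host C++. Whether nvcc produces the same code as the hand-written version would need to be measured.

```cpp
#include <cassert>

// Hypothetical portability shims so the sketch builds with or without nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#define UNROLL_HINT _Pragma("unroll")
#else
#define HOST_DEVICE
#define UNROLL_HINT
#endif

// One stage: N is a compile-time constant, so the inner loop has a known trip count.
// per_k(j, k) plays the role of calculation 1, per_block(j) of calculation 2.
template <int N, typename PerK, typename PerBlock>
HOST_DEVICE inline void unrolled_stage(int &jj, int ngbSize, int offset,
                                       PerK per_k, PerBlock per_block) {
    for (; jj < ngbSize - N; jj += N) {
        const int j = offset + jj;
        UNROLL_HINT
        for (int k = 0; k < N; k++) per_k(j, k);
        per_block(j);
    }
}

// Cascade F, F/2, F/4, F/8, F/16, then a scalar tail calling tail(j) (calculation 3).
template <int F, typename PerK, typename PerBlock, typename Tail>
HOST_DEVICE inline void unrolled_ngb_loop(int ngbSize, int offset,
                                          PerK per_k, PerBlock per_block, Tail tail) {
    int jj = 0;
    unrolled_stage<F>(jj, ngbSize, offset, per_k, per_block);
    unrolled_stage<F / 2>(jj, ngbSize, offset, per_k, per_block);
    unrolled_stage<F / 4>(jj, ngbSize, offset, per_k, per_block);
    unrolled_stage<F / 8>(jj, ngbSize, offset, per_k, per_block);
    unrolled_stage<F / 16>(jj, ngbSize, offset, per_k, per_block);
    for (; jj < ngbSize; jj++) tail(offset + jj);
}

// Stand-in usage: sums data[offset .. offset + ngbSize - 1].
int sum_neighbours_tmpl(const int *data, int offset, int ngbSize) {
    assert(ngbSize >= 0);
    int acc = 0;
    unrolled_ngb_loop<32>(ngbSize, offset,
        [&](int j, int k) { acc += data[j + k]; },  // calculation 1
        [](int) {},                                 // calculation 2 (empty here)
        [&](int j) { acc += data[j]; });            // calculation 3
    return acc;
}
```

The loop structure is written once and each call site supplies only its three calculations, which is close to the `__MACRO ... __END` shape asked for, without the preprocessor.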