Resolving bank conflicts: trying to coalesce GMEM accesses through SMEM
I have this code:
struct __declspec(align(32)) Circle
{
    float x, y;
    float prevX, prevY;
    float speedX, speedY;
    float mass;
    float radius;
    void init(const int _x, const int _y, const float _speedX = 0.0f, const float _speedY = 0.0f,
              const float _radius = CIRCLE_RADIUS_DEFAULT,
              const float _mass = CIRCLE_MASS_DEFAULT);
};
and the second piece:
/*smem[threadIdx.x] = *(((float*)cOut) + threadIdx.x);
smem[threadIdx.x + blockDim.x] = *(((float*)cOut) + threadIdx.x + blockDim.x);
smem[threadIdx.x + blockDim.x * 2] = *(((float*)cOut) + threadIdx.x + blockDim.x * 2);
smem[threadIdx.x + blockDim.x * 3] = *(((float*)cOut) + threadIdx.x + blockDim.x * 3);
smem[threadIdx.x + blockDim.x * 4] = *(((float*)cOut) + threadIdx.x + blockDim.x * 4);
smem[threadIdx.x + blockDim.x * 5] = *(((float*)cOut) + threadIdx.x + blockDim.x * 5);
smem[threadIdx.x + blockDim.x * 6] = *(((float*)cOut) + threadIdx.x + blockDim.x * 6);
smem[threadIdx.x + blockDim.x * 7] = *(((float*)cOut) + threadIdx.x + blockDim.x * 7);*/
__syncthreads();
/*float x, y;
float prevX, prevY;
float speedX, speedY;
float mass;
float radius;*/
/*c.x = smem[threadIdx.x];
c.y = smem[threadIdx.x + blockDim.x]; //there must be [threadIdx.x * 8 + 0]
c.prevX = smem[threadIdx.x + blockDim.x * 2]; //[threadIdx.x * 8 + 1] etc.
c.prevY = smem[threadIdx.x + blockDim.x * 3];
c.speedX = smem[threadIdx.x + blockDim.x * 4];
c.speedY = smem[threadIdx.x + blockDim.x * 5];
c.mass = smem[threadIdx.x + blockDim.x * 6];
c.radius = smem[threadIdx.x + blockDim.x * 7];*/
c = cOut[j];
//c = *((Circle*)(smem + threadIdx * SMEM));
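There are 2 GMEM (I mean global memory) accesses: 1) reading a circle and checking it for collisions; 2) writing the circle back after changing its speed and position. I also have circleConst, an array of Circle that lives in constant memory (set up with cudaMemcpyToSymbol()). It is used to check the other circles for intersection with the main circle c, which was read from GMEM and is held in registers.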
At first I thought that, since I was using constant memory well, I was already getting all the performance I could :') (I was wrong).
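When I read about coalesced access to GMEM (is there coalesced access for other types of memory? I didn't find any information about it), I wanted to try it myself. As you can see, the Circle structure has 8 variables of type float (32 bits each). I tried to do this (the commented-out part of the code), but, first, I got a wrong answer (because I must be reading from the wrong place in SMEM, as mentioned below), and second, my performance dropped by 33%. Why? I don't think the slowdown depends on the wrong field mapping.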
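To make the coalescing argument concrete, here is a rough host-side C++ sketch (not CUDA, and not taken from my project) that counts how many aligned memory segments one warp touches in a single load instruction. The helper name segmentsTouched, the 128-byte segment size, and the 32-thread warp are assumptions for illustration; real hardware and its caches may behave differently.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper: number of distinct aligned segments of `segmentBytes`
// touched when each of `warpSize` threads loads one 4-byte float located at
// byte offset (t * strideBytes) from a segment-aligned base address.
std::size_t segmentsTouched(std::size_t strideBytes,
                            std::size_t warpSize = 32,
                            std::size_t segmentBytes = 128) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < warpSize; ++t) {
        // Segment index that holds the float read by thread t.
        segments.insert((t * strideBytes) / segmentBytes);
    }
    return segments.size();
}
```

Under these assumptions, a stride of sizeof(Circle) = 32 bytes (each thread reading one field of its own circle) spreads the warp over 8 segments, while a stride of 4 bytes (the flat float-by-float copy in the commented-out code) stays inside 1 segment.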
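And the second problem: as I wrote in the comments next to the code that copies from smem to c, I have to read it a different way, but if I do that there will be a lot of bank conflicts, so I will lose performance... So how can I load the circles coalesced, without bank conflicts, and then write them back?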
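The bank-conflict concern can be checked on the host with a small C++ sketch. It assumes 32 shared-memory banks of 4-byte words and a 32-thread warp (older devices use 16 banks per half-warp, where the same reasoning applies); maxBankConflict is a hypothetical helper name, not part of any CUDA API.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical helper: worst-case number of threads of one warp that hit the
// same shared-memory bank when thread t reads word (t * strideWords + field).
// Assumes 32 banks of 4-byte words and a 32-thread warp.
std::size_t maxBankConflict(std::size_t strideWords, std::size_t field = 0) {
    constexpr std::size_t kBanks = 32;
    constexpr std::size_t kWarp = 32;
    std::array<std::size_t, kBanks> hits{};  // accesses per bank, zero-initialized
    for (std::size_t t = 0; t < kWarp; ++t) {
        ++hits[(t * strideWords + field) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With a stride of 8 words (smem[threadIdx.x * 8 + field]) the sketch reports an 8-way conflict, because only banks 0, 8, 16 and 24 (shifted by the field offset) are ever used. Padding each circle to 9 words makes the stride coprime with the bank count, so every thread of the warp lands in a different bank.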
P.S. Are structures larger than 4 * float kept in registers?
Update: the latest version is:
#define CF (9) //9 because the primary struct has 8 floats, so 1 is for wasting
i = blockIdx.x * blockDim.x;
smem[threadIdx.x + blockDim.x * 0 + blockDim.x * 0/(CF - 1) + threadIdx.x/(CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 0);
smem[threadIdx.x + blockDim.x * 1 + blockDim.x * 1/(CF - 1) + threadIdx.x/(CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 1);
smem[threadIdx.x + blockDim.x * 2 + blockDim.x * 2/(CF - 1) + threadIdx.x/(CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 2);
smem[threadIdx.x + blockDim.x * 3 + blockDim.x * 3/(CF - 1) + threadIdx.x/(CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 3);
smem[threadIdx.x + blockDim.x * 4 + blockDim.x * 4/(CF - 1) + threadIdx.x/(CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 4);
smem[threadIdx.x + blockDim.x * 5 + blockDim.x * 5/(CF - 1) + threadIdx.x/(CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 5);
smem[threadIdx.x + blockDim.x * 6 + blockDim.x * 6/(CF - 1) + threadIdx.x/(CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 6);
smem[threadIdx.x + blockDim.x * 7 + blockDim.x * 7/(CF - 1) + threadIdx.x/(CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 7);
c.x = smem[threadIdx.x * CF + 0];
c.y = smem[threadIdx.x * CF + 1];
c.prevX = smem[threadIdx.x * CF + 2];
c.prevY = smem[threadIdx.x * CF + 3];
c.speedX = smem[threadIdx.x * CF + 4];
c.speedY = smem[threadIdx.x * CF + 5];
c.mass = smem[threadIdx.x * CF + 6];
c.radius = smem[threadIdx.x * CF + 7];
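Is this the right way to do coalesced GMEM access through SMEM? I mean, I'm worried about blockDim.x * 1/(CF - 1) + threadIdx.x/(CF - 1). I guess I didn't get any speedup because this doesn't let a coalesced GMEM read cover more than one circle, but I don't understand how to make it coalesce two circles...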
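As a sanity check of the padded index, the following host-side C++ sketch compares the expression from the code above with the index the reads expect. The function names are hypothetical; the only inputs taken from the code are CF = 9 and the two index expressions.

```cpp
#include <cstddef>

constexpr std::size_t CF = 9;  // padded stride from the code above

// Index written by the staging code for thread t and row k (same operator
// precedence as the original expression: (blockDim * k) / (CF - 1)).
std::size_t questionIndex(std::size_t t, std::size_t k, std::size_t blockDim) {
    return t + blockDim * k + blockDim * k / (CF - 1) + t / (CF - 1);
}

// Index the reads expect: float g of the block belongs to circle g / 8,
// field g % 8, which is stored at (g / 8) * CF + g % 8.
std::size_t expectedIndex(std::size_t t, std::size_t k, std::size_t blockDim) {
    const std::size_t g = t + blockDim * k;
    return (g / (CF - 1)) * CF + g % (CF - 1);
}

// True when both formulas agree for every thread and all 8 rows.
bool formulasAgree(std::size_t blockDim) {
    for (std::size_t k = 0; k < CF - 1; ++k)
        for (std::size_t t = 0; t < blockDim; ++t)
            if (questionIndex(t, k, blockDim) != expectedIndex(t, k, blockDim))
                return false;
    return true;
}
```

The two integer divisions only add up to (threadIdx.x + blockDim.x * k) / 8 when blockDim.x * k is a multiple of 8, so in this sketch the formulas agree for block sizes such as 64, 128 or 256 but not for a block size like 60.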
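Yes, you understood me correctly. I tried using a dummy value, as you suggested. I got the correct answer (correct rendering), but I didn't get any speedup or slowdown. Is that because sizeof(Circle) == 32 bytes? (Or did I do something wrong? I'll add the new version to the question.) I mean, the size of a coalesced global memory access (as I read in an article) is 64 bytes for float (my case), 128 bytes for float2, and 256 bytes for float3 and float4? I have read about the SoA vs AoS pattern before, but in my project I can't rework it to use SoA (for several reasons). – Nexen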