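OpenCL strange kernel behavior

I'm still new to OpenCL. I ran some tests with Nvidia's example; the whole program consists of 5 kernels, which are executed in order (1, 2, 3, 4, 5).
The first kernel takes plain position and velocity data, applies gravity and basic collision detection, then updates the position and velocity. This kernel works perfectly, with no problems.
Here is the first kernel: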
    __kernel void integrate(
        __global float4 *d_Pos, //input/output
        __global float4 *d_Vel, //input/output
        __constant simParams_t *params,
        float deltaTime,
        uint numParticles
    ){
        const uint index = get_global_id(0);
        if(index >= numParticles)
            return;

        float4 pos = d_Pos[index];
        float4 vel = d_Vel[index];
        pos.w = 1.0f;
        vel.w = 0.0f;

        //Gravity
        vel += (float4)(params->gravity.x, params->gravity.y, params->gravity.z, 0) * deltaTime;
        vel *= params->globalDamping;

        //Advance pos
        pos += vel * deltaTime;

        //Collide with cube
        if(pos.x < -1.0f + params->particleRadius){
            pos.x = -1.0f + params->particleRadius;
            vel.x *= params->boundaryDamping;
        }
        if(pos.x > 1.0f - params->particleRadius){
            pos.x = 1.0f - params->particleRadius;
            vel.x *= params->boundaryDamping;
        }
        if(pos.y < -1.0f + params->particleRadius){
            pos.y = -1.0f + params->particleRadius;
            vel.y *= params->boundaryDamping;
        }
        if(pos.y > 1.0f - params->particleRadius){
            pos.y = 1.0f - params->particleRadius;
            vel.y *= params->boundaryDamping;
        }
        if(pos.z < -1.0f + params->particleRadius){
            pos.z = -1.0f + params->particleRadius;
            vel.z *= params->boundaryDamping;
        }
        if(pos.z > 1.0f - params->particleRadius){
            pos.z = 1.0f - params->particleRadius;
            vel.z *= params->boundaryDamping;
        }

        //Store new position and velocity
        d_Pos[index] = pos;
        d_Vel[index] = vel;
    }
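To check what kernel 1 is expected to write, a host-side reference in plain C can be handy. The following is only a sketch: `vec4`, `ref_params`, `collide_axis` and `integrate_ref` are hypothetical names, and `ref_params` mirrors just the fields the kernel reads from `simParams_t` (the real struct layout is not shown in this post).

```c
#include <stddef.h>

/* Hypothetical host-side stand-in for float4. */
typedef struct { float x, y, z, w; } vec4;

/* Hypothetical mirror of the simParams_t fields used by the kernel. */
typedef struct {
    vec4  gravity;
    float globalDamping;
    float particleRadius;
    float boundaryDamping;
} ref_params;

/* Clamp one coordinate to [-1 + r, 1 - r]; scale its velocity on contact. */
static void collide_axis(float *p, float *v, float r, float damping)
{
    if (*p < -1.0f + r) { *p = -1.0f + r; *v *= damping; }
    if (*p >  1.0f - r) { *p =  1.0f - r; *v *= damping; }
}

/* Same steps as the integrate kernel; one loop iteration per work-item. */
static void integrate_ref(vec4 *pos, vec4 *vel, const ref_params *params,
                          float deltaTime, size_t numParticles)
{
    for (size_t i = 0; i < numParticles; ++i) {
        vec4 p = pos[i];
        vec4 v = vel[i];
        p.w = 1.0f;
        v.w = 0.0f;

        /* Gravity, then global damping (w stays 0). */
        v.x = (v.x + params->gravity.x * deltaTime) * params->globalDamping;
        v.y = (v.y + params->gravity.y * deltaTime) * params->globalDamping;
        v.z = (v.z + params->gravity.z * deltaTime) * params->globalDamping;

        /* Advance position (w stays 1 because v.w is 0). */
        p.x += v.x * deltaTime;
        p.y += v.y * deltaTime;
        p.z += v.z * deltaTime;

        /* Collide with the cube, axis by axis. */
        collide_axis(&p.x, &v.x, params->particleRadius, params->boundaryDamping);
        collide_axis(&p.y, &v.y, params->particleRadius, params->boundaryDamping);
        collide_axis(&p.z, &v.z, params->particleRadius, params->boundaryDamping);

        pos[i] = p;
        vel[i] = v;
    }
}
```

Running this on a copy of the initial host arrays and comparing with the data read back from `d_Pos` and `d_Vel` shows whether the device buffers were really updated.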
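The second kernel takes these positions as input and outputs a different kind of data (some indices), but it does not change the position data.
The third kernel adjusts the output of the second kernel (it works only on the second kernel's data and does not touch the position data).
Now the problem: the fourth kernel. It takes the position and velocity data (from the first kernel) and the adjusted data from the third kernel, and outputs another set of position and velocity data (through completely different pointers for these positions and velocities).
Here is the fourth kernel: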
    __kernel void findCellBoundsAndReorder(
        __global uint *d_CellStart, //output: cell start index
        __global uint *d_CellEnd, //output: cell end index
        __global float4 *d_ReorderedPos, //output: reordered by cell hash positions
        __global float4 *d_ReorderedVel, //output: reordered by cell hash velocities
        __global const uint *d_Hash, //input: sorted grid hashes
        __global const uint *d_Index, //input: particle indices sorted by hash
        __global const float4 *d_Pos, //input: positions array sorted by hash
        __global const float4 *d_Vel, //input: velocity array sorted by hash
        __local uint *localHash, //get_group_size(0) + 1 elements
        uint numParticles
    ){
        uint hash;
        const uint index = get_global_id(0);

        //Handle case when no. of particles not multiple of block size
        if(index < numParticles){
            hash = d_Hash[index];

            //Load hash data into local memory so that we can look
            //at neighboring particle's hash value without loading
            //two hash values per thread
            localHash[get_local_id(0) + 1] = hash;

            //First thread in block must load neighbor particle hash
            if(index > 0 && get_local_id(0) == 0)
                localHash[0] = d_Hash[index - 1];
        }

        barrier(CLK_LOCAL_MEM_FENCE);

        if(index < numParticles){
            //Border case
            if(index == 0)
                d_CellStart[hash] = 0;
            //Main case
            else{
                if(hash != localHash[get_local_id(0)])
                    d_CellEnd[localHash[get_local_id(0)]] = d_CellStart[hash] = index;
            }

            //Another border case
            if(index == numParticles - 1)
                d_CellEnd[hash] = numParticles;

            //Now use the sorted index to reorder the pos and vel arrays
            uint sortedIndex = d_Index[index];
            float4 pos = d_Pos[sortedIndex];
            float4 vel = d_Vel[sortedIndex];

            d_ReorderedPos[index] = pos;
            d_ReorderedVel[index] = vel;
        }
    }
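For comparison, here is a sequential sketch in plain C of what kernel 4 computes. The names `vec4` and `find_cell_bounds_and_reorder_ref` are hypothetical, and `hash[i - 1]` stands in for the neighbor value the kernel reads through `localHash`. The input arrays are `const`, which makes it explicit that this step only reads the positions and velocities.

```c
#include <stddef.h>

/* Hypothetical host-side stand-in for float4. */
typedef struct { float x, y, z, w; } vec4;

/* Sequential version of findCellBoundsAndReorder.
 * hash must already be sorted; index[i] is the original particle
 * index of the i-th sorted entry. Entries of cellStart/cellEnd for
 * empty cells are left untouched, as in the kernel. */
static void find_cell_bounds_and_reorder_ref(
    unsigned *cellStart, unsigned *cellEnd,
    vec4 *reorderedPos, vec4 *reorderedVel,
    const unsigned *hash, const unsigned *index,
    const vec4 *pos, const vec4 *vel,
    unsigned numParticles)
{
    for (unsigned i = 0; i < numParticles; ++i) {
        unsigned h = hash[i];

        if (i == 0) {
            /* Border case: the first particle starts its cell. */
            cellStart[h] = 0;
        } else if (h != hash[i - 1]) {
            /* Hash changed: this cell starts here, the previous one ends here. */
            cellStart[h] = i;
            cellEnd[hash[i - 1]] = i;
        }

        /* Border case: the last particle closes its cell. */
        if (i == numParticles - 1)
            cellEnd[h] = numParticles;

        /* Gather position and velocity into sorted order. */
        unsigned sortedIndex = index[i];
        reorderedPos[i] = pos[sortedIndex];
        reorderedVel[i] = vel[sortedIndex];
    }
}
```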
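The problem is that if I execute kernel 1 on its own (or 1 + 2, or 1 + 2 + 3), the positions and velocities are adjusted correctly by the first kernel.
But if I execute kernels 1 + 2 + 3 + 4 (even though kernel 4 does not change the input data), the data stays unchanged (as if I had executed nothing: the positions are not adjusted).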
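One way to pin down where the update gets lost, offered as a sketch rather than a diagnosis, is to read the position buffer back to the host after each kernel and count how many floats differ from a snapshot taken before the run. `count_changed` is a hypothetical helper; how the buffer is read back depends on the library in use and is not shown here.

```c
#include <stddef.h>

/* Count elements of `after` that differ from `before` by more than tol.
 * The !(d <= tol) form also counts NaN values as changed. */
static size_t count_changed(const float *before, const float *after,
                            size_t count, float tol)
{
    size_t changed = 0;
    for (size_t i = 0; i < count; ++i) {
        float d = after[i] - before[i];
        if (d < 0.0f)
            d = -d;
        if (!(d <= tol))
            ++changed;
    }
    return changed;
}
```

A result of 0 after kernel 1 in the 1 + 2 + 3 + 4 run, but a non-zero result in the 1-only run, would point at the host-side launch sequence rather than at the kernel code itself.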
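Are you using an out-of-order command queue? – matthias
I don't know what you're talking about, since I'm still new :D But I'm using a library that is simpler than the plain OpenCL calls (libstdcl)... and I call each kernel separately (so kernel 1 finishes executing, then kernel 2 is launched), because when I put them all in the queue I get error (-5), which I guess is a not-enough-memory problem –
Please add the relevant code that calls these kernels (and where you create the queue) –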