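CUDA can't run more than 1024 * 255 threads
I'm running on an Amazon K520 GPU with 1500 cores and 4 GB of RAM. I'm trying to run a kernel with 1024 * 850 threads. I know there is a maximum of 1024 threads per block, but I was surprised that I couldn't launch more than 255 blocks with 1024 threads per block (I get a launch error). I thought the limit on the grid size was 2^16. When I run an empty kernel, it goes through fine, which makes me think there isn't enough memory somewhere. Could someone explain what is happening? Thanks. Here is the kernel: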
__global__ void dotSubCentroidNorm
(
    Pt* segments,
    int pointCount,
    const Pt* centroids,
    const int* segmentChanges,
    float *dotResult
)
{
    int idx = index();
    if(idx>=pointCount)
        return;
    int segment = segments[idx].segmentIndex;
    if(segment<0)
        return;
    int segPtCount = segmentChanges[segment+1]-segmentChanges[segment];
    Pt &pt = segments[idx];
    if(segPtCount==0)
    {
        printf("segment pt count =0 %d %d\n",idx, segment);
        return;
    }
    const Pt &ctr = centroids[segment];
    pt.x=pt.x-ctr.x/segPtCount;
    pt.y=pt.y-ctr.y/segPtCount;
    pt.z=pt.z-ctr.z/segPtCount;
    dotResult[idx] = pt.x*pt.x;
    dotResult[pointCount + idx] = pt.x*pt.y;
    dotResult[pointCount*2 + idx] = pt.x*pt.z;
    dotResult[pointCount*3 + idx] = pt.y*pt.y;
    dotResult[pointCount*4 + idx] = pt.y*pt.z;
    dotResult[pointCount*5 + idx] = pt.z*pt.z;
}
And the struct:
struct Pt
{
    float x,y,z;
    int segmentIndex;
};
I call this kernel with an array of about 400,000 Pt for segments, 200 Pt for centroids, 200 segmentChanges, and 400,000 * 6 floats for dotResult. Here is the call:
....
thrust::device_vector<float> dotResult(pointCount*6);
printf("Errors1: %s \n",cudaGetErrorString(cudaGetLastError()));
int tpb = 1024; //threads per block
dim3 blocks = blkCnt(pointCount, tpb);
printf("blocks: %d %d\n", blocks.x, blocks.y);
dotSubCentroidNorm<<<blocks ,tpb>>>
(
    segments,
    pointCount,
    thrust::raw_pointer_cast(centroids.data()),
    segmentChanges,
    thrust::raw_pointer_cast(dotResult.data())
);
printf("Errors2: %s \n",cudaGetErrorString(cudaGetLastError()));
cudaThreadSynchronize();
printf("Errors3: %s \n",cudaGetErrorString(cudaGetLastError()));
....
#define blkCnt(size, threadsPerBlock) dim3(min(255,(int)floor(1+(size)/(threadsPerBlock))),floor(1+(size)/(threadsPerBlock)/256))
#define index() (threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x))
....
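A side note on the grid arithmetic in the `blkCnt` macro above: it caps x at 255 but divides by 256 for y, so for some sizes the grid may launch fewer blocks than the data needs. This is probably not what triggers the launch error, but it is easy to check on the host. Below is a minimal sketch in plain C++ (no GPU required), assuming `pointCount` and `tpb` are both `int` so the macro's divisions truncate. `Grid`, `macroGrid`, `ceilDivGrid`, and `covers` are hypothetical names introduced only for this illustration.

```cpp
#include <algorithm>

// Plain stand-in for CUDA's dim3 so the sketch compiles without the CUDA toolkit.
struct Grid {
    unsigned x;
    unsigned y;
};

// Mirrors the blkCnt macro from the question using the same integer arithmetic.
// floor() is dropped because it is a no-op on an already-truncated int quotient.
Grid macroGrid(int size, int threadsPerBlock) {
    int q = size / threadsPerBlock;  // integer division, as in the macro
    Grid g;
    g.x = static_cast<unsigned>(std::min(255, 1 + q));
    g.y = static_cast<unsigned>(1 + q / 256);
    return g;
}

// Hypothetical alternative: ceiling division, then split the block count
// across x and y so that x * y always covers the required number of blocks.
Grid ceilDivGrid(int size, int threadsPerBlock, int maxX) {
    int needed = std::max(1, (size + threadsPerBlock - 1) / threadsPerBlock);
    int x = std::min(needed, maxX);
    Grid g;
    g.x = static_cast<unsigned>(x);
    g.y = static_cast<unsigned>((needed + x - 1) / x);
    return g;
}

// True when the grid launches at least one thread per element.
bool covers(Grid g, int size, int threadsPerBlock) {
    long long threads = 1LL * g.x * g.y * threadsPerBlock;
    return threads >= size;
}
```

With 400,000 points and 1024 threads per block, `macroGrid` gives a 255 x 2 grid, which is enough. With 261,121 points (255 * 1024 + 1) it gives 255 x 1, one block short, while `ceilDivGrid(261121, 1024, 255)` gives 255 x 2. The extra threads are harmless because the kernel already returns early when `idx>=pointCount`.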
What is the question? – Massa
Why do I get an error when I launch more than 1024 * 255 kernel threads? –
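It would be better if you provided a short, complete code that someone else could copy, compile, and run. Yes, that takes some work on your part, but it will get you more effective help. What is the exact error message you are receiving? What is the value of 'pointCount'? (Yes, you've given a bunch of numbers, but I can't tell which of them is actually 'pointCount'. A complete code would make that obvious.) What happens when you run your code with 'cuda-memcheck'? –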
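On the 'cuda-memcheck' point: one kind of problem such a tool reports is an out-of-bounds read. The kernel reads `segmentChanges[segment+1]` and `centroids[segment]`, so if `segmentChanges` really has 200 entries, any point with `segmentIndex` 199 would read index 200, one past the end. Whether that happens depends on the data, which the question doesn't show, so this is only one possibility to rule out. Below is a minimal host-side sketch of that check, assuming the points are available in host memory before the launch. `firstOutOfRangePoint` is a hypothetical helper, not part of CUDA or Thrust.

```cpp
#include <cstddef>
#include <vector>

// Same layout as the Pt struct in the question.
struct Pt {
    float x, y, z;
    int segmentIndex;
};

// Returns the index of the first point whose segment would make the kernel
// read past the end of centroids or segmentChanges, or -1 if none does.
// Negative segment indices are skipped, matching the kernel's early return.
long long firstOutOfRangePoint(const std::vector<Pt>& segments,
                               std::size_t centroidCount,
                               std::size_t segmentChangesCount) {
    for (std::size_t i = 0; i < segments.size(); ++i) {
        int s = segments[i].segmentIndex;
        if (s < 0) continue;
        std::size_t seg = static_cast<std::size_t>(s);
        // The kernel reads centroids[seg] and segmentChanges[seg + 1].
        if (seg >= centroidCount || seg + 1 >= segmentChangesCount) {
            return static_cast<long long>(i);
        }
    }
    return -1;
}
```

Calling this with the real array sizes before the kernel launch shows whether any index is out of range. If it returns -1, the cause is elsewhere and the exact error string is still needed.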