2012-11-26 35 views
1

cuda profiler output:CUDA內核不能啓動CudaDeviceSynchronize

之前我有一些麻煩並行CUDA。看看附帶的圖像。內核在0.395秒的標記點處啓動。然後有一些綠色的CpuWork。最後,調用cudaDeviceSynchronize。在CpuWork之前啓動的內核在同步調用之前不啓動。理想情況下,它應該與CPU工作並行運行。

void KdTreeGpu::traceRaysOnGpuAsync(int firstRayIndex, int numRays, int rank, int buffer) 
{ 
    int per_block = 128; 
    int num_blocks = numRays/per_block + (numRays%per_block==0?0:1); 

    Ray* rays = &this->deviceRayPtr[firstRayIndex]; 
    int* outputHitPanelIds = &this->deviceHitPanelIdPtr[firstRayIndex]; 

    kdTreeTraversal<<<num_blocks, per_block, 0>>>(sceneBoundingBox, rays, deviceNodesPtr, deviceTrianglesListPtr, 
               firstRayIndex, numRays, rank, rootNodeIndex, 
               deviceTHitPtr, outputHitPanelIds, deviceReflectionPtr); 

    CUDA_VALIDATE(cudaMemcpyAsync(resultHitDistances[buffer], deviceTHitPtr, numRays*sizeof(double), cudaMemcpyDeviceToHost)); 
    CUDA_VALIDATE(cudaMemcpyAsync(resultHitPanelIds[buffer], outputHitPanelIds, numRays*sizeof(int), cudaMemcpyDeviceToHost)); 
    CUDA_VALIDATE(cudaMemcpyAsync(resultReflections[buffer], deviceReflectionPtr, numRays*sizeof(Vector3), cudaMemcpyDeviceToHost)); 
} 

該memcopies是異步的。結果緩衝區像這樣分配

unsigned int flag = cudaHostAllocPortable; 

CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[0], MAX_RAYS_PER_ITERATION*sizeof(int), flag)); 
CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[1], MAX_RAYS_PER_ITERATION*sizeof(int), flag)); 

希望能找到解決方案。已經嘗試了很多東西,包括沒有在默認流中運行。當我添加cudaHostAlloc我認識到異步方法返回到CPU。但是,當內核在稍後的deviceSynchronize調用之前不啓動時,這沒有幫助。

resultHitDistances[2]包含兩個分配的內存區域,以便當CPU讀取0時,GPU應該把結果在1

謝謝!

編輯:這是調用traceRaysAsync的代碼。

int numIterations = ceil(float(this->numPrimaryRays)/MAX_RAYS_PER_ITERATION); 
int numRaysPrevious = min(MAX_RAYS_PER_ITERATION, this->numPrimaryRays); 
nvtxRangePushA("traceRaysOnGpuAsync First"); 
traceRaysOnGpuAsync(0, numRaysPrevious, rank, 0); 
nvtxRangePop(); 

for(int iteration = 0; iteration < numIterations; iteration++) 
{ 

    int rayFrom = (iteration+1)*MAX_RAYS_PER_ITERATION; 
    int rayTo = min((iteration+2)*MAX_RAYS_PER_ITERATION, this->numPrimaryRays) - 1; 
    int numRaysIteration = rayTo-rayFrom+1; 

    // Wait for results to finish and get them 

    waitForGpu(); 
    // Trace the next iteration asynchronously. This will have data prepared for next iteration 

    if(numRaysIteration > 0) 
    { 
     int nextBuffer = (iteration+1) % 2; 
     nvtxRangePushA("traceRaysOnGpuAsync Interior"); 
     traceRaysOnGpuAsync(rayFrom, numRaysIteration, rank, nextBuffer); 
     nvtxRangePop(); 
    } 
    nvtxRangePushA("CpuWork"); 

    // Store results for current iteration 

    int rayOffset = iteration*MAX_RAYS_PER_ITERATION; 
    int buffer = iteration % 2; 

    for(int i = 0; i < numRaysPrevious; i++) 
    { 
     if(this->activeRays[rayOffset+i] && resultHitPanelIds[buffer][i] >= 0) 
     { 
      this->activeRays[rayOffset+i] = false; 
      const TrianglePanelPair & t = this->getTriangle(resultHitPanelIds[buffer][i]); 
      double hitT = resultHitDistances[buffer][i]; 

      Vector3 reflectedDirection = resultReflections[buffer][i]; 

      Result res = Result(rays[rayOffset+i], hitT, t.panel); 
      results[rank].push_back(res); 
      t.panel->incrementIntensity(1.0); 

      if (t.panel->getParent().absorbtion < 1) 
      { 
       numberOfRaysGenerated++; 

       Ray reflected (res.endPoint() + 0.00001*reflectedDirection, reflectedDirection); 

       this->newRays[rayOffset+i] = reflected; 
       this->activeRays[rayOffset+i] = true; 
       numNewRays++; 

      } 
     } 



    } 

    numRaysPrevious = numRaysIteration; 

    nvtxRangePop(); 

} 
+0

您在KdTreeGpu :: traceRaysOnGpuAsync調用後沒有顯示代碼,但這可能很有用,例如查看您在何處以及爲什麼使用cudaDeviceSynchronize()調用?我認爲你在調用KdTreeGpu :: traceRaysOnGpuAsync後立即發出devicesync,但這會消除你的重疊。這是您想要重疊的區域,並假設第二個綠色CpuWork欄不依賴於kdTreeTraversal的結果,那麼您希望移動或消除您的內核函數調用後的deviceync。在設備同步之前重新考慮一些CpuWork *。 –

+0

我添加了一些已經清除了一些定時器的代碼,所以應該更容易遵循。這兩個緩衝區應該使CpuWork獨立於內核啓動。 – apartridge

回答

4

這是Windows預期的行爲與WDDM驅動程序模型,其中駕駛員試圖通過嘗試批量內核啓動,以減輕內核啓動開銷。嘗試在內核調用後直接插入cudaStreamQuery(0),以在批次滿之前觸發內核的早期啓動。

+1

爲避免WDDM驅動程序型號出現性能問題,請考慮切換到TCC驅動程序。 – njuffa

+0

內核之後的一個和兩個memcopy的後一個做了訣竅。內核之後只有一個,memcpy被延遲到syncrhonize。現在,它是在適當的平行。謝謝! – apartridge