2012-11-16 41 views

Why does this CUDA kernel give different results from the original code? I ported this code:

if(_layersCount > 1)
{
    for(int i = _layersCount - 2; i >= 0; i--)
    {
        for(int j = 0; j < _neuronsPerLayerCount[i]; j++) // cuda kernel
        {
            localGradients[indexByLayerAndNeuron(i, j)] = 0;

            for(int k = 0; k < _neuronsPerLayerCount[i+1]; k++)
            {
                localGradients[indexByLayerAndNeuron(i, j)] += _neuronsInputsWeights[indexByLayerNeuronAndInput(i+1, k, j)]
                                                             * localGradients[indexByLayerAndNeuron(i+1, k)];
            }

            localGradients[indexByLayerAndNeuron(i, j)] *= derivatives[indexByLayerAndNeuron(i, j)];
        }
    }
}

to CUDA:

if(_layersCount > 1)
{
    for(int i = _layersCount - 2; i >= 0; i--)
    {
        // calculateLocalGradientsForAnotherLayers
        blocksCount = floor((double) _neuronsPerLayerCount[i] / threads.x) + 1;
        blocks = dim3(blocksCount, 1);

        calculateLocalGradientsForAnotherLayers <<<blocks, threads>>> (deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives,
            _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i], _neuronsInPreviousLayers[i+1],
            _neuronsPerLayerCount[i+1], _inputsInPreviousLayers[i], _inputsInCurrentLayer[i]);
    }
}

The calculateLocalGradientsForAnotherLayers kernel:

__global__ void calculateLocalGradientsForAnotherLayers(double * localGradients, double * neuronsInputsWeights, double * derivatives,
        int neuronsCount, int neuronsInPreviousLayers, int neuronsInPreviousLayersWithCurrent,
        int neuronsInNextLayer, int inputsInPreviousLayers, int inputsInCurrentLayer)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if(idx < neuronsCount)
    {
        int neuron = neuronsInPreviousLayers + idx;

        localGradients[neuron] = 0;

        // this to Kernel, then reduce localGradients.
        for(int k = 0; k < neuronsInNextLayer; k++)
        {
            localGradients[neuron] += neuronsInputsWeights[inputsInPreviousLayers + k*inputsInCurrentLayer + idx]
                                    * localGradients[neuronsInPreviousLayersWithCurrent + k];
        }

        localGradients[neuron] *= derivatives[neuron];
    }
}

But the results differ from the serial version starting at the second decimal place. Why is the error so large? Apart from this, all the other kernels work fine.

My GPU is an NV GF555M. It supports double precision.

How do you invoke the kernel? What grid/block sizes? – ahmad

See the second code block. threads.x is 512. – Robotex

Answers


I found the problem. Instead of the line:

calculateLocalGradientsForAnotherLayers <<<blocks, threads>>> (deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives,
    _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i], _neuronsInPreviousLayers[i+1],
    _neuronsPerLayerCount[i+1], _inputsInPreviousLayers[i], _inputsInCurrentLayer[i]);

it should read:

calculateLocalGradientsForAnotherLayers <<<blocks, threads>>> (deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives,
    _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i], _neuronsInPreviousLayers[i+1],
    _neuronsPerLayerCount[i+1], _inputsInPreviousLayers[i+1], _inputsInCurrentLayer[i+1]);

In the body of the kernel, you need some kind of inter-block synchronization over the localGradients array:

for(int k = 0; k < neuronsInNextLayer; k++)
{
    localGradients[neuron] += neuronsInputsWeights[inputsInPreviousLayers + k*inputsInCurrentLayer + idx]
                            * localGradients[neuronsInPreviousLayersWithCurrent + k];
}

Concurrent read/write access can corrupt the actual values of the localGradients elements. Since the reads and writes are not synchronized, you may see random results.


How do I add the synchronization? What if I accumulate into a temporary variable and then store it into localGradients[neuron]? – Robotex


Let's start with the serial code shown in the question body (code block 1). Do the 'for' loops (i, j, and k) have independent iterations? Clearly the iterations of the 'k' loop are dependent. What about the 'i' and 'j' loops? – ahmad
