2011-01-22

CUDA: embedded for loop kernel

I have some code that I want to turn into a CUDA kernel. Here it is:

for (r = Y; r < Y + H; r += 2)
{
    ch1RowSum = ch2RowSum = ch3RowSum = 0;
    for (c = X; c < X + W; c += 2)
    {
        chan1Value = // some calc'd value
        chan3Value = // some calc'd value
        chan2Value = // some calc'd value
        ch2RowSum += chan2Value;
        ch3RowSum += chan3Value;
        ch1RowSum += chan1Value;
    }
    ch1Mean += ch1RowSum/W;
    ch2Mean += ch2RowSum/W;
    ch3Mean += ch3RowSum/W;
}

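If I were to split this into two kernels, one to calculate the RowSums and one to calculate the Means, how should I handle the fact that my loop indexes do not start at zero and end at N?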

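Try to ask just one question; otherwise it is hard to pick a right answer. As for your second question, it is hard to answer specifically, but I think you will see further once you start developing the kernel. – jmilloy 2011-01-22 23:24:14

You should launch your kernel with a configuration of H blocks and W threads per block. Then you calculate r and c from the blockIdx and threadIdx values inside the kernel. Calculate r and c however you want; I have tried to show this in my answer below. – jmilloy 2011-01-22 23:26:18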

Answer

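Let's suppose you have a kernel that computes the three values. Each thread in your launch configuration computes the three values for one (r, c) pair.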

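__global__ void value_kernel(int Y, int H, int X, int W)
{
    // Launch with H blocks of W threads: one block per row, one thread per column.
    int r = blockIdx.x + Y;    // block 0 handles row Y
    int c = threadIdx.x + X;   // thread 0 handles column X (offset by X, not W)

    chan1value = ...
    chan2value = ...
    chan3value = ...
}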
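To make the index mapping concrete, here is a minimal sketch of how such a kernel might be launched so that the loop bounds need not start at zero. It also keeps the question's step of 2. Everything beyond Y, H, X and W is an assumption made for illustration: `placeholderValue` stands in for the elided per-pixel calculation, and `valueKernel`, `runValueKernel`, `step` and the output layout are hypothetical names and choices. Error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the per-pixel calculation that the question elides.
__device__ float placeholderValue(int r, int c) { return (float)(r + c); }

// One block per visited row, one thread per visited column.
__global__ void valueKernel(int Y, int X, int step, float *out)
{
    int r = Y + step * blockIdx.x;    // blockIdx.x == 0 maps to row Y
    int c = X + step * threadIdx.x;   // threadIdx.x == 0 maps to column X
    out[blockIdx.x * blockDim.x + threadIdx.x] = placeholderValue(r, c);
}

// Host helper: computes one value per visited (r, c) in the H x W window at (Y, X).
std::vector<float> runValueKernel(int Y, int H, int X, int W, int step)
{
    int rows = (H + step - 1) / step;   // number of iterations of the outer loop
    int cols = (W + step - 1) / step;   // number of iterations of the inner loop;
                                        // must fit in one block (at most 1024 threads)
    std::vector<float> host(rows * cols);
    float *dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(float));
    valueKernel<<<rows, cols>>>(Y, X, step, dev);
    cudaMemcpy(host.data(), dev, host.size() * sizeof(float),
               cudaMemcpyDeviceToHost);   // also waits for the kernel to finish
    cudaFree(dev);
    return host;
}
```

Calling `runValueKernel(Y, H, X, W, 2)` visits the same (r, c) pairs as the original nested loops. The offsets Y and X are added inside the kernel, so the grid itself always starts at index zero.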

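I don't believe you can compute the sums in the value kernel above (not completely in parallel, at least), because you cannot use += across threads the way the serial loop does. You could put everything in one kernel if only one thread in each block (row) computes the sum and the mean, like this: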

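__global__ void both_kernel(int Y, int H, int X, int W)
{
    // Each thread's values live in its own registers, so they are staged in
    // shared memory for thread 0 to read. float is assumed; the question does
    // not give the types. Launch with 3 * blockDim.x * sizeof(float) bytes of
    // dynamic shared memory.
    extern __shared__ float vals[];
    float *chan1value = vals;
    float *chan2value = vals + blockDim.x;
    float *chan3value = vals + 2 * blockDim.x;

    int r = blockIdx.x + Y;
    int c = threadIdx.x + X;

    chan1value[threadIdx.x] = ...
    chan2value[threadIdx.x] = ...
    chan3value[threadIdx.x] = ...
    __syncthreads();   // all values must be written before thread 0 sums them

    if (threadIdx.x == 0)
    {
        float ch1RowSum = 0;
        float ch2RowSum = 0;
        float ch3RowSum = 0;

        for (int i = 0; i < blockDim.x; i++)
        {
            ch1RowSum += chan1value[i];
            ch2RowSum += chan2value[i];
            ch3RowSum += chan3value[i];
        }

        // Means for row r; every block computes its own.
        ch1Mean = ch1RowSum/blockDim.x;
        ch2Mean = ch2RowSum/blockDim.x;
        ch3Mean = ch3RowSum/blockDim.x;
    }
}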

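But it would probably be better to use the first value kernel and then a second kernel for both the sums and the means. The kernel below can be parallelized further, and if it is separate, you can focus on that when you are ready.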

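// Launch with one block per row and one thread per block.
// Assumption: chan1value, chan2value and chan3value are global-memory arrays
// filled by value_kernel, stored row by row with W entries per row, and
// ch1Mean, ch2Mean and ch3Mean hold one result per row. float is assumed.
__global__ void sum_kernel(int Y, int W)
{
    int r = blockIdx.x + Y;

    float ch1RowSum = 0;
    float ch2RowSum = 0;
    float ch3RowSum = 0;

    for (int i = 0; i < W; i++)
    {
        ch1RowSum += chan1value[blockIdx.x * W + i];
        ch2RowSum += chan2value[blockIdx.x * W + i];
        ch3RowSum += chan3value[blockIdx.x * W + i];
    }

    // One mean per row, so blocks do not overwrite each other's result.
    ch1Mean[blockIdx.x] = ch1RowSum/W;
    ch2Mean[blockIdx.x] = ch2RowSum/W;
    ch3Mean[blockIdx.x] = ch3RowSum/W;
}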
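As one possible way to parallelize the row sums further, here is a sketch of a shared-memory tree reduction for a single channel. This is a swapped-in technique, not something from the answer itself. It assumes the per-pixel values were already written row by row to a float array, and `rowMeanKernel`, `rowMeans` and the block size of 128 are hypothetical choices. Error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One block per row. Threads first add up strided slices of the row, then
// combine their partial sums pairwise in shared memory.
// blockDim.x is assumed to be a power of two.
__global__ void rowMeanKernel(const float *values, int cols, float *rowMean)
{
    extern __shared__ float partial[];
    const float *row = values + blockIdx.x * cols;

    float sum = 0.0f;
    for (int i = threadIdx.x; i < cols; i += blockDim.x)
        sum += row[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Halve the number of active threads each round.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        rowMean[blockIdx.x] = partial[0] / cols;
}

// Host helper: returns the mean of each row of a rows x cols array.
std::vector<float> rowMeans(const std::vector<float> &values, int rows, int cols)
{
    const int threads = 128;   // power of two, as the reduction loop requires
    float *dValues = nullptr, *dMeans = nullptr;
    cudaMalloc(&dValues, values.size() * sizeof(float));
    cudaMalloc(&dMeans, rows * sizeof(float));
    cudaMemcpy(dValues, values.data(), values.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    rowMeanKernel<<<rows, threads, threads * sizeof(float)>>>(dValues, cols, dMeans);

    std::vector<float> means(rows);
    cudaMemcpy(means.data(), dMeans, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dValues);
    cudaFree(dMeans);
    return means;
}
```

Compared with the single-threaded loop, each row is summed in roughly log2(blockDim.x) combining steps after the strided pass. Whether that pays off depends on how wide the rows are.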