NaN problem with cuFFT

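I am writing a frequency filtering application for a school assignment in C++ and CUDA using cuFFT, and I cannot get it to work. You can find the whole Visual Studio 2010 solution here. (It needs GLUT.)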

Here is the part I think is relevant: (fourierUtils.cu/194)

////////////////////////////////////////////////////////////////////////////// 
// Function to help invoking the kernel, creates the parameters and gets 
// the result 
__host__ 
void Process(
     BitmapStruct& in_img, // these contain an image in an rgba byte array 
     BitmapStruct& out_img, 
     MaskGenerator maskGenerator, // this is a pointer to a device function 
     float param1, // mask parameters 
     float param2) 
{  
    // Declare and allocate variables 
    cufftHandle plan; 

    cufftReal* img; 
    cufftReal* dev_img; 
    cufftComplex* dev_freq_img; 

    int imgsize = in_img.image_size(); 
    int pixelcount = imgsize/4; 

    img = new float[pixelcount]; 
    checkResult(
     cudaMalloc(&dev_img, sizeof(cufftReal) * pixelcount)); 
    checkResult(
     cudaMalloc(&dev_freq_img, sizeof(cufftComplex) * pixelcount)); 

    // Optimize execution 
    cudaFuncAttributes attrs; 
    checkResult(
     cudaFuncGetAttributes(&attrs, &Filter)); 
    std::pair<dim3, dim3> params 
     = Optimizer::GetOptimalParameters(pixelcount, attrs); 

    // Process r, g, b channels 
    for(int chan = 0; chan <= 2; chan++) 
    { 
     // Init 
     for(int i = 0; i < pixelcount; i++) 
     { 
      img[i] = in_img.pixels[4 * i + chan]; 
     } 

     checkResult(
      cudaMemcpy(dev_img, img, pixelcount, cudaMemcpyHostToDevice)); 

     // Create frequency image 
     checkResult(
      cufftPlan1d(&plan, pixelcount, CUFFT_R2C, 1)); 
     checkResult(
      cufftExecR2C(plan, dev_img, dev_freq_img)); 
     checkResult(
      cudaThreadSynchronize()); 
     checkResult(
      cufftDestroy(plan)); 

     // Mask frequency image 
     Filter<<<params.first, params.second>>>(
      dev_freq_img, in_img.x, in_img.y, maskGenerator, param1, param2); 
     getLastCudaError("Filtering the image failed."); 

     // Get result 
     checkResult(
      cufftPlan1d(&plan, pixelcount, CUFFT_C2R, 1)); 
     checkResult(
      cufftExecC2R(plan, dev_freq_img, dev_img)); 
     checkResult(
      cudaThreadSynchronize()); 
     checkResult(
      cufftDestroy(plan)); 
     checkResult(
      cudaMemcpy(img, dev_img, pixelcount, cudaMemcpyDeviceToHost)); 

     for(int i = 0; i < pixelcount; i++) 
     { 
      out_img.pixels[4 * i + chan] = img[i]; 
     } 
    } 

    // Copy alpha channel 
    for(int i = 0; i < pixelcount; i++) 
    { 
     out_img.pixels[4 * i + 3] = in_img.pixels[4 * i + 3]; 
    } 

    // Free memory 
    checkResult(
     cudaFree(dev_freq_img)); 
    checkResult(
     cudaFree(dev_img)); 
    delete[] img; // img was allocated with new[]

    getLastCudaError("An error occured during processing the image."); 
}

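I cannot see any real difference from the official examples I have seen, but when I debug it with Nsight, all the cufftComplex values my kernel receives are NaN, and the only difference between the input and result images is that the result has a black bar at the bottom, no matter which filter mask and which parameters I use. All CUDA and cuFFT calls return success, and no error is reported after the kernel launch either.

What should I do?

I have already tried replacing img and dev_img with complex arrays and using a C2C transform done in place, but that only changed the size of the black bar in the result image.

Thanks for your help.

Edit: here is a simplified version that does not need GLUT and should also compile on Linux.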

2013-11-28 KáGé

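If you omit the filtering step, do you get your original image back, or do you still get 'NaN's? –

I think people would have to spend some time dealing with the whole VS project, including GLUT, especially Linux users. Could you provide a more concise example that reproduces your problem? – JackOLantern

@PaulR I get the NaNs in the filtering step, but omitting it does not change the final result (the filter tries to multiply the NaNs, which does nothing to them). I may be wrong, but it seems to me that my kernel cannot access the memory that dev_freq_img points to (which is strange, since it is in device memory). The black bar that appears may be a different problem. –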

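My mistake was forgetting to multiply the number of items by their size in some of the cudaMemcpy calls, so only the first part of each buffer was copied and the end of the vector fed to cuFFT consisted of NaNs. Fixing those calls solved the problem.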
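The effect of passing an element count where a byte count is expected can be reproduced on the host. The following is a minimal sketch, not the original code: it uses std::memcpy as a stand-in for cudaMemcpy (both take a size in bytes), and the function name nan_count_after_copy and the NaN pre-fill are assumptions made for illustration. Real uninitialized device memory holds arbitrary values, which need not be NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <limits>
#include <vector>

// Hypothetical helper: copies `count` floats from a source buffer into a
// destination buffer and returns how many destination elements are still NaN.
// If `scale_by_size` is false, the element count is passed as the byte count,
// which reproduces the mistake described above.
std::size_t nan_count_after_copy(std::size_t count, bool scale_by_size)
{
    assert(count > 0);

    std::vector<float> src(count, 1.0f);
    // Assumption: NaN stands in for whatever the uncopied memory contains.
    std::vector<float> dst(count, std::numeric_limits<float>::quiet_NaN());

    const std::size_t bytes = scale_by_size ? count * sizeof(float) : count;
    std::memcpy(dst.data(), src.data(), bytes);

    std::size_t nans = 0;
    for (float v : dst) {
        if (std::isnan(v)) {
            ++nans;
        }
    }
    return nans;
}
```

With 4-byte floats, calling nan_count_after_copy(16, false) copies only the first 4 of 16 elements, while nan_count_after_copy(16, true) copies all of them.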

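I also replaced the cufftReal arrays with cufftComplex ones, because the C2C transforms seemed more predictable, and added normalization of the values, since cuFFT transforms are unnormalized and a forward transform followed by an inverse one scales the data by the number of elements.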
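Why the normalization is needed can be seen with a small host-only sketch. It does not call cuFFT; it uses a naive unnormalized DFT with the same sign convention (forward is -1, inverse is +1), and the names dft and round_trip are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cdouble = std::complex<double>;

// Naive unnormalized DFT. sign = -1 is the forward transform, +1 the inverse,
// which matches the values of CUFFT_FORWARD and CUFFT_INVERSE.
std::vector<cdouble> dft(const std::vector<cdouble>& in, int sign)
{
    assert(sign == 1 || sign == -1);
    const std::size_t n = in.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cdouble> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cdouble sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                sign * two_pi * static_cast<double>(k * j) / static_cast<double>(n);
            sum += in[j] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Forward transform followed by inverse transform. The real parts are divided
// by n only when `normalize` is true.
std::vector<double> round_trip(const std::vector<double>& signal, bool normalize)
{
    const std::vector<cdouble> data(signal.begin(), signal.end());
    const std::vector<cdouble> back = dft(dft(data, -1), +1);
    const double scale = normalize ? static_cast<double>(signal.size()) : 1.0;

    std::vector<double> result(signal.size());
    for (std::size_t i = 0; i < signal.size(); ++i) {
        result[i] = back[i].real() / scale;
    }
    return result;
}
```

Without the division, every value comes back multiplied by the signal length, which for an image would saturate almost every pixel.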

So the final working method is:

/////////////////////////////////////////////////////////////////////////////// 
// Function to help invoking the kernel, creates the parameters and gets 
// the result 
__host__ 
void Process(
     BitmapStruct& in_img, 
     BitmapStruct& out_img, 
     MaskGenerator maskGenerator, 
     float param1, 
     float param2) 
{  
    // Declare and allocate variables 
    cufftHandle plan; 

    cufftComplex* img; 
    cufftComplex* dev_img; 
    cufftComplex* dev_freq_img; 

    int imgsize = in_img.image_size(); 
    int pixelcount = imgsize/4; 

    img = new cufftComplex[pixelcount]; 
    checkResult(
     cudaMalloc(&dev_img, sizeof(cufftComplex) * pixelcount)); 
    checkResult(
     cudaMalloc(&dev_freq_img, sizeof(cufftComplex) * pixelcount)); 

    // Optimize execution 
    cudaFuncAttributes attrs; 
    checkResult(
     cudaFuncGetAttributes(&attrs, &Filter)); 
    std::pair<dim3, dim3> params = 
      Optimizer::GetOptimalParameters(pixelcount, attrs); 

    // Process r, g, b channels 
    for(int chan = 0; chan <= 2; chan++) 
    { 
     // Init 
     for(int i = 0; i < pixelcount; i++) 
     { 
      img[i].x = in_img.pixels[4 * i + chan]; 
      img[i].y = 0; 
     } 

     checkResult(
      cudaMemcpy(
       dev_img, 
       img, 
       pixelcount * sizeof(cufftComplex), 
       cudaMemcpyHostToDevice)); 

     // Create frequency image 
     checkResult(
      cufftPlan1d(&plan, pixelcount, CUFFT_C2C, 1)); 
     checkResult(
      cufftExecC2C(plan, dev_img, dev_freq_img, CUFFT_FORWARD)); 
     checkResult(
      cudaThreadSynchronize()); 
     checkResult(
      cufftDestroy(plan)); 

     // Mask frequency image 
     Filter<<<params.first, params.second>>>(
      dev_freq_img, 
      in_img.x, 
      in_img.y, 
      maskGenerator, 
      param1, 
      param2); 
     getLastCudaError("Filtering the image failed."); 

     // Get result 
     checkResult(
      cufftPlan1d(&plan, pixelcount, CUFFT_C2C, 1)); 
     checkResult(
      cufftExecC2C(plan, dev_freq_img, dev_img, CUFFT_INVERSE)); 
     checkResult(
      cudaThreadSynchronize()); 
     checkResult(
      cufftDestroy(plan)); 
     checkResult(
      cudaMemcpy(
       img, 
       dev_img, 
       pixelcount * sizeof(cufftComplex), 
       cudaMemcpyDeviceToHost)); 

     for(int i = 0; i < pixelcount; i++) 
     { 
      out_img.pixels[4 * i + chan] = img[i].x/pixelcount; 
     } 
    } 

    // Copy alpha channel 
    for(int i = 0; i < pixelcount; i++) 
    { 
     out_img.pixels[4 * i + 3] = in_img.pixels[4 * i + 3]; 
    } 

    // Free memory 
    checkResult(
     cudaFree(dev_freq_img)); 
    checkResult(
     cudaFree(dev_img)); 
    delete[] img; // img was allocated with new[]

    getLastCudaError("An error occured during processing the image."); 
}

Thanks for your help.

2013-11-29 22:10:12

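I have not compiled and run the reduced version, but I think the problem lies in the sizes of dev_img and dev_freq_img.

Consider the example in section 4.2 of the CUFFT Library User Guide. It performs an in-place real-to-complex transform, which is the same step you perform first.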

#define NX 256 
#define BATCH 10 // assumed value; the excerpt as quoted here does not define BATCH

cufftHandle plan; 
cufftComplex *data; 
cudaMalloc((void**)&data, sizeof(cufftComplex)*(NX/2+1)*BATCH); 

cufftPlan1d(&plan, NX, CUFFT_R2C, BATCH); 
cufftExecR2C(plan, (cufftReal*)data, data);

Because the transform of real input is symmetric, cufftExecR2C fills only NX/2+1 output elements, where NX is the size of the input array.
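The symmetry in question is that, for real input of length n, bin n-k is the complex conjugate of bin k, so the bins above n/2 carry no new information. A host-only sketch that checks this with a naive DFT is given below; it does not use cuFFT, and r2c_output_count, full_dft and upper_half_is_redundant are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Number of complex outputs a real-to-complex transform of length n stores.
std::size_t r2c_output_count(std::size_t n)
{
    assert(n > 0);
    return n / 2 + 1;
}

// Naive forward DFT of a real signal that computes all n bins.
std::vector<std::complex<double>> full_dft(const std::vector<double>& x)
{
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<std::complex<double>> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                -two_pi * static_cast<double>(k * j) / static_cast<double>(n);
            out[k] += x[j] * std::polar(1.0, angle);
        }
    }
    return out;
}

// Returns true if every bin beyond the first n/2+1 equals the complex
// conjugate of its mirror bin, i.e. the upper bins are redundant.
bool upper_half_is_redundant(const std::vector<double>& x)
{
    const std::vector<std::complex<double>> X = full_dft(x);
    const std::size_t n = x.size();
    for (std::size_t k = r2c_output_count(n); k < n; ++k) {
        if (std::abs(X[k] - std::conj(X[n - k])) > 1e-9) {
            return false;
        }
    }
    return true;
}
```

For NX = 256 this gives 129 stored outputs, which is why a kernel that reads all 256 complex elements of the output buffer touches memory the transform never wrote.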

In your case, you are doing the following:

cufftHandle plan; 

cufftReal* dev_img; 
cufftComplex* dev_freq_img; 

cudaMalloc(&dev_img, sizeof(cufftReal) * pixelcount); 
cudaMalloc(&dev_freq_img, sizeof(cufftComplex) * pixelcount);

So you allocate a cufftReal array and a cufftComplex array with the same number of elements. When you use

cufftPlan1d(&plan, pixelcount, CUFFT_R2C, 1); 
cufftExecR2C(plan, dev_img, dev_freq_img);

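only about half of dev_freq_img (pixelcount/2+1 elements) is filled by cufftExecR2C, and the remainder contains garbage. If you use the full range of dev_freq_img in the Filter __global__ function, that may be the cause of your NaNs.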

2013-11-28 21:47:53 JackOLantern

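I rewrote it as 'cudaMalloc(&dev_freq_img, sizeof(cufftComplex) * freqImgSize)', where 'int freqImgSize = pixelcount/2 + 1;', and I launch that many instances of my kernel, but unfortunately it did not change anything. The whole array is NaNs, and the result image has a black bar at the bottom but is otherwise unchanged. –

Could you use 'cuda-memcheck' to see whether you are violating memory bounds? Have you found out which routine generates the 'NaN's? – JackOLantern

OK, I will try that. I do not know which routine it is; I will try dumping the data after each call. –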
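One way to dump the data after each call is to copy the buffer back to the host and count the NaNs it contains. The sketch below covers only the host-side check and assumes the data has already been copied back with cudaMemcpy; count_nans and stage_is_clean are hypothetical helpers, not part of CUDA or cuFFT. A cufftComplex buffer of n elements can be checked as 2*n floats, since each element is a pair of floats.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdio>

// Counts the NaN values in a host buffer of n floats.
std::size_t count_nans(const float* data, std::size_t n)
{
    assert(data != nullptr);
    std::size_t nans = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::isnan(data[i])) {
            ++nans;
        }
    }
    return nans;
}

// Prints a one-line report for a pipeline stage and returns true if the
// buffer contains no NaN. `stage` is only a label for the printed message.
bool stage_is_clean(const char* stage, const float* data, std::size_t n)
{
    const std::size_t nans = count_nans(data, n);
    std::printf("%s: %zu of %zu values are NaN\n", stage, nans, n);
    return nans == 0;
}
```

Calling stage_is_clean after the upload, after the forward transform, after the filter kernel and after the inverse transform narrows down the first stage at which NaNs appear.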
