CUDA - Why is my device data not being transferred to the host?

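I'm currently having a specific difficulty with CUDA programming: copying and reading an array that the device sends back to the host. When I try to read the data that should have been returned, all I get is garbage. Could anyone take a look at my code snippet and tell me what I'm doing wrong? Thank you very much!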

struct intss { 
u_int32_t one; 
u_int32_t two; 
}; 



int main() 
{ 
    int block_size = 3;    
    int grid_size = 1; 

    intss *device_fb = 0; 
    intss *host_fb = 0; 


    int num_bytes_fb = (block_size*grid_size)*sizeof(intss); 


    host_fb = (intss*)malloc(num_bytes_fb); 
    cudaMalloc((void **)&device_fb, num_bytes_fb); 

    .... 

    render2<<<block_size,grid_size>>>(device_fb, device_pixelspercore, samples, obj_list_flat_dev, numOpsPerCore, lnumdev, camdev, lightsdev, uranddev, iranddev); 


    .... 

    cudaMemcpy(host_fb, device_fb, num_bytes_fb, cudaMemcpyDeviceToHost); 


    printf("output %u ", host_fb[0].one); 

    printf("output %u ", host_fb[1].one); 

    printf("output %u ", host_fb[2].one); 
    // Note that I'm only looking at the 3 elements 0-2 of host_fb, because block_size*grid_size = 3. Is this wrong? 

    cudaFree(device_fb); 
    free(host_fb); 
} 


__global__ void render2(intss *device_fb, struct parallelPixels *pixelsPerCore, int  samples, double *obj_list_flat_dev, int numOpsPerCore, int lnumdev, struct camera camdev, struct vec3 *lightsdev, struct vec3 *uranddev, int *iranddev)   //SPECIFY ARGUMENTS!!! 
{ 
int index = blockIdx.x * blockDim.x + threadIdx.x; //DETERMINING INDEX BASED ON WHICH THREAD IS CURRENTLY RUNNING 

.... 

//computing data... 


device_fb[index].one = (((u_int32_t)(MIN(r, 1.0) * 255.0) & 0xff) << RSHIFT | 
        ((u_int32_t)(MIN(g, 1.0) * 255.0) & 0xff) << GSHIFT | 
        ((u_int32_t)(MIN(b, 1.0) * 255.0) & 0xff) << BSHIFT); 
}

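Edit:

Thanks to a suggestion, I have implemented the cudaErrorCheck function in my program, and there seems to be a pattern in which calls give me errors. In my program I have a bunch of global host arrays (obj_list, lights, urand, irand). Whenever I try to use cudaMemcpy to copy these host arrays to device arrays, I receive the following error: "Cuda error in file 'cudatrace.cu' in line x : invalid argument."

obj_list and lights are populated in the following function, load_scene():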

void load_scene(FILE *fp) {
char line[256], *ptr, type;

obj_list = (sphere *)malloc(sizeof(struct sphere)); 
obj_list->next = 0; 
objCounter = 0; 

while((ptr = fgets(line, 256, fp))) { 
    int i; 
    struct vec3 pos, col; 
    double rad, spow, refl; 

    while(*ptr == ' ' || *ptr == '\t') ptr++; 
    if(*ptr == '#' || *ptr == '\n') continue; 

    if(!(ptr = strtok(line, DELIM))) continue; 
    type = *ptr; 

    for(i=0; i<3; i++) { 
     if(!(ptr = strtok(0, DELIM))) break; 
     *((double*)&pos.x + i) = atof(ptr); 
    } 

    if(type == 'l') { 
     lights[lnum++] = pos; 
     continue; 
    } 

    if(!(ptr = strtok(0, DELIM))) continue; 
    rad = atof(ptr); 

    for(i=0; i<3; i++) { 
     if(!(ptr = strtok(0, DELIM))) break; 
     *((double*)&col.x + i) = atof(ptr); 
    } 

    if(type == 'c') { 
     cam.pos = pos; 
     cam.targ = col; 
     cam.fov = rad; 
     continue; 
    } 

    if(!(ptr = strtok(0, DELIM))) continue; 
    spow = atof(ptr); 

    if(!(ptr = strtok(0, DELIM))) continue; 
    refl = atof(ptr); 

    if(type == 's') { 
     objCounter++; 
     struct sphere *sph = (sphere *)malloc(sizeof(*sph)); 
     sph->next = obj_list->next; 
     obj_list->next = sph; 

     sph->pos = pos; 
     sph->rad = rad; 
     sph->mat.col = col; 
     sph->mat.spow = spow; 
     sph->mat.refl = refl; 

    } else { 
     fprintf(stderr, "unknown type: %c\n", type); 
    } 
}

}

urand and irand are populated in main as follows:

/* initialize the random number tables for the jitter */ 
for(i=0; i<NRAN; i++) urand[i].x = (double)rand()/RAND_MAX - 0.5; 
for(i=0; i<NRAN; i++) urand[i].y = (double)rand()/RAND_MAX - 0.5; 
for(i=0; i<NRAN; i++) irand[i] = (int)(NRAN * ((double)rand()/RAND_MAX));

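I don't think the invalid argument can be caused by the device arrays, because the cudaMalloc calls that create the device arrays before the cudaMemcpy calls produce no CUDA error messages. For example, in the following lines of code: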

cudaErrorCheck(cudaMalloc((void **)&lightsdev, MAX_LIGHTS*sizeof(struct vec3))); 

cudaErrorCheck(cudaMemcpy(&lightsdev, &lights, sizeof(struct vec3) * MAX_LIGHTS, cudaMemcpyHostToDevice));

cudaMalloc produces no error, but cudaMemcpy does.
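One thing worth checking in the two lines above, offered as a hedged observation rather than a confirmed diagnosis: the cudaMemcpy call passes &lightsdev, which is the host address of the pointer variable, not the device address that cudaMalloc stored in it. cudaMalloc needs the address of the pointer so that it can write to it, while cudaMemcpy needs the pointer value itself. The sketch below assumes lights is a host array of MAX_LIGHTS vec3 elements; the value 16 for MAX_LIGHTS, the z field, and the helper copy_lights_roundtrip are placeholders that do not come from the original program.

```cuda
#include <cuda_runtime.h>

#define MAX_LIGHTS 16  // placeholder value, not taken from the original program

struct vec3 { double x, y, z; };  // assumed layout

// Hypothetical helper: copies `lights` to the device and straight back into `out`.
// Returns cudaSuccess only if every runtime call succeeded.
cudaError_t copy_lights_roundtrip(const vec3 *lights, vec3 *out)
{
    vec3 *lightsdev = 0;
    size_t bytes = MAX_LIGHTS * sizeof(vec3);

    // cudaMalloc takes the ADDRESS of the pointer so it can store the device address in it.
    cudaError_t err = cudaMalloc((void **)&lightsdev, bytes);
    if (err != cudaSuccess) return err;

    // cudaMemcpy takes the pointer VALUE (no &): destination first, then source.
    err = cudaMemcpy(lightsdev, lights, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(out, lightsdev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(lightsdev);
    return err;
}
```

If the same pattern (an & in front of the device pointer) appears in the other cudaMemcpy calls, it would be consistent with every one of them reporting "invalid argument".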

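In case I haven't provided enough information about my code, I have pasted the entire code here: http://pastebin.com/UgzABPgH

(Note that in the pastebin version, I removed the cudaErrorCheck wrappers around the cudaMemcpy calls that were producing the errors.)

Thank you very much!

Edit: Actually, I just tried to see what would happen if urand and irand were not global and were instead initialized alongside the device arrays uranddev and iranddev. I still get the same "invalid argument" error, so whether or not the variables are global cannot be the issue.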

Source

2011-11-15 albireneo

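Have you tried assigning "known" data? I.e.: device_fb[index].one = blockIdx.x; the expected values would be 0, 1, 2. – pQB

Yes, I just tried with 'known' data, and there does seem to be a connection. I'm not quite sure what this means, because the functions being called to produce the output should work sequentially. – albireneo

Rather, by "connection" I mean that the host array can read the correct values being copied from the device array. – albireneo

I think you are not using the <<< >>> syntax correctly.

Here is a kernel invocation from the CUDA Programming Guide: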

MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

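This means the grid size should go first.

There is also a limit on the maximum size of the kernel parameters. See this. If you exceed it, I'm not sure whether the compiler complains or just goes on to do bad things.

If I remove all the parameters except device_fb and just set device_fb[index]=index in the kernel, I can read the values back successfully.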
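A minimal sketch of that experiment, with the grid size placed first in the launch configuration, might look like the following. It reuses the intss struct from the question but writes to the .one member explicitly and uses uint32_t from <stdint.h> in place of u_int32_t; the kernel name fill_known and the helper run_fill_known are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

struct intss { uint32_t one; uint32_t two; };

// Each thread writes its own global index so the host can verify the copy.
__global__ void fill_known(intss *fb, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {               // guard against threads past the end of the array
        fb[index].one = index;
        fb[index].two = blockIdx.x;
    }
}

// Hypothetical helper: launches grid_size blocks of block_size threads
// and copies the results into host_fb (which must hold grid_size*block_size elements).
cudaError_t run_fill_known(intss *host_fb, int grid_size, int block_size)
{
    int n = grid_size * block_size;
    size_t bytes = n * sizeof(intss);
    intss *device_fb = 0;

    cudaError_t err = cudaMalloc((void **)&device_fb, bytes);
    if (err != cudaSuccess) return err;

    // Grid dimension first, block dimension second.
    fill_known<<<grid_size, block_size>>>(device_fb, n);
    err = cudaGetLastError();      // reports launch failures
    if (err == cudaSuccess)        // the blocking copy also waits for the kernel to finish
        err = cudaMemcpy(host_fb, device_fb, bytes, cudaMemcpyDeviceToHost);

    cudaFree(device_fb);
    return err;
}
```

With the question's values (grid_size = 1, block_size = 3), all three threads live in block 0, so the .one fields should read 0, 1, 2 and the .two fields should all be 0.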

Source

2011-11-15 06:21:19 Vlad

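Thanks for the reply! Unfortunately, I swapped the positions of block_size and grid_size in the call to render2(), and the data is still not being read. – albireneo

I don't believe my parameters exceed the 256-byte limit, since most of them are pointers. I could try removing parameters, but I need all of them to perform the necessary computation. – albireneo

Worst case, you can pack them into a struct and pass a pointer to that struct. The one argument I think could push you over 256 is 'camdev', which I see you are passing by value. Either way, you can try setting 'device_fb' to known values to isolate the problem further. – Vlad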
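As a rough sketch of the "pack them into a struct" idea: the by-value arguments are gathered into a single struct, the struct is copied to device memory once, and the kernel receives only a pointer to it. The struct RenderParams, its field names, the kernel render_packed and the helper run_render_packed are all hypothetical, and the vec3 and camera layouts are assumptions based on the members used in load_scene().

```cuda
#include <cuda_runtime.h>

// Assumed stand-ins for the question's types; the real definitions are not shown.
struct vec3 { double x, y, z; };
struct camera { vec3 pos, targ; double fov; };

// All scalar and by-value arguments gathered in one struct (names are illustrative).
struct RenderParams {
    int samples;
    int numOpsPerCore;
    int lnum;
    camera cam;
};

// The kernel receives one pointer instead of many separate arguments.
__global__ void render_packed(double *out, const RenderParams *p)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    out[index] = p->cam.fov * p->samples + index;  // dummy computation using the packed fields
}

// Hypothetical helper: uploads the struct, runs n threads, copies n results back.
cudaError_t run_render_packed(const RenderParams &host_params, double *host_out, int n)
{
    RenderParams *dev_params = 0;
    double *dev_out = 0;

    cudaError_t err = cudaMalloc((void **)&dev_params, sizeof(RenderParams));
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void **)&dev_out, n * sizeof(double));
    if (err != cudaSuccess) { cudaFree(dev_params); return err; }

    // The struct itself lives in host memory, so it must be copied to the device first.
    err = cudaMemcpy(dev_params, &host_params, sizeof(RenderParams), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        render_packed<<<1, n>>>(dev_out, dev_params);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host_out, dev_out, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(dev_out);
    cudaFree(dev_params);
    return err;
}
```

The kernel argument list then stays at two pointers regardless of how large the camera struct is, which sidesteps the parameter-size question entirely.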

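When you post incomplete, uncompilable code without properly describing the actual problem, it is simply impossible to say anything. You will get better answers by asking better questions on StackOverflow.

That said, the most likely problem is not that data isn't being copied to or from the device, but that the kernel itself isn't running. Every CUDA runtime API call returns a status code, and you should check all of them. You can define an error-checking macro like this: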

#include <stdio.h> 

#define cudaErrorCheck(call) { cudaAssert(call,__FILE__,__LINE__) } 

void cudaAssert(const cudaError err, const char *file, const int line) 
{ 
    if(cudaSuccess != err) {             
     fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",   
       file, line, cudaGetErrorString(err)); 
     exit(1); 
    } 
}

and wrap every API call, like this:

cudaErrorCheck(cudaMemcpy(host_fb, device_fb, num_bytes_fb, cudaMemcpyDeviceToHost));

For kernel launches, you can check for launch failures or runtime errors like this:

kernel<<<....>>>(); 
cudaErrorCheck(cudaPeekAtLastError()); // Checks for launch error 
cudaErrorCheck(cudaThreadSynchronize()); // Checks for execution error

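My suggestion is to add thorough error checking to your code, then come back and edit your question with the results you get. Then someone may be able to offer specific advice about what is happening.

Source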

2011-11-15 07:06:17 talonmies

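Thanks for your response, and sorry for the insufficient information. I am trying to wrap my API calls in the ErrorCheck function, but every time I call it I get the following error: cudatrace.cu(337): error: expected a ";" Here is an example of how I wrap a function: cudaErrorCheck(cudaMalloc((void **)&device_pixelspercore, num_bytes_ParallelPixel)); Thanks – albireneo

I think talonmies forgot a ; in the macro. It should be #define cudaErrorCheck(call) { cudaAssert(call, __FILE __, __LINE __); } (there should be no spaces after FILE and LINE; I had to add them because of the formatting) – jmsu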
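Putting the macro from the answer together with the semicolon fix from the comment above, a small self-contained sketch could look like this. The kernel set_value and the helper checked_roundtrip are invented for the example, and cudaDeviceSynchronize is used on the assumption of a toolkit where cudaThreadSynchronize is deprecated (CUDA 4.0 or later).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Prints the failing file and line, then aborts, if a runtime call did not succeed.
void cudaAssert(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",
                file, line, cudaGetErrorString(err));
        exit(1);
    }
}

// Note the semicolon after the cudaAssert call inside the braces.
#define cudaErrorCheck(call) { cudaAssert((call), __FILE__, __LINE__); }

// Trivial kernel used only to have something to launch.
__global__ void set_value(int *p) { *p = 42; }

// Hypothetical helper: every runtime call and the kernel launch are checked.
int checked_roundtrip()
{
    int *dev = 0;
    int host = 0;

    cudaErrorCheck(cudaMalloc((void **)&dev, sizeof(int)));

    set_value<<<1, 1>>>(dev);
    cudaErrorCheck(cudaPeekAtLastError());     // checks for a launch error
    cudaErrorCheck(cudaDeviceSynchronize());   // checks for an execution error

    cudaErrorCheck(cudaMemcpy(&host, dev, sizeof(int), cudaMemcpyDeviceToHost));
    cudaErrorCheck(cudaFree(dev));
    return host;
}
```

If any call fails, the program exits with the file and line of the offending call, which is the kind of message that points at the cudaMemcpy lines in the question.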
