CUDA 5.0: cubin and CUBLAS_device, compute capability 3.5

I am trying to compile a kernel that uses dynamic parallelism to call CUBLAS into a cubin file. When I try to compile the code with the command

nvcc -cubin -m64 -lcudadevrt -lcublas_device -gencode arch=compute_35,code=sm_35 -o test.cubin -c test.cu

I get ptxas fatal : Unresolved extern function 'cublasCreate_v2'

If I add the -rdc=true compile option, the code compiles fine, but when I try to load the module with cuModuleLoad, I get error 500: CUDA_ERROR_NOT_FOUND. From cuda.h:

/** 
* This indicates that a named symbol was not found. Examples of symbols 
* are global/constant variable names, texture names, and surface names. 
*/ 
CUDA_ERROR_NOT_FOUND      = 500,
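
Since the host code in this question prints only the raw integer, a small lookup can make the output easier to read. The following is a minimal sketch, not part of the CUDA API: the helper name cu_result_name is hypothetical, and only a handful of CUresult values from cuda.h are covered.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the CUDA API): maps a few raw
   CUresult values, as defined in cuda.h, to their symbolic names.
   Anything not listed falls through to a generic message. */
static const char *cu_result_name(int code) {
    switch (code) {
    case 0:   return "CUDA_SUCCESS";
    case 1:   return "CUDA_ERROR_INVALID_VALUE";
    case 2:   return "CUDA_ERROR_OUT_OF_MEMORY";
    case 3:   return "CUDA_ERROR_NOT_INITIALIZED";
    case 200: return "CUDA_ERROR_INVALID_IMAGE";
    case 201: return "CUDA_ERROR_INVALID_CONTEXT";
    case 301: return "CUDA_ERROR_FILE_NOT_FOUND";
    case 500: return "CUDA_ERROR_NOT_FOUND";
    default:  return "unknown CUresult (see cuda.h)";
    }
}
```

It could be used in the host code's checks, for example as printf("ERROR: cuModuleLoad, %i (%s)\n", error, cu_result_name(error)).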

Kernel code:

#include <stdio.h> 
#include <cublas_v2.h> 
extern "C" { 
__global__ void a() { 
    cublasHandle_t cb_handle = NULL; 
    cudaStream_t stream; 
    if(threadIdx.x == 0) { 
     cublasStatus_t status = cublasCreate_v2(&cb_handle); 
     cublasSetPointerMode_v2(cb_handle, CUBLAS_POINTER_MODE_HOST); 
     if (status != CUBLAS_STATUS_SUCCESS) { 
      return; 
     } 
     cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking); 
     cublasSetStream_v2(cb_handle, stream); 
    } 
    __syncthreads(); 
    int jp; 
    double A[3]; 
    A[0] = 4.0f; 
    A[1] = 5.0f; 
    A[2] = 6.0f; 
    cublasIdamax_v2(cb_handle, 3, A, 1, &jp); 
} 
}

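Note: A is local in scope, so the data behind the pointer passed to cublasIdamax_v2 is undefined, and jp ends up as a more or less random value in this code. The correct approach is to keep A in global memory.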

Host code:

#include <stdio.h> 
#include <cuda.h> 
#include <cuda_runtime_api.h> 

int main() { 
    CUresult error; 
    CUdevice cuDevice; 
    CUcontext cuContext; 
    CUmodule cuModule; 
    CUfunction testkernel; 
    // Initialize 
    error = cuInit(0); 
    if (error != CUDA_SUCCESS) printf("ERROR: cuInit, %i\n", error); 
    error = cuDeviceGet(&cuDevice, 0); 
    if (error != CUDA_SUCCESS) printf("ERROR: cuDeviceGet, %i\n", error); 
    error = cuCtxCreate(&cuContext, 0, cuDevice); 
    if (error != CUDA_SUCCESS) printf("ERROR: cuCtxCreate, %i\n", error); 
    error = cuModuleLoad(&cuModule, "test.cubin"); 
    if (error != CUDA_SUCCESS) printf("ERROR: cuModuleLoad, %i\n", error); 
    error = cuModuleGetFunction(&testkernel, cuModule, "a"); 
    if (error != CUDA_SUCCESS) printf("ERROR: cuModuleGetFunction, %i\n", error); 
    return 0; 
}

The host code is compiled with nvcc -lcuda test.cpp. If I replace the kernel with a simple one (below) and compile it without -rdc=true, it works fine.

Simple working kernel:

#include <stdio.h> 
extern "C" { 
__global__ void a() { 
    printf("hello\n"); 
} 
}

Thanks in advance,

Soren


2013-03-14 Soren

Is there a reason why you are using the driver API? – KiaMorot 2013-03-14 12:33:53

KiaMorot: I am using pycuda, which uses the driver API. I included the C code to make the problem more transparent. – Soren 2013-03-14 12:36:05

You are just missing -dlink in your first approach:

nvcc -cubin -m64 -lcudadevrt -lcublas_device -gencode arch=compute_35,code=sm_35 -o test.cubin -c test.cu -dlink

You can also do it in two steps:

nvcc -m64 test.cu -gencode arch=compute_35,code=sm_35 -o test.o -dc 
nvcc -dlink test.o -arch sm_35 -lcublas_device -lcudadevrt -cubin -o test.cubin


2013-03-15 01:32:22

Thanks, you made my day :) – Soren 2013-03-15 08:55:19

Does anyone have an explanation for why I need the two-step compilation? – Soren 2013-03-15 09:01:08

Good question, Soren. I updated my answer with a one-step method. – 2013-03-15 18:48:56
