CUDA shared library linking: undefined reference to cudaRegisterLinkedBinary

Goal:

Create a shared library containing my CUDA kernels that has a CUDA-free wrapper/header.
Create a test executable for the shared library.

Problem

The shared library MYLIB.so seems to compile fine (no problem).
Error in linking:

./libMYLIB.so: undefined reference to __cudaRegisterLinkedBinary_39_tmpxft_000018cf_00000000_6_MYLIB_cpp1_ii_74c599a1

Simplified makefile:

libMYlib.so : MYLIB.o 
    g++ -shared -Wl,-soname,libMYLIB.so -o libMYLIB.so MYLIB.o -L/the/cuda/lib/dir -lcudart 


MYLIB.o : MYLIB.cu MYLIB.h 
    nvcc -m64 -arch=sm_20 -dc -Xcompiler '-fPIC' MYLIB.cu -o MYLIB.o -L/the/cuda/lib/dir -lcudart 


test : test.cpp libMYlib.so 
     g++ test.cpp -o test -L. -ldl -Wl,-rpath,. -lMYLIB -L/the/cuda/lib/dir -lcudart

Indeed,

nm libMYLIB.so shows that all CUDA API functions are "undefined symbols":

  U __cudaRegisterFunction 
     U __cudaRegisterLinkedBinary_39_tmpxft_0000598c_00000000_6_CUPA_cpp1_ii_74c599a1 
     U cudaEventRecord 
     U cudaFree 
     U cudaGetDevice 
     U cudaGetDeviceProperties 
     U cudaGetErrorString 
     U cudaLaunch 
     U cudaMalloc 
     U cudaMemcpy

So CUDA somehow did not get linked into the shared library MYLIB.so. What am I missing?

Somehow CUDA did not even get linked into the object file:

nm MYLIB.o

  U __cudaRegisterFunction 
     U __cudaRegisterLinkedBinary_39_tmpxft_0000598c_00000000_6_CUPA_cpp1_ii_74c599a1 
     U cudaEventRecord 
     U cudaFree 
     U cudaGetDevice 
     U cudaGetDeviceProperties 
     U cudaGetErrorString 
     U cudaLaunch 
     U cudaMalloc 
     U cudaMemcpy

(same as above)
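The nm listings can be narrowed down to the part that matters. The following is a minimal sketch, assuming GNU binutils (nm, readelf) are installed; the helper name show_cuda_linkage is hypothetical and not part of any toolkit.

```shell
# Hypothetical helper; assumes GNU binutils (nm, readelf) are on the PATH.
show_cuda_linkage() {
    lib=$1
    # Undefined dynamic symbols that the loader has to resolve at run time.
    nm -D --undefined-only "$lib" | grep -i cuda
    # Shared libraries recorded as dependencies of the library.
    readelf -d "$lib" | grep NEEDED
}
```

Usage would be show_cuda_linkage ./libMYLIB.so. A runtime API function such as cudaMalloc listed as U is expected for a dynamically linked library, as long as libcudart appears among the NEEDED entries. A __cudaRegisterLinkedBinary_... symbol listed as U is a different matter, because libcudart does not provide it.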

Source

2013-06-24 cmo

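There is no static version of the CUDA runtime library, so you should not expect to see runtime library symbols statically included in an object or shared library. Your last two edits/additions are therefore a red herring here. – talonmies

OK, I didn't know that. Good point. – cmo

@talonmies Actually, starting with CUDA Toolkit 5.5 there is also a static version of the CUDA runtime library. – RoBiK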
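For toolkits that ship the static runtime, linking it into the shared library might look like the sketch below. The variables OBJS and CUDA_LIB and the function name are assumptions for illustration; the extra -ldl -lrt -lpthread are the system libraries the static runtime usually depends on.

```shell
# Assumptions: OBJS lists the already-built objects and CUDA_LIB is the
# toolkit's library directory (placeholder path taken from the question).
OBJS="MYLIB.o"
CUDA_LIB=/the/cuda/lib/dir

link_with_static_cudart() {
    # $OBJS is deliberately unquoted so that several objects split into words.
    g++ -shared -Wl,-soname,libMYLIB.so -o libMYLIB.so $OBJS \
        -L"$CUDA_LIB" -lcudart_static -ldl -lrt -lpthread
}
```

With a static runtime, nm would be expected to show the runtime API functions as defined rather than U. This choice does not by itself resolve a missing __cudaRegisterLinkedBinary_... symbol, since that symbol does not come from the runtime library.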

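Here's an example of creating a Linux shared object along the lines you indicated:

Create a shared library containing my CUDA kernels that has a CUDA-free wrapper/header.
Create a test executable for the shared library.

First, the shared library. The build commands for this are as follows: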

nvcc -arch=sm_20 -Xcompiler '-fPIC' -dc test1.cu test2.cu 
nvcc -arch=sm_20 -Xcompiler '-fPIC' -dlink test1.o test2.o -o link.o 
g++ -shared -o test.so test1.o test2.o link.o -L/usr/local/cuda/lib64 -lcudart

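It seems you may be missing the second step above (the -dlink device link) in your makefile, but I haven't analyzed whether there are any other issues with your makefile.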
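Applied to the file names from the question, the three-step sequence might look like the sketch below. The function name build_mylib and the object name MYLIB_link.o are hypothetical, and CUDA_LIB keeps the placeholder path from the question.

```shell
# Assumptions: MYLIB.cu is in the current directory; MYLIB_link.o is a
# hypothetical name for the device-link output.
ARCH=sm_20
CUDA_LIB=/the/cuda/lib/dir

build_mylib() {
    # 1. Compile to an object containing relocatable device code.
    nvcc -m64 -arch="$ARCH" -Xcompiler -fPIC -dc MYLIB.cu -o MYLIB.o &&
    # 2. Device link: this object defines the __cudaRegisterLinkedBinary_* symbol.
    nvcc -m64 -arch="$ARCH" -Xcompiler -fPIC -dlink MYLIB.o -o MYLIB_link.o &&
    # 3. Host link: both objects go into the shared library.
    g++ -shared -Wl,-soname,libMYLIB.so -o libMYLIB.so MYLIB.o MYLIB_link.o \
        -L"$CUDA_LIB" -lcudart
}
```

Calling build_mylib runs the steps in order and stops at the first failure. The -L and -lcudart options are only needed in the final host link, not in the -dc compile step.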

Now, for the test executable, the build commands are as follows:

g++ -c main.cpp 
g++ -o testmain main.o test.so

To run it, simply execute the testmain executable, but be sure the test.so library is on your LD_LIBRARY_PATH.
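One way to do that without changing the environment permanently is sketched below. It assumes testmain and test.so sit in the current directory and that the CUDA runtime lives in /usr/local/cuda/lib64, as in the link command above; the function name run_testmain is hypothetical.

```shell
# Assumption: testmain and test.so are in the current directory.
run_testmain() {
    # Prepend the current directory and the CUDA library directory for this
    # one invocation only; any existing LD_LIBRARY_PATH is kept at the end.
    LD_LIBRARY_PATH="$PWD:/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ./testmain
}
```

Because the assignment prefixes a single command, the calling shell's own LD_LIBRARY_PATH is left untouched.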

These are the files I used for test purposes:

test1.h:

int my_test_func1();

test1.cu:

#include <stdio.h>
#include <stdlib.h>  // malloc, exit
#include "test1.h"

#define DSIZE 1024 
#define DVAL 10 
#define nTPB 256 

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void my_kernel1(int *data){ 
    int idx = threadIdx.x + (blockDim.x *blockIdx.x); 
    if (idx < DSIZE) data[idx] += DVAL;
} 

int my_test_func1(){ 

    int *d_data, *h_data; 
    h_data = (int *) malloc(DSIZE * sizeof(int)); 
    if (h_data == 0) {printf("malloc fail\n"); exit(1);} 
    cudaMalloc((void **)&d_data, DSIZE * sizeof(int)); 
    cudaCheckErrors("cudaMalloc fail"); 
    for (int i = 0; i < DSIZE; i++) h_data[i] = 0; 
    cudaMemcpy(d_data, h_data, DSIZE * sizeof(int), cudaMemcpyHostToDevice); 
    cudaCheckErrors("cudaMemcpy fail"); 
    my_kernel1<<<((DSIZE+nTPB-1)/nTPB), nTPB>>>(d_data); 
    cudaDeviceSynchronize(); 
    cudaCheckErrors("kernel"); 
    cudaMemcpy(h_data, d_data, DSIZE * sizeof(int), cudaMemcpyDeviceToHost); 
    cudaCheckErrors("cudaMemcpy 2"); 
    for (int i = 0; i < DSIZE; i++)
        if (h_data[i] != DVAL) {printf("Results check failed at offset %d, data was: %d, should be %d\n", i, h_data[i], DVAL); exit(1);}
    printf("Results check 1 passed!\n"); 
    return 0; 
}

test2.h:

int my_test_func2();

test2.cu:

#include <stdio.h>
#include <stdlib.h>  // malloc, exit
#include "test2.h"

#define DSIZE 1024 
#define DVAL 20 
#define nTPB 256 

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void my_kernel2(int *data){ 
    int idx = threadIdx.x + (blockDim.x *blockIdx.x); 
    if (idx < DSIZE) data[idx] += DVAL;
} 

int my_test_func2(){ 

    int *d_data, *h_data; 
    h_data = (int *) malloc(DSIZE * sizeof(int)); 
    if (h_data == 0) {printf("malloc fail\n"); exit(1);} 
    cudaMalloc((void **)&d_data, DSIZE * sizeof(int)); 
    cudaCheckErrors("cudaMalloc fail"); 
    for (int i = 0; i < DSIZE; i++) h_data[i] = 0; 
    cudaMemcpy(d_data, h_data, DSIZE * sizeof(int), cudaMemcpyHostToDevice); 
    cudaCheckErrors("cudaMemcpy fail"); 
    my_kernel2<<<((DSIZE+nTPB-1)/nTPB), nTPB>>>(d_data); 
    cudaDeviceSynchronize(); 
    cudaCheckErrors("kernel"); 
    cudaMemcpy(h_data, d_data, DSIZE * sizeof(int), cudaMemcpyDeviceToHost); 
    cudaCheckErrors("cudaMemcpy 2"); 
    for (int i = 0; i < DSIZE; i++)
        if (h_data[i] != DVAL) {printf("Results check failed at offset %d, data was: %d, should be %d\n", i, h_data[i], DVAL); exit(1);}
    printf("Results check 2 passed!\n"); 
    return 0; 
}

main.cpp:

#include <stdio.h> 

#include "test1.h" 
#include "test2.h" 

int main(){ 

    my_test_func1(); 
    my_test_func2(); 
    return 0; 
}

When I compile with the commands given above and run ./testmain, I get:

$ ./testmain 
Results check 1 passed! 
Results check 2 passed!

Note that, if you prefer, you can generate libtest.so instead of test.so, and then use a modified build sequence for the test executable:

g++ -c main.cpp 
g++ -o testmain main.o -L. -ltest

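I don't think it makes any difference, but it may be more familiar syntax.

I'm sure there is more than one way to accomplish this. This is just an example. You may also wish to review the relevant section of the nvcc manual and the examples.

EDIT: I tested this under CUDA 5.5 RC, and the final application link step complained about not finding the cudart lib (warning: libcudart.so.5.5., needed by ./libtest.so, not found). However, the following relatively simple modification (example Makefile) should work under either CUDA 5.0 or CUDA 5.5.

Makefile: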

testmain : main.cpp libtest.so 
     g++ -c main.cpp 
     g++ -o testmain main.o -L. -ldl -Wl,-rpath,. -ltest -L/usr/local/cuda/lib64 -lcudart

libtest.so : link.o 
     g++ -shared -Wl,-soname,libtest.so -o libtest.so test1.o test2.o link.o -L/usr/local/cuda/lib64 -lcudart 

link.o : test1.cu test2.cu test1.h test2.h 
     nvcc -m64 -arch=sm_20 -dc -Xcompiler '-fPIC' test1.cu test2.cu 
     nvcc -m64 -arch=sm_20 -Xcompiler '-fPIC' -dlink test1.o test2.o -o link.o 

clean : 
     rm -f testmain test1.o test2.o link.o libtest.so main.o

Source

2013-06-25 01:44:24

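The problem persists. Following your example, everything compiles smoothly up until the last step, creating the test executable, at which point the __cudaRegisterLinkedBinary_39_tmpxft ... error is thrown, as before. – cmo

I'm not sure what the issue might be. It seems to work perfectly for me. Did you follow my steps exactly, using my files? Are you using CUDA 5.0? –

@MatthewParks I have the same problem with __cudaRegisterLinkedBinary_39_tmpxft ... Did you solve it? –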

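Have you tried explicitly disabling relocatable device code, i.e. -rdc=false? I got this undefined reference to __cudaRegisterLinkedBinaryWhatever with -rdc=true, and when I removed it, the error went away. I'm not enough of an expert to explain exactly what is going on, though.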
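A likely explanation is that -dc is shorthand for -rdc=true -c: with relocatable device code, each host object refers to a __cudaRegisterLinkedBinary_... symbol that only a device-link step (nvcc -dlink) defines. Compiling with plain -c avoids that reference. The sketch below reuses the test1.cu/test2.cu files from the earlier answer; the function name build_without_rdc is hypothetical.

```shell
# Assumptions: test1.cu and test2.cu are in the current directory and the
# CUDA runtime is installed under /usr/local/cuda/lib64.
ARCH=sm_20
CUDA_LIB=/usr/local/cuda/lib64

build_without_rdc() {
    # Plain -c: each .cu becomes a complete object, so no device link is needed.
    nvcc -m64 -arch="$ARCH" -Xcompiler -fPIC -c test1.cu test2.cu &&
    g++ -shared -Wl,-soname,libtest.so -o libtest.so test1.o test2.o \
        -L"$CUDA_LIB" -lcudart
}
```

The trade-off is that device code in one .cu file can then no longer call device functions defined in another .cu file, which is the case relocatable device code exists for.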

Source

2015-11-24 20:01:08 einpoklum
