Torch code producing a CUDA runtime error

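A friend of mine implemented a sparse version of torch.bmm that actually works, but when I run a test I get a runtime error (unrelated to this implementation) that I don't understand. I have seen several topics about this error but could not find a solution. Here is the code, followed by the error: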

if __name__ == "__main__": 
    tmp = torch.zeros(1).cuda() 
    batch_csr = BatchCSR() 
    sparse_bmm = SparseBMM() 

    i=torch.LongTensor([[0,5,8], [1,5,8], [2,5,8]]) 
    v=torch.FloatTensor([4,3,8]) 
    s=torch.Size([3,500,500]) 

    indices, values, size = i,v,s 

    a_ = torch.sparse.FloatTensor(indices, values, size).cuda().transpose(2, 1) 
    batch_size, num_nodes, num_faces = a_.size() 

    a = a_.to_dense() 

    for _ in range(10):
        b = torch.randn(batch_size, num_faces, 16).cuda()
        torch.cuda.synchronize()
        time1 = time.time()
        result = torch.bmm(a, b)
        torch.cuda.synchronize()
        time2 = time.time()
        print("{} CuBlas dense bmm".format(time2 - time1))

        torch.cuda.synchronize()
        time1 = time.time()
        col_ind, col_ptr = batch_csr(a_.indices(), a_.size())
        my_result = sparse_bmm(a_.values(), col_ind, col_ptr, a_.size(), b)
        torch.cuda.synchronize()
        time2 = time.time()
        print("{} My sparse bmm".format(time2 - time1))

        print("{} Diff".format((result-my_result).abs().max()))

And the error:

Traceback (most recent call last): 
    File "sparse_bmm.py", line 72, in <module> 
    b = torch.randn(3, 500, 16).cuda() 
    File "/home/bizeul/virtual_env/lib/python2.7/site-packages/torch/_utils.py", line 65, in _cuda 
    return new_type(self.size()).copy_(self, async) 
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorCopy.c:18

When running with CUDA_LAUNCH_BLOCKING=1, I get this error:

/b/wheel/pytorch-src/torch/lib/THC/THCTensorIndex.cu:121: void indexAddSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 1, SrcDim = 1, IdxDim = -2]: block: [0,0,0], thread: [0,0,0] Assertion `dstIndex < dstAddDimSize` failed. 
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THCS/generic/THCSTensorMath.cu line=292 error=59 : device-side assert triggered 
Traceback (most recent call last): 
    File "sparse_bmm.py", line 69, in <module> 
    a = a_.to_dense() 
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THCS/generic/THCSTensorMath.cu:292

2017-06-20 Gericault

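Well, CUDA is asynchronous by nature, so the assertion that is triggered does not come with a useful stack trace. Try running your script like this in your terminal: 'CUDA_LAUNCH_BLOCKING=1 python your_script.py' and update your question – entrophy

Thanks, I edited my post – Gericault

So, what *exactly* is your question? – talonmies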
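The suggestion in the first comment can also be applied from Python. The following is a minimal sketch, assuming the script is launched as a child process; the helper name `run_with_blocking_launches` is hypothetical and not part of PyTorch. It only sets the `CUDA_LAUNCH_BLOCKING` environment variable for the child, which is what the shell command in the comment does. Note that the shell form takes no spaces around `=`.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(script_args):
    """Run the Python interpreter on script_args with CUDA_LAUNCH_BLOCKING=1.

    Hypothetical helper: it copies the current environment, adds the
    variable, and starts a child interpreter so the variable is in place
    before any CUDA context is created in that process.
    """
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(
        [sys.executable] + list(script_args),
        env=env,
        capture_output=True,
        text=True,
    )


# Example (assumes a file named sparse_bmm.py exists in the working directory):
# result = run_with_blocking_launches(["sparse_bmm.py"])
# print(result.stderr)
```

With blocking launches the Python traceback should point at the call that actually failed, as it does in the second traceback of the question (`a_.to_dense()` rather than the later `torch.randn(...).cuda()`).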

The indices you are passing to create the sparse tensor are incorrect.

Here is what they should be:

i = torch.LongTensor([[0, 1, 2], [5, 5, 5], [8, 8, 8]])

How to create a sparse tensor:

Let's take a simpler example. Say we want the following tensor:

  0   0   0   2   0
  0   0   0   0   0
  0   0   0   0  20
[torch.cuda.FloatTensor of size 3x5 (GPU 0)]

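As you can see, the number 2 needs to be at position (0, 3) of the sparse tensor, and the number 20 at position (2, 4).

To create this, the index tensor needs one row per dimension: the first row holds the row coordinates and the second row holds the column coordinates, so it should look like this: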

[[0 , 2], 
[3 , 4]]

And now the code to create the above sparse tensor:

import torch

i = torch.LongTensor([[0, 2], [3, 4]])  # row 0: row coordinates, row 1: column coordinates
v = torch.FloatTensor([2, 20])          # one value per column of i
s = torch.Size([3, 5])
a_ = torch.sparse.FloatTensor(i, v, s).cuda()
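To make the index layout concrete, here is a small sketch in plain Python (no torch required). The helper names `coords_to_indices` and `to_dense_2d` are made up for illustration. The first one turns a list of per-entry coordinates into the "one row per dimension" layout described above; the second rebuilds the dense 2-D matrix so the result can be checked by eye.

```python
def coords_to_indices(coords):
    """Transpose per-entry coordinates into one list per dimension."""
    return [list(dim) for dim in zip(*coords)]


def to_dense_2d(indices, values, size):
    """Rebuild a dense 2-D list of lists from COO-style indices and values."""
    rows, cols = size
    dense = [[0] * cols for _ in range(rows)]
    for r, c, v in zip(indices[0], indices[1], values):
        dense[r][c] += v  # duplicate coordinates are summed
    return dense


# The 3x5 example: 2 at (0, 3) and 20 at (2, 4).
indices = coords_to_indices([(0, 3), (2, 4)])
print(indices)  # [[0, 2], [3, 4]]
for row in to_dense_2d(indices, [2, 20], (3, 5)):
    print(row)

# The question's intended entries (0, 5, 8), (1, 5, 8), (2, 5, 8).
print(coords_to_indices([(0, 5, 8), (1, 5, 8), (2, 5, 8)]))
# [[0, 1, 2], [5, 5, 5], [8, 8, 8]]
```

The last call reproduces the corrected index tensor given at the top of this answer: the question listed one entry per row, whereas the constructor expects one dimension per row.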

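More comments about the assertion error raised by CUDA:

Assertion `dstIndex < dstAddDimSize` failed. tells us that it is highly likely you have an index out of bounds. So whenever you see this, look for places where you might have supplied a wrong index to a tensor.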
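One way to look for such a wrong index before anything reaches the GPU is to compare every index against the size of its dimension on the host. The following is a sketch in plain Python; `find_out_of_bounds` is a hypothetical helper, not a PyTorch function, and it assumes the indices are given as one list per dimension.

```python
def find_out_of_bounds(indices, size):
    """Return (dim, position, index, limit) for every index outside [0, limit)."""
    problems = []
    for dim, (row, limit) in enumerate(zip(indices, size)):
        for position, index in enumerate(row):
            if not 0 <= index < limit:
                problems.append((dim, position, index, limit))
    return problems


size = (3, 500, 500)
question_indices = [[0, 5, 8], [1, 5, 8], [2, 5, 8]]
answer_indices = [[0, 1, 2], [5, 5, 5], [8, 8, 8]]

print(find_out_of_bounds(question_indices, size))  # [(0, 1, 5, 3), (0, 2, 8, 3)]
print(find_out_of_bounds(answer_indices, size))    # []
```

For the indices in the question, the first dimension has size 3 but receives the indices 5 and 8, which is consistent with the failed assertion. The corrected indices pass the check.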

2017-06-20 20:47:23 entrophy

Edit: my bad, I got it! – Gericault
