將矩陣複製到主機所需的時間會增加矩陣的使用次數

我正在使用PyCUDA，CUDAMat和Numba對GPU矩陣乘法進行基準測試，並遇到了一些行爲，我無法找到解釋方法。
我計算了3個不同步驟獨立需要的時間 - 將2個矩陣發送到設備存儲器，計算點積，並將結果複製回主機存儲器。
點積步驟的基準測試在一個循環中完成，因爲我的應用程序在發送結果之前將進行多次乘法運算。將矩陣複製到主機所需的時間會增加矩陣的使用次數

隨着我增加循環次數，點積時間線性增加，就像預期一樣。但我無法理解的部分是，將最終結果發送回主機內存所需的時間也隨循環次數線性增加，即使它只是將一個矩陣複製回主機內存。無論你做多少個矩陣乘法循環，結果的大小都是恆定的，所以這沒有意義。它的行爲就好像返回最終結果需要返回循環中每個步驟的所有中間結果。

一些有趣的事情要注意的是，它所花費的時間增加有一個高峯。當我在循環中超過〜1000點產品時，複製最終結果所需的時間達到峯值。另一件事是，如果在點積循環內，我重新初始化包含結果的矩陣，則無論執行多少次乘法，此行爲都會停止，並且複製返回時間相同。
例如 -

for i in range(1000): 
    gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32) 
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1)) 
result = gc.get()

最後一點要注意的是，這種情況兩個PyCUDA和Numba，但不與CUDAMat發生。我可以做一百萬次乘法，並且檢索最終結果仍然需要相同的時間。 CUDAMat有一個內置的矩陣乘法，這可能是爲什麼，但是對於PyCUDA和Numba，我使用在他們自己的文檔中提供的矩陣乘法代碼。

這裏是我的PyCUDA

代碼

from __future__ import division 
import numpy as np 
from pycuda import driver, compiler, gpuarray, tools 
import time 
import pycuda.autoinit 

kernel_code_template = """ 
__global__ void MatrixMulKernel(float *A, float *B, float *C) 
{ 

    const int wA = %(MATRIX_SIZE)s; 
    const int wB = %(MATRIX_SIZE)s; 

    // Block index 
    const int bx = blockIdx.x; 
    const int by = blockIdx.y; 

    // Thread index 
    const int tx = threadIdx.x; 
    const int ty = threadIdx.y; 

    // Index of the first sub-matrix of A processed by the block 
    const int aBegin = wA * %(BLOCK_SIZE)s * by; 
    // Index of the last sub-matrix of A processed by the block 
    const int aEnd = aBegin + wA - 1; 
    // Step size used to iterate through the sub-matrices of A 
    const int aStep = %(BLOCK_SIZE)s; 

    // Index of the first sub-matrix of B processed by the block 
    const int bBegin = %(BLOCK_SIZE)s * bx; 
    // Step size used to iterate through the sub-matrices of B 
    const int bStep = %(BLOCK_SIZE)s * wB; 

    // The element of the block sub-matrix that is computed 
    // by the thread 
    float Csub = 0; 
    // Loop over all the sub-matrices of A and B required to 
    // compute the block sub-matrix 
    for (int a = aBegin, b = bBegin; 
     a <= aEnd; 
     a += aStep, b += bStep) 
    { 
     // Shared memory for the sub-matrix of A 
     __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s]; 
     // Shared memory for the sub-matrix of B 
     __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s]; 

     // Load the matrices from global memory to shared memory 
     // each thread loads one element of each matrix 
     As[ty][tx] = A[a + wA * ty + tx]; 
     Bs[ty][tx] = B[b + wB * ty + tx]; 
     // Synchronize to make sure the matrices are loaded 
     __syncthreads(); 

     // Multiply the two matrices together; 
     // each thread computes one element 
     // of the block sub-matrix 
     for (int k = 0; k < %(BLOCK_SIZE)s; ++k) 
     Csub += As[ty][k] * Bs[k][tx]; 

     // Synchronize to make sure that the preceding 
     // computation is done before loading two new 
     // sub-matrices of A and B in the next iteration 
     __syncthreads(); 
    } 

    // Write the block sub-matrix to global memory; 
    // each thread writes one element 
    const int c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx; 
    C[c + wB * ty + tx] = Csub; 
} 
""" 


MATRIX_SIZE = 512 
TILE_SIZE = 8 
BLOCK_SIZE = TILE_SIZE 
np.random.seed(100) 
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32) 
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32) 

kernel_code = kernel_code_template % { 
    'MATRIX_SIZE': MATRIX_SIZE, 
    'BLOCK_SIZE': BLOCK_SIZE, 
} 
mod = compiler.SourceModule(kernel_code) 
matrixmul = mod.get_function("MatrixMulKernel") 


#copy to device memory 
total = time.clock() 
ga = gpuarray.to_gpu(a_cpu) 
gb = gpuarray.to_gpu(b_cpu) 
gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32) 
copy_to = time.clock() - total 

#matrix multiplication 
mult = time.clock() 
for i in range(1000): 
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1)) 
mult = time.clock() - mult 

#copy result back to host memory 
copy_from = time.clock() 
res = gc.get() 
copy_from = time.clock() - copy_from 
total = time.clock() - total 

#print out times for all 3 steps and the total time taken 
print(copy_to) 
print(mult) 
print(copy_from) 
print(total)

來源

2017-09-01 Frobot

我曾考慮類似的東西，但不知道如何搜索它。這工作很好。如果你想發佈一個答案，我會接受它 – Frobot

GPU內核啓動是異步。這意味着您認爲您在for循環（執行乘法所花費的時間）周圍進行的測量不是那麼簡單。這只是將內核啓動發佈到隊列所需的時間。

實際的內核執行時間被「吸收」到設備 - >主機拷貝時間的最終測量中（因爲D-> H拷貝在所有內核開始之前強制完成並且阻塞CPU線程）。

關於「峯值」行爲，當您向隊列中啓動足夠的內核時，最終會停止異步並開始阻塞CPU線程，因此您的「執行時間」度量開始上升。這解釋了變化的峯值行爲。

「修理」這一點，如果你後立即插入pycuda driver.Context.synchronize()您的for循環，而在此之前行：

mult = time.clock() - mult

，你會看到你的執行時間增加爲你增加的循環，您的D-> H複印時間將保持不變。

來源

2017-09-01 00:51:18

和任何人使用Numba與同樣的問題，你可以打電話給numba.cuda.synchronize（） – Frobot

將矩陣複製到主機所需的時間會增加矩陣的使用次數

回答

相關問題