I am benchmarking GPU matrix multiplication with PyCUDA, CUDAMat, and Numba, and I've run into some behavior I can't find an explanation for.
I time three steps independently - sending the two matrices to device memory, computing the dot product, and copying the result back to host memory.
The dot-product step is benchmarked inside a loop, since my application will perform many multiplications before sending the result back. The time needed to copy the result matrix back to the host grows with the number of times the matrix is used.
As I increase the loop count, the dot-product time grows linearly, as expected. The part I can't understand is that the time needed to send the final result back to host memory also grows linearly with the loop count, even though it is only copying one matrix back to host memory. The result is the same size no matter how many matrix multiplication loops you run, so this makes no sense. It behaves as if returning the final result requires returning all of the intermediate results from every step in the loop.
Some interesting things to note: the increase in copy time has a ceiling - once I go past ~1000 dot products in the loop, the time needed to copy back the final result peaks and stops growing. Also, if I re-initialize the matrix holding the result inside the dot-product loop, this behavior stops, and the copy-back time is the same no matter how many multiplications I run.
For example -
for i in range(1000):
    gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
result = gc.get()
One last thing to note: this happens with both PyCUDA and Numba, but not with CUDAMat. With CUDAMat I can do a million multiplications and retrieving the final result still takes the same amount of time. CUDAMat has a built-in matrix multiplication, which may be why, but for PyCUDA and Numba I am using the matrix multiplication code provided in their own documentation.
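One hypothesis I've been considering (just a guess on my part, not something I've confirmed) is that kernel launches are asynchronous: the call to `matrixmul` returns as soon as the kernel is queued, and `gc.get()` blocks until every queued kernel has finished, so the "copy back" timer absorbs all of the accumulated kernel time. Here is a toy pure-Python sketch of that kind of measurement artifact, with a worker thread standing in for the GPU's work queue (none of the names below are PyCUDA APIs):

```python
import threading
import time
from queue import Queue

# A worker thread draining a queue of pending jobs stands in for the
# GPU's asynchronous launch queue (an analogy only, not real CUDA).
pending = Queue()

def worker():
    while True:
        job = pending.get()
        if job is None:
            break
        time.sleep(0.001)  # pretend each "kernel" takes 1 ms
        pending.task_done()

t = threading.Thread(target=worker)
t.start()

def launch_kernel():
    pending.put(object())  # returns immediately, like an async kernel launch

def get_result():
    pending.join()         # blocks until all queued work finishes, like gc.get()
    return 42

start = time.perf_counter()
for _ in range(100):
    launch_kernel()
launch_time = time.perf_counter() - start  # tiny: launches only enqueue work

start = time.perf_counter()
result = get_result()
copy_time = time.perf_counter() - start    # large: absorbs all 100 "kernels"

pending.put(None)
t.join()
print(launch_time < copy_time)  # the "copy" timer scales with the loop count
```

If this is what's happening, the "multiplication" timer would measure only the cost of queuing the launches, and the real kernel time would show up in whichever call synchronizes next.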
Here is my PyCUDA code:
from __future__ import division
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import time
import pycuda.autoinit
kernel_code_template = """
__global__ void MatrixMulKernel(float *A, float *B, float *C)
{
    const int wA = %(MATRIX_SIZE)s;
    const int wB = %(MATRIX_SIZE)s;

    // Block index
    const int bx = blockIdx.x;
    const int by = blockIdx.y;

    // Thread index
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    const int aBegin = wA * %(BLOCK_SIZE)s * by;
    // Index of the last sub-matrix of A processed by the block
    const int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    const int aStep = %(BLOCK_SIZE)s;

    // Index of the first sub-matrix of B processed by the block
    const int bBegin = %(BLOCK_SIZE)s * bx;
    // Step size used to iterate through the sub-matrices of B
    const int bStep = %(BLOCK_SIZE)s * wB;

    // The element of the block sub-matrix that is computed
    // by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep)
    {
        // Shared memory for the sub-matrix of A
        __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
        // Shared memory for the sub-matrix of B
        __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    const int c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
    C[c + wB * ty + tx] = Csub;
}
"""
MATRIX_SIZE = 512
TILE_SIZE = 8
BLOCK_SIZE = TILE_SIZE
np.random.seed(100)
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE,
    'BLOCK_SIZE': BLOCK_SIZE,
}
mod = compiler.SourceModule(kernel_code)
matrixmul = mod.get_function("MatrixMulKernel")
#copy to device memory
total = time.perf_counter()
ga = gpuarray.to_gpu(a_cpu)
gb = gpuarray.to_gpu(b_cpu)
gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
copy_to = time.perf_counter() - total
#matrix multiplication
mult = time.perf_counter()
for i in range(1000):
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
mult = time.perf_counter() - mult
#copy result back to host memory
copy_from = time.perf_counter()
res = gc.get()
copy_from = time.perf_counter() - copy_from
total = time.perf_counter() - total
#print out times for all 3 steps and the total time taken
print(copy_to)
print(mult)
print(copy_from)
print(total)
I had thought about something like this, but didn't know how to search for it. This works great. If you want to post an answer, I'll accept it – Frobot