2016-12-22 32 views

Problem: I compute two output arrays with a CUDA kernel in MATLAB, but instead of two correct results I get:

  1. One correctly computed output
  2. Random numbers, stale values, or numbers from the other array in the second output

I am using MATLAB R2016b with the following CUDA version and GPU:

CUDADevice with properties: 

        Name: 'GeForce GT 525M' 
       Index: 1 
    ComputeCapability: '2.1' 
     SupportsDouble: 1 
     DriverVersion: 8 
     ToolkitVersion: 7.5000 
    MaxThreadsPerBlock: 1024 
     MaxShmemPerBlock: 49152 
    MaxThreadBlockSize: [1024 1024 64] 
      MaxGridSize: [65535 65535 65535] 
      SIMDWidth: 32 
      TotalMemory: 1.0737e+09 
     AvailableMemory: 947929088 
    MultiprocessorCount: 2 
      ClockRateKHz: 1200000 
      ComputeMode: 'Default' 
    GPUOverlapsTransfers: 1 
KernelExecutionTimeout: 1 
     CanMapHostMemory: 1 
     DeviceSupported: 1 
     DeviceSelected: 1 

I now try to add and subtract two arrays on the GPU and return both results to MATLAB.

MATLAB code:

n = 10; 
as = [1,1,1]; 
bs = [10,10,10]; 

for i = 2:n+1 
    as(end+1,:) = [i,i,i]; 
    bs(end+1,:) = [10,10,10]; 
end 
as = as *1; 

% Load the kernel 
cudaFilename = 'add2.cu'; 
ptxFilename = ['add2.ptx']; 

% Check that the files are available 
if ~(exist(cudaFilename, 'file') == 2 && exist(ptxFilename, 'file') == 2) 
    error('CUDA FILES ARE NOT HERE'); 
end 
kernel = parallel.gpu.CUDAKernel(ptxFilename, cudaFilename); 

% Make sure we have sufficient blocks to cover all of the locations 
kernel.ThreadBlockSize = [kernel.MaxThreadsPerBlock,1,1]; 
kernel.GridSize = [ceil(n/kernel.MaxThreadsPerBlock),1]; 

% Call the kernel 
outadd = zeros(n,1, 'single'); 
outminus = zeros(n,1, 'single'); 
[outadd, outminus] = feval(kernel, outadd,outminus, as, bs); 

CUDA snippet:

#include "cuda_runtime.h" 
#include "add_wrapper.hpp" 
#include <stdio.h> 

__device__ size_t calculateGlobalIndex() { 
    // Which block are we? 
    size_t const globalBlockIndex = blockIdx.x + blockIdx.y * gridDim.x; 
    // Which thread are we within the block? 
    size_t const localThreadIdx = threadIdx.x + blockDim.x * threadIdx.y; 
    // How big is each block? 
    size_t const threadsPerBlock = blockDim.x*blockDim.y; 
    // Which thread are we overall? 
    return localThreadIdx + globalBlockIndex*threadsPerBlock; 
} 

__global__ void addKernel(float *c, float *d, const float *a, const float *b) 
{ 
    int i = calculateGlobalIndex(); 
    c[i] = a[i] + b[i]; 
    d[i] = a[i] - b[i]; 
} 

// C = A + B 
// D = A - B 
void addWithCUDA(float *cpuC,float *cpuD, const float *cpuA, const float *cpuB, const size_t sz) 
{ 
//TODO: add error checking 

// choose which GPU to run on 
cudaSetDevice(0); 

// allocate GPU buffers 
float *gpuA, *gpuB, *gpuC, *gpuD; 
cudaMalloc((void**)&gpuA, sz*sizeof(float)); 
cudaMalloc((void**)&gpuB, sz*sizeof(float)); 
cudaMalloc((void**)&gpuC, sz*sizeof(float)); 
cudaMalloc((void**)&gpuD, sz*sizeof(float)); 
cudaCheckErrors("cudaMalloc fail"); 

// copy input vectors from host memory to GPU buffers 
cudaMemcpy(gpuA, cpuA, sz*sizeof(float), cudaMemcpyHostToDevice); 
cudaMemcpy(gpuB, cpuB, sz*sizeof(float), cudaMemcpyHostToDevice); 

// launch kernel on the GPU with one thread per element 
addKernel<<<1,sz>>>(gpuC, gpuD, gpuA, gpuB); 

// wait for the kernel to finish 
cudaDeviceSynchronize(); 

// copy output vector from GPU buffer to host memory 
cudaMemcpy(cpuC, gpuC, sz*sizeof(float), cudaMemcpyDeviceToHost); 
cudaMemcpy(cpuD, gpuD, sz*sizeof(float), cudaMemcpyDeviceToHost); 


// cleanup 
cudaFree(gpuA); 
cudaFree(gpuB); 
cudaFree(gpuC); 
cudaFree(gpuD); 
} 

void resetDevice() 
{ 
    cudaDeviceReset(); 
} 

After running the code, [outadd, outminus] are two gpuArray objects in MATLAB.

outadd is always computed correctly; outminus is rarely correct and mostly contains random integers or floats, zeros, or sometimes even values from outadd.

If I swap the order of the two arithmetic operations, the subtraction works instead, so shouldn't "outminus" be computed correctly as well?


Welcome to Stack Overflow. You seem to have forgotten to ask a question. Questions are indicated by a question mark (?) and can receive answers. Please [edit] your post to include one, as it otherwise looks quite good! – Adriaan


'kernel.MaxThreadsPerBlock' is 1024. Since 'n' is 10, your kernel launches one block of 1024 threads even though you only need 10 of them. The extra threads may access your arrays out of bounds, so you should pass 'n' as a scalar argument to the kernel and test 'i' against 'n' inside the kernel. You may want to study [this MATLAB example](https://www.mathworks.com/help/distcomp/examples/illustrating-three-approaches-to-gpu-computing-the-mandelbrot-set.html). –


@Robert Crovella I think I'll keep this setup and just limit the threads as you suggest. Thanks! – Jeahinator

Answer


Following @Robert Crovella's hint that the unnecessary threads might cause access errors, I simply added a bound on the thread index.

MATLAB

[outadd, outminus] = feval(kernel, outadd,outminus, as, bs, n); 

CUDA kernel:

__global__ void addKernel(float *c, float *d, const float *a, const float *b, const float n) 
{ 
    int i = calculateGlobalIndex(); 
    if (i < n){ 
     c[i] = a[i] + b[i]; 
     d[i] = a[i] - b[i]; 
    } 
} 

I still don't think this is the optimal solution, since the GPU still launches all 1024 threads even though most of them do no work.

Once I have reworked it properly, I will upload it here.