Prefix scan for large arrays

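I want to write a prefix scan for large arrays using the instructions in GPU Gems; it is homework for my parallel programming class. I followed all the steps in the book, but I still cannot get the code to work. I got it working for an array of size 4096, but it does not work for larger arrays. Here is my code: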

#include <stdio.h> 
#include <sys/time.h> 
#define THREADS 1024 
typedef int mytype; 

__global__ void phaseI(mytype *g_odata, mytype *g_idata, int n, mytype *aux) 
{ 
    __shared__ mytype temp[THREADS]; 
    const int tid1 = threadIdx.x; 
    int offset = 1; 
    temp[2*tid1] = g_idata[2*tid1]; // load input into shared memory 
    temp[2*tid1+1] = g_idata[2*tid1+1]; 
    for (int d = THREADS>>1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (tid1 < d)
        {
            int ai = offset*(2*tid1+1)-1;
            int bi = offset*(2*tid1+2)-1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    __syncthreads();
    if (tid1 == 0) {
        aux[blockIdx.x] = temp[THREADS - 1];
        temp[THREADS - 1] = 0;
    }
    for (int d = 1; d < THREADS; d *= 2) // traverse down tree & build scan
    {
        offset >>= 1;
        __syncthreads();
        if (tid1 < d)
        {
            int ai = offset*(2*tid1+1)-1;
            int bi = offset*(2*tid1+2)-1;
            mytype t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    g_odata[2*thid] = temp[2*thid]; // write results to device memory
    g_odata[2*thid+1] = temp[2*thid+1];
}

__global__ void phaseII(mytype *g_odata, mytype *aux, int n) 
{ 
    const int tid1 = threadIdx.x; 
    const int B = (n/THREADS); 
    int offset = 1; 
    for (int d = B>>1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (tid1 < d)
        {
            int ai = offset*(2*tid1+1)-1;
            int bi = offset*(2*tid1+2)-1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    __syncthreads();
    if (tid1 == 0 && blockIdx.x == 0) {
        aux[B - 1] = 0;
    }
    for (int d = 1; d < B; d *= 2) // traverse down tree & build scan
    {
        offset >>= 1;
        __syncthreads();
        if (tid1 < d)
        {
            int ai = offset*(2*tid1+1)-1;
            int bi = offset*(2*tid1+2)-1;
            mytype t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    g_odata[2*thid] += aux[blockIdx.x];
    g_odata[2*thid+1] += aux[blockIdx.x];
} 

int main(int argc, char *argv[]) 
{ 
    if (argc != 2) {
        printf("usage: %s n\n", argv[0]);
        return -1;
    }
    const int n = atoi(argv[1]); 
    mytype *h_i, *d_i, *h_o, *d_o, *d_temp; 
    const int size = n * sizeof(mytype); 
    h_i = (mytype *)malloc(size); 
    h_o = (mytype *)malloc(size); 
    if ((h_i == NULL) || (h_o == NULL)) {
        printf("malloc failed\n");
        return -1;
    }
    for (int i = 0; i < n; i++) {
        h_i[i] = i;
        h_o[i] = 0;
    }
    cudaMalloc(&d_i, size); 
    cudaMalloc(&d_temp, (n/THREADS)); 
    cudaMalloc(&d_o, size); 
    cudaMemset(d_o, 0, size); 
    cudaMemset(d_temp, 0, (n/THREADS)); 
    cudaMemcpy(d_i, h_i, size, cudaMemcpyHostToDevice); 
    int blocks = n/THREADS; 
    phaseI<<<blocks, THREADS/2 >>>(d_o, d_i, n, d_temp); 
    phaseII<<<blocks, THREADS/2>>>(d_o, d_temp, n); 
    cudaThreadSynchronize(); 
    cudaMemcpy(h_o, d_o, size, cudaMemcpyDeviceToHost); 
    printf("\n"); 
    for (int i = 0; i < n; i++) {
        printf(" %d", h_o[i]);
    }
    printf("\n\n"); 

    return 0; 
}

Does anyone have any idea what I am doing wrong?

2015-07-02 Shewartz

Please try it, and ask about the specific error you get or about a specific programming problem. Show your code! – JimiLoe

This question makes no sense. By definition, blocks are independent, and shared memory has block scope. What are you trying to ask? – talonmies

You need to post complete code that others can compile. As posted, your question makes no sense. –
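
To report a specific error, as the comments above ask, it can help to check the status that every CUDA runtime call returns. The following is only a minimal sketch: `CUDA_CHECK` and the `fill` kernel are hypothetical names used for illustration and are not part of the posted code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the CUDA error string and exit on failure.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Illustrative kernel: each thread writes its own global index.
__global__ void fill(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

int main()
{
    const int n = 4096;
    int *d_data;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(int)));
    fill<<<(n + 255) / 256, 256>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());       // errors in the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
    CUDA_CHECK(cudaFree(d_data));
    printf("no CUDA errors reported\n");
    return 0;
}
```

Wrapping each call this way means the first failing call is reported with its file and line, which is the kind of detail a question can then quote.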

One possible error I see in your code is here:

aux[thid] = temp[THREADS];

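If your temp array is temp[1024], as you say, and each block has 1024 threads, as you say, then with THREADS equal to 1024, temp[THREADS] accesses your shared memory array out of bounds (one past the end). An array of 1024 elements only has valid indices from 0 to 1023.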

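Beyond that, it seems you are asking how to take the last element of a shared memory array (temp) and place it in a (presumably global) aux array that has one element per block.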

Here is a fully worked example:

$ cat t831.cu 
#include <stdio.h> 

#define THREADS 1024 
#define BLOCKS 20 

__global__ void kernel(int *aux){ 

    __shared__ int temp[THREADS]; 
    temp[threadIdx.x] = threadIdx.x + blockIdx.x; 
    __syncthreads(); 
    if (threadIdx.x == 0) 
        aux[blockIdx.x] = temp[THREADS-1];
} 

int main(){ 

    int *h_data, *d_data; 
    const int dsize = BLOCKS*sizeof(int); 
    h_data=(int *)malloc(dsize); 
    cudaMalloc(&d_data, dsize); 
    memset(h_data, 0, dsize); 
    cudaMemset(d_data, 0, dsize); 
    kernel<<<BLOCKS, THREADS>>>(d_data); 
    cudaMemcpy(h_data, d_data, dsize, cudaMemcpyDeviceToHost); 
    for (int i = 0; i < BLOCKS; i++) printf("%d, ", h_data[i]); 
    printf("\n"); 
    return 0; 
} 

$ nvcc -o t831 t831.cu 
$ cuda-memcheck ./t831 
========= CUDA-MEMCHECK 
1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 
========= ERROR SUMMARY: 0 errors 
$
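
For arrays larger than one block can hold, GPU Gems 3 describes scanning each block separately, scanning the per-block totals, and then adding each scanned total back to its block. The sketch below follows that outline under stated assumptions; it is not the asker's or the answerer's code, and the names `scan_tiles`, `add_offsets`, `scan_device` and `ELEMS` are hypothetical. Compared with the kernels in the question, it offsets every global index by `blockIdx.x`, allocates the block-totals array in bytes as `blocks * sizeof(int)`, and scans the totals by calling the same routine recursively. Input values are kept small because summing `h_i[i] = i` in a 32-bit `int` overflows once `n` exceeds roughly 65,000.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 512            // threads per block (must be a power of two)
#define ELEMS   (2 * THREADS)  // elements scanned by each block

// Phase 1: exclusive Blelloch scan of one ELEMS-wide tile per block.
// The total of each tile is written to sums[blockIdx.x].
__global__ void scan_tiles(int *out, const int *in, int n, int *sums)
{
    __shared__ int temp[ELEMS];
    const int tid = threadIdx.x;
    const int i0 = blockIdx.x * ELEMS + 2 * tid;  // global index, offset per block
    const int i1 = i0 + 1;
    temp[2 * tid]     = (i0 < n) ? in[i0] : 0;    // pad past the end with zeros
    temp[2 * tid + 1] = (i1 < n) ? in[i1] : 0;

    int offset = 1;
    for (int d = ELEMS >> 1; d > 0; d >>= 1) {    // up-sweep (reduce)
        __syncthreads();
        if (tid < d) {
            const int ai = offset * (2 * tid + 1) - 1;
            const int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    if (tid == 0) {                               // thread 0 wrote this element last
        sums[blockIdx.x] = temp[ELEMS - 1];
        temp[ELEMS - 1] = 0;
    }
    for (int d = 1; d < ELEMS; d *= 2) {          // down-sweep
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            const int ai = offset * (2 * tid + 1) - 1;
            const int bi = offset * (2 * tid + 2) - 1;
            const int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    if (i0 < n) out[i0] = temp[2 * tid];
    if (i1 < n) out[i1] = temp[2 * tid + 1];
}

// Phase 3: add the scanned total of all preceding tiles to this tile.
__global__ void add_offsets(int *out, int n, const int *offsets)
{
    const int i0 = blockIdx.x * ELEMS + 2 * threadIdx.x;
    const int add = offsets[blockIdx.x];
    if (i0 < n)     out[i0]     += add;
    if (i0 + 1 < n) out[i0 + 1] += add;
}

// Exclusive scan of n device ints; phase 2 recurses on the tile totals.
void scan_device(int *d_out, const int *d_in, int n)
{
    const int blocks = (n + ELEMS - 1) / ELEMS;
    int *d_sums;
    cudaMalloc(&d_sums, blocks * sizeof(int));    // size in bytes, not elements
    scan_tiles<<<blocks, THREADS>>>(d_out, d_in, n, d_sums);
    if (blocks > 1) {
        scan_device(d_sums, d_sums, blocks);      // in place: tiles are disjoint
        add_offsets<<<blocks, THREADS>>>(d_out, n, d_sums);
    }
    cudaFree(d_sums);
}

int main(int argc, char *argv[])
{
    const int n = (argc > 1) ? atoi(argv[1]) : 1000000;  // need not be a power of two
    if (n <= 0) { printf("usage: %s n\n", argv[0]); return 1; }
    const size_t size = n * sizeof(int);
    int *h_in = (int *)malloc(size), *h_out = (int *)malloc(size);
    for (int i = 0; i < n; i++) h_in[i] = i % 10;  // small values avoid int overflow

    int *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);
    scan_device(d_out, d_in, n);
    cudaError_t err = cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(err)); return 1; }

    int running = 0, mismatches = 0;               // sequential reference scan
    for (int i = 0; i < n; i++) {
        if (h_out[i] != running) mismatches++;
        running += h_in[i];
    }
    printf("n = %d, mismatches = %d\n", n, mismatches);
    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return mismatches != 0;
}
```

Saved as a `.cu` file, this can be built with `nvcc` and run under `cuda-memcheck` in the same way as the example above, passing the array length as the only argument. Comparing against the sequential loop in `main` is a simple way to confirm the result for sizes beyond one block.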

2015-07-03 01:05:53

Thanks, I got it :) – Shewartz

I used your suggestion in my code, and I think I have the shared memory part working, but my code is not giving me a correct result – Shewartz
