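CUDA Thrust array length

My CUDA kernel uses Thrust to sort and reduce by key. When I use an array with more than 460 elements, it starts producing incorrect results.
Can anyone explain this behavior? Or is it something related to my machine?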
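Sorting works correctly regardless of the array size, but reduce_by_key does not and returns incorrect results. I have four arrays:
1) Input keys, defined as wholeSequenceArray.
2) Input values, defined in the kernel with an initial value of 1.
3) Output keys, which hold the distinct values of the input keys.
4) Output values, which hold the sum of the input values that correspond to the same input key.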
For more about reduce_by_key, see this page: https://thrust.github.io/doc/group__reductions.html#gad5623f203f9b3fdcab72481c3913f0e0
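To make the expected behavior of the four arrays concrete, here is a minimal sequential sketch of what I understand reduce_by_key to compute when the reduction is addition. It is plain host code, not Thrust's implementation, and the function name `reduce_by_key_reference` is made up for illustration. It assumes the keys are already sorted (or at least that equal keys are adjacent) and that there are at least as many values as keys.

```cuda
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical reference helper (not part of Thrust): collapses each run of
// consecutive equal keys into one output key and sums the matching values.
std::pair<std::vector<unsigned int>, std::vector<unsigned int>>
reduce_by_key_reference(const std::vector<unsigned int>& input_keys,
                        const std::vector<unsigned int>& input_values) {
    // Assumption: one value per key.
    assert(input_values.size() >= input_keys.size());

    std::vector<unsigned int> output_keys;
    std::vector<unsigned int> output_values;
    for (std::size_t i = 0; i < input_keys.size(); ++i) {
        if (output_keys.empty() || output_keys.back() != input_keys[i]) {
            // A new run of keys starts here.
            output_keys.push_back(input_keys[i]);
            output_values.push_back(input_values[i]);
        } else {
            // Same key as the previous element: accumulate its value.
            output_values.back() += input_values[i];
        }
    }
    return {output_keys, output_values};
}
```

With every input value set to 1, the output values are simply the number of times each key occurs, which is the result I am after.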
Here is my code:
#include <cstdlib>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
using namespace std;
#define size 461
__global__ void calculateOccurances(unsigned int *input_keys,
        unsigned int *output_Values) {
    int tid = threadIdx.x;
    const int N = size;
    __shared__ unsigned int input_values[N];
    unsigned int outputKeys[N];
    int i = tid;
    while (i < N) {
        if (tid < N) {
            input_values[tid] = 1;
        }
        i += blockDim.x;
    }
    __syncthreads();
    thrust::sort(thrust::device, input_keys, input_keys + N);
    thrust::reduce_by_key(thrust::device, input_keys, input_keys + N,
            input_values, outputKeys, output_Values);
    if (tid == 0) {
        for (int i = 0; i < N; ++i) {
            printf("%d,", output_Values[i]);
        }
    }
}
int main(int argc, char** argv) {
unsigned int wholeSequenceArray[size] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,1 };
    cout << "wholeSequenceArray:" << endl;
    for (int i = 0; i < size; i++) {
        cout << wholeSequenceArray[i] << ",";
    }
    cout << "\nStart C++ Array New" << endl;
    cout << "Size of Input:" << size << endl;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    unsigned int counts[size];
    unsigned int *d_whole;
    unsigned int *d_counts;
    cudaMalloc((void**) &d_whole, size * sizeof(unsigned int));
    cudaMalloc((void**) &d_counts, size * sizeof(unsigned int));
    cudaMemcpy(d_whole, wholeSequenceArray, size * sizeof(unsigned int),
            cudaMemcpyHostToDevice);
    calculateOccurances<<<1, size>>>(d_whole, d_counts);
    cudaMemcpy(counts, d_counts, size * sizeof(unsigned int),
            cudaMemcpyDeviceToHost);
    cout << endl << "Counts" << endl << endl;
    for (int i = 0; i < size; ++i) {
        cout << counts[i] << ",";
    }
    cout << endl;
    cudaFree(d_whole);
}
Do you get any errors when you [check for CUDA errors](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api)?
No, it runs smoothly. I removed the CUDA error-checking code just to keep the code shorter :)
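For reference, the kind of error checking discussed in these comments could be sketched roughly as follows. The macro name `CUDA_CHECK` and the helper `check_last_kernel` are assumptions chosen for illustration; they are not part of the CUDA runtime or of the code in the question. Only `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are actual runtime API calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: wraps a CUDA runtime call and aborts with a readable
// message if the call did not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: call right after a kernel launch.
// cudaGetLastError reports launch failures (for example, an invalid launch
// configuration); cudaDeviceSynchronize waits for the kernel to finish and
// reports errors raised while it was running.
inline void check_last_kernel() {
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
}
```

In the question's code, this would mean wrapping each `cudaMalloc` and `cudaMemcpy` in `CUDA_CHECK(...)` and calling `check_last_kernel()` directly after the `calculateOccurances<<<1, size>>>(d_whole, d_counts);` launch.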
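I don't think you understand how using `thrust` in device code works. You have 461 threads, each of which is doing its own **separate** sort of the same data, in the same place. That is probably not a useful algorithm: those 461 threads will be stepping on each other as they move data around during the sort. It's not clear to me that you need a CUDA kernel here at all. The algorithm you describe could be accomplished by using thrust in the ordinary way (i.e. from host code). The work will still be done on the device.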
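A minimal sketch of that host-side approach might look like the code below. The function name `count_occurrences` and its signature are assumptions made for illustration. Using `thrust::constant_iterator` in place of an explicit array of ones is a design choice of this sketch, not something taken from the question.

```cuda
#include <cstddef>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/pair.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// Hypothetical helper: counts how often each key occurs. Thrust is called
// from host code, but the sort and the reduction still run on the device.
void count_occurrences(const std::vector<unsigned int>& host_keys,
                       std::vector<unsigned int>& unique_keys,
                       std::vector<unsigned int>& counts) {
    const std::size_t n = host_keys.size();

    // Copy the keys to the device and reserve worst-case output space
    // (every key distinct).
    thrust::device_vector<unsigned int> d_keys(host_keys.begin(), host_keys.end());
    thrust::device_vector<unsigned int> d_out_keys(n);
    thrust::device_vector<unsigned int> d_out_counts(n);

    // reduce_by_key only merges runs of adjacent equal keys, so sort first.
    thrust::sort(d_keys.begin(), d_keys.end());

    // Each key contributes the value 1; summing per key gives its count.
    auto ends = thrust::reduce_by_key(d_keys.begin(), d_keys.end(),
                                      thrust::constant_iterator<unsigned int>(1),
                                      d_out_keys.begin(), d_out_counts.begin());

    // The returned pair marks the ends of the output ranges actually written.
    const std::size_t num_unique = ends.first - d_out_keys.begin();

    // Copy only the valid part of the results back to the host.
    unique_keys.resize(num_unique);
    counts.resize(num_unique);
    thrust::copy(d_out_keys.begin(), d_out_keys.begin() + num_unique,
                 unique_keys.begin());
    thrust::copy(d_out_counts.begin(), d_out_counts.begin() + num_unique,
                 counts.begin());
}
```

Because a single call operates on the whole array, no threads compete to sort the same data. Note that only the first `num_unique` output entries are meaningful, whereas the code in the question prints all `size` entries of the output array.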