CUDA推力陣列長度

我的CUDA內核正在使用推力，按鍵排序和減少。當我使用陣列超過460它開始顯示不正確的結果。CUDA推力陣列長度

任何人都可以解釋這種行爲？或者它與我的機器有關？

儘管尺寸很大，排序仍然正常，但是，REDUCE_BY_KEY運行不正常。並返回不正確的結果。我有4個數組 1）輸入鍵被定義爲wholeSequenceArray。 2）在內核中定義的初始值爲1的輸入值。 3）輸出鍵用於保存輸入鍵的不同值 4）輸出值用於保存對應於相同輸入的輸入值之和關鍵。

有關reduce_by_key更多介紹請訪問此頁： https://thrust.github.io/doc/group__reductions.html#gad5623f203f9b3fdcab72481c3913f0e0

這裏是我的代碼：

#include <cstdlib> 
#include <stdlib.h> 
#include <stdio.h> 
#include <iostream> 
#include <vector> 
#include <fstream> 
#include <string> 
#include <cuda.h> 
#include <cuda_runtime.h> 
#include <thrust/device_vector.h> 
#include <thrust/host_vector.h> 
#include <thrust/sort.h> 
#include <thrust/reduce.h> 
#include <thrust/execution_policy.h> 

using namespace std; 
#define size 461 

__global__ void calculateOccurances(unsigned int *input_keys, 
      unsigned int *output_Values) { 
    int tid = threadIdx.x; 

    const int N = size; 
    __shared__ unsigned int input_values[N]; 

    unsigned int outputKeys[N]; 

    int i = tid; 
    while (i < N) { 
      if (tid < N) { 
        input_values[tid] = 1; 
      } 
      i += blockDim.x; 
    } 
    __syncthreads(); 

    thrust::sort(thrust::device, input_keys, input_keys + N); 

    thrust::reduce_by_key(thrust::device, input_keys, input_keys + N, 
        input_values, outputKeys, output_Values); 

    if (tid == 0) { 
      for (int i = 0; i < N; ++i) { 
        printf("%d,", output_Values[i]); 
      } 
    } 

} 

int main(int argc, char** argv) { 

    unsigned int wholeSequenceArray[size] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
        11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
        10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 
        9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 
        8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 
        7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 
        6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 
        5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 
        4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 
        3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 
        2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
        20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
        19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 
        18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 
        17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
        16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
        15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
        14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
        13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
        12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
        11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
        10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 
        9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 
        8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,1 }; 

    cout << "wholeSequenceArray:" << endl; 
    for (int i = 0; i < size; i++) { 
      cout << wholeSequenceArray[i] << ","; 
    } 

    cout << "\nStart C++ Array New" << endl; 
    cout << "Size of Input:" << size << endl; 

    cudaDeviceProp prop; 
    cudaGetDeviceProperties(&prop, 0); 
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock); 

    unsigned int counts[size]; 
    unsigned int *d_whole; 
    unsigned int *d_counts; 

    cudaMalloc((void**) &d_whole, size * sizeof(unsigned int)); 
    cudaMalloc((void**) &d_counts, size * sizeof(unsigned int)); 

    cudaMemcpy(d_whole, wholeSequenceArray, size * sizeof(unsigned int), 
        cudaMemcpyHostToDevice); 

    calculateOccurances<<<1, size>>>(d_whole, d_counts); 

    cudaMemcpy(counts, d_counts, size * sizeof(unsigned int), 
        cudaMemcpyDeviceToHost); 

    cout << endl << "Counts" << endl << endl; 
    for (int i = 0; i < size; ++i) { 
      cout << counts[i] << ","; 
    } 
    cout << endl; 

    cudaFree(d_whole); 
}

來源

2017-02-12 Emad R

當[檢查CUDA錯誤]時你會得到任何錯誤（http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using -THE-CUDA的運行時API）？ –

不，它運行平穩，我刪除了cuda錯誤代碼只是爲了使代碼更小:) –

我不認爲你明白如何在設備代碼中使用'thrust'工作。你有461個線程，每個線程都是自己做的，**分開**在相同的地方對相同的數據進行排序。這可能不是一個有用的算法。這些461個線程將在彼此移動數據時進行排序。我不清楚你在這裏需要一個CUDA內核。您所描述的算法可以通過以普通方式（即從主機代碼）使用推力來完成。該工作仍將在設備上完成。 –

當你在內核調用一個推力算法，即推力算法在派遣整個來自每個CUDA線程。因此，您的代碼正在同一地點對同一數據（每個CUDA內核線程一次）執行461次排序操作。這意味着每個線程在分類操作過程中移動數據時都會相互移動。

如果您只是想使用您在問題中概述的方法來統計數字的出現次數（有效直方圖），並且您想使用推力，則根本不需要編寫CUDA內核。

如果您確實想從CUDA內核中正確執行此操作，那麼您需要將推力操作（sort和reduce_by_key）限制爲僅從單個線程執行操作。（甚至這種方法將被限制在一個塊中）。

我真的不認爲第二種方法（CUDA內核）有意義，但爲了完整性，我修改了代碼以包含每種方法的正確示例。需要注意的是，一旦你進行還原，不再有在打印出的每個陣列中的所有461項的任何一點，所以我已經限制了打印到第一25個條目，每個陣列其中爲了清楚：

$ cat t91.cu 
#include <cstdlib> 
#include <stdlib.h> 
#include <stdio.h> 
#include <iostream> 
#include <vector> 
#include <fstream> 
#include <string> 
#include <cuda.h> 
#include <cuda_runtime.h> 
#include <thrust/device_vector.h> 
#include <thrust/host_vector.h> 
#include <thrust/sort.h> 
#include <thrust/reduce.h> 
#include <thrust/execution_policy.h> 
#include <thrust/iterator/constant_iterator.h> 

using namespace std; 
#define size 461 

__global__ void calculateOccurances(unsigned int *input_keys, 
      unsigned int *output_Values) { 
    int tid = threadIdx.x; 

    const int N = size; 
    __shared__ unsigned int input_values[N]; 

    unsigned int outputKeys[N]; 

    int i = tid; 
    while (i < N) { 
      if (tid < N) { 
        input_values[tid] = 1; 
      } 
      i += blockDim.x; 
    } 
    __syncthreads(); 
    if (tid == 0){ 
     thrust::sort(thrust::device, input_keys, input_keys + N); 

     thrust::reduce_by_key(thrust::device, input_keys, input_keys + N, 
        input_values, outputKeys, output_Values); 
     } 

    if (tid == 0) { 
    printf("from kernel:\n"); 
      for (int i = 0; i < 25; ++i) { 
        printf("%d,", output_Values[i]); 
      } 
    } 

} 

int main(int argc, char** argv) { 

    unsigned int wholeSequenceArray[size] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
        11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
        10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 
        9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 
        8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 
        7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 
        6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 
        5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 
        4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 
        3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 
        2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
        20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
        19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 
        18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 
        17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
        16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
        15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
        14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
        13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
        12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
        11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
        10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 
        9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 
        8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,1 }; 

    cout << "wholeSequenceArray:" << endl; 
    for (int i = 0; i < size; i++) { 
      cout << wholeSequenceArray[i] << ","; 
    } 

    cout << "\nStart C++ Array New" << endl; 
    cout << "Size of Input:" << size << endl; 

    cudaDeviceProp prop; 
    cudaGetDeviceProperties(&prop, 0); 
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock); 

//just using thrust 

    thrust::device_vector<int> d_seq(wholeSequenceArray, wholeSequenceArray+size); 
    thrust::device_vector<int> d_val_out(size); 
    thrust::device_vector<int> d_key_out(size); 

    thrust::sort(d_seq.begin(), d_seq.end()); 
    int rsize = thrust::get<0>(thrust::reduce_by_key(d_seq.begin(), d_seq.end(), thrust::constant_iterator<int>(1), d_key_out.begin(), d_val_out.begin())) - d_key_out.begin(); 
    std::cout << "rsize:" << rsize << std::endl; 
    std::cout << "Thrust keys:" << std::endl; 
    thrust::copy_n(d_key_out.begin(), rsize, std::ostream_iterator<int>(std::cout, ",")); 
    std::cout << std::endl << "Thrust vals:" << std::endl; 
    thrust::copy_n(d_val_out.begin(), rsize, std::ostream_iterator<int>(std::cout, ",")); 
    std::cout << std::endl; 


// in a cuda kernel 


    unsigned int counts[size]; 
    unsigned int *d_whole; 
    unsigned int *d_counts; 

    cudaMalloc((void**) &d_whole, size * sizeof(unsigned int)); 
    cudaMalloc((void**) &d_counts, size * sizeof(unsigned int)); 

    cudaMemcpy(d_whole, wholeSequenceArray, size * sizeof(unsigned int), 
        cudaMemcpyHostToDevice); 

    calculateOccurances<<<1, size>>>(d_whole, d_counts); 

    cudaMemcpy(counts, d_counts, size * sizeof(unsigned int), 
        cudaMemcpyDeviceToHost); 

    std::cout << "from Host:" << std::endl; 
    cout << endl << "Counts" << endl << endl; 
    for (int i = 0; i < 25; ++i) { 
      cout << counts[i] << ","; 
    } 
    cout << endl; 

    cudaFree(d_whole); 
} 
$ nvcc -arch=sm_61 -o t91 t91.cu 
$ ./t91 
wholeSequenceArray: 
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1, 
Start C++ Array New 
Size of Input:461 
Max threads per block: 1024 
rsize:20 
Thrust keys: 
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20, 
Thrust vals: 
24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23, 
from kernel: 
24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,526324,526325,526325,526327,526329,from Host: 

Counts 

24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,526324,526325,526325,526327,526329, 
$

注意事項：

我在推力示例中包含了一個方法，因此您可以準確知道輸出數組的大小。
推力方法應該獨立於size參數工作正常 - 受GPU的限制（如內存大小）的限制。 CUDA內核方法實際上只是從單個線程執行推力代碼，因此運行超過1個塊並不明智。
您可能希望參考this question/answer以獲取有關使用CUDA內核推力的更多討論。

來源

2017-02-13 02:22:23

CUDA推力陣列長度

回答

相關問題