使用CUDA計算矩陣的相應行之間的歐幾里得距離

我有一個非常簡單的算法，可以計算兩個矩陣的相應行之間的平方歐幾里德距離。我有以下代碼，但不幸的是，它不會爲不同的矩陣大小返回正確的結果。更具體地講，它工作正常大小2000x4，500x4，2500x2，600x8，1000x8，100x8的矩陣，但它是不工作的大小2500x3，2500x5，400x3，100x3，100x10，1000x10，1000x12，500x12，500x14的矩陣。使用CUDA計算矩陣的相應行之間的歐幾里得距離

任何人都可以幫助我嗎？我想手動執行，而不使用任何優化庫，因爲我想了解線程管理。

__global__ void cudaEuclid(float* A, float* B, float* C, int rows, int cols) 
    { 
     int i, squareeucldist = 0; 
     int r = blockDim.x * blockIdx.x + threadIdx.x; // rows 
     int c = blockDim.y * blockIdx.y + threadIdx.y; // cols 
     extern __shared__ float sdata[]; 
     //int r = blockIdx.y; int c = threadIdx.x; 
     if(r < rows && c < cols ){ 

      //C[r + rows*c] = (A[r + rows*c] - B[r + rows*c]) * (A[r + rows*c] - B[r + rows*c]); 


      sdata[threadIdx.x] = (A[r + rows*c] - B[r + rows*c]) * (A[r + rows*c] - B[r + rows*c]); 

      __syncthreads(); 

      // contiguous range pattern 
      for(int offset = blockDim.x/2; 
       offset > 0; 
       offset >>= 1) 
      { 
       if(threadIdx.x < offset) 
       { 
        // add a partial sum upstream to our own 
        sdata[threadIdx.x] += sdata[threadIdx.x + offset]; 
       } 

       // wait until all threads in the block have 
       // updated their partial sums 
       __syncthreads(); 
      } 

      // thread 0 writes the final result 
      if(threadIdx.x == 0) 
      { 
       C[r] = sdata[0]; 
      } 

     } 

    }

內核調用是：

dim3 dimBlock(cols, 1); 
dim3 dimGrid(1, rows); 
cudaEuclid<<<dimGrid, cols, cols*sizeof(float)>>>(d_A, d_B, d_C, rows, cols);

PS：我想提一提，我已經發布了類似的問題，但它是從一開始就不清，討論是無所適從。儘管Tom提出了一個非常有用的建議，認爲未來優化實施將非常實用，但我需要更多的手工製作。最後，我發表這篇文章的原因是因爲我不想讓相關文章更加複雜。謝謝。

來源

2013-10-11 Darkmoor

您測試過60x8嗎？或者您在60x5時停止了嗎？奇數列似乎沒有正確處理。或者甚至可能是2的給予'偏移>> = 1'的非冪... – chappjc

它正在爲60x8工作。 – Darkmoor

有道理，這就是問題所在，儘管Eric給出了一個完整的答案。 – chappjc

實際上，當n足夠小時，您的代碼僅適用於m * 2^n。你可能想更仔細閱讀第14頁下面的幻燈片，

和思考以下幾個問題

當你blockDim.x等於3或5會發生什麼;
當blockDim.x或cols不是2的冪時，並行還原可以如何正確完成;
爲什麼減少結果小於預期;
哪個元素在sdata[]沒有被添加到最後的總和;
當cols爲5時，如果將blockDim.x和smem的大小設置爲2^3，結果是否正確;
在Q5的情況下，如何處理多餘的3元空間中smem[5..7]

儘量模擬一步用你的筆和紙將有助於運行for循環的一步。

來源

2013-10-11 18:56:22 kangshiyin

我在更新帖子的同時給出了答案。順便說一下，它不是在60x3上工作。 – Darkmoor

只需要添加幾行來處理cols不是2的情況。 – kangshiyin

雖然OP不想使用優化的庫來回答他的問題，但是這篇文章有一個有用的標題，其他用戶可以發現在沒有手寫內核的情況下解決問題很有用。

我很好奇，並在解決問題的過程中扮演了一些角色，並且牢記使用CUDA Thrust。我結束了下面的代碼，它使用thrust::reduce_by_key來計算兩個矩陣的同行之間的距離。

#include <thrust\device_vector.h> 
#include <thrust\transform_reduce.h> 
#include <thrust\sequence.h> 
#include <thrust\random.h> 
#include <thrust\gather.h> 
#include <thrust\extrema.h> 

using namespace thrust::placeholders; 

/****************************************************/ 
/* POWER DIFFERENCE FUNCTOR FOR EUCLIDEAN DISTANCES */ 
/****************************************************/ 
struct PowerDifference { 
    __host__ __device__ float operator()(const float& a, const float& b) const { return pow(a - b, 2); } 
}; 

/*******************/ 
/* EXPAND OPERATOR */ 
/*******************/ 
template <typename InputIterator1, typename InputIterator2, typename OutputIterator> 
OutputIterator expand(InputIterator1 first1, 
         InputIterator1 last1, 
         InputIterator2 first2, 
         OutputIterator output) 
{ 
    typedef typename thrust::iterator_difference<InputIterator1>::type difference_type; 

    difference_type input_size = thrust::distance(first1, last1); 
    difference_type output_size = thrust::reduce(first1, last1); 

    // scan the counts to obtain output offsets for each input element 
    thrust::device_vector<difference_type> output_offsets(input_size, 0); 
    thrust::exclusive_scan(first1, last1, output_offsets.begin()); 

    // scatter the nonzero counts into their corresponding output positions 
    thrust::device_vector<difference_type> output_indices(output_size, 0); 
    thrust::scatter_if(thrust::counting_iterator<difference_type>(0), thrust::counting_iterator<difference_type>(input_size), 
         output_offsets.begin(), first1, output_indices.begin()); 

    // compute max-scan over the output indices, filling in the holes 
    thrust::inclusive_scan(output_indices.begin(), output_indices.end(), output_indices.begin(), thrust::maximum<difference_type>()); 

    // gather input values according to index array (output = first2[output_indices]) 
    OutputIterator output_end = output; thrust::advance(output_end, output_size); 
    thrust::gather(output_indices.begin(), output_indices.end(), first2, output); 

    // return output + output_size 
    thrust::advance(output, output_size); 

    return output; 
} 

/********/ 
/* MAIN */ 
/********/ 
int main() 
{ 
    /**************************/ 
    /* SETTING UP THE PROBLEM */ 
    /**************************/ 

    const int N  = 10;   // --- Number of vector elements 
    const int Nvec = 20;   // --- Number of vectors for each matrix 

    // --- Random uniform integer distribution between 0 and 100 
    thrust::default_random_engine rng; 
    thrust::uniform_int_distribution<int> dist(0, 20); 

    // --- Matrix allocation and initialization 
    thrust::device_vector<float> d_matrix1(Nvec * N); 
    thrust::device_vector<float> d_matrix2(Nvec * N); 
    for (size_t i = 0; i < d_matrix1.size(); i++) d_matrix1[i] = (float)dist(rng); 
    for (size_t i = 0; i < d_matrix2.size(); i++) d_matrix2[i] = (float)dist(rng); 

    printf("\n\nFirst matrix\n"); 
    for(int i = 0; i < Nvec; i++) { 
     std::cout << " [ "; 
     for(int j = 0; j < N; j++) 
      std::cout << d_matrix1[i * N + j] << " "; 
     std::cout << "]\n"; 
    } 

    printf("\n\nSecond matrix\n"); 
    for(int i = 0; i < Nvec; i++) { 
     std::cout << " [ "; 
     for(int j = 0; j < N; j++) 
      std::cout << d_matrix2[i * N + j] << " "; 
     std::cout << "]\n"; 
    } 

    /****************************************************************************/ 
    /* CALCULATING THE EUCLIDEAN DISTANCES BETWEEN THE ROWS OF THE TWO MATRICES */ 
    /****************************************************************************/ 
    // --- Creating the indices for the reduction by key 
    thrust::device_vector<int> d_sequence(Nvec); 
    thrust::device_vector<int> d_indices(Nvec * N); 
    thrust::device_vector<int> d_counts(Nvec, N); 
    thrust::sequence(d_sequence.begin(), d_sequence.begin() + Nvec); 
    expand(d_counts.begin(), d_counts.end(), d_sequence.begin(), d_indices.begin()); 

    printf("\n\nSecond matrix\n"); 
    for(int i = 0; i < Nvec; i++) { 
     std::cout << " [ "; 
     for(int j = 0; j < N; j++) 
      std::cout << d_indices[i * N + j] << " "; 
     std::cout << "]\n"; 
    } 

    thrust::device_vector<float> d_squared_differences(Nvec * N); 

    thrust::transform(d_matrix1.begin(), d_matrix1.end(), d_matrix2.begin(), d_squared_differences.begin(), PowerDifference()); 

    thrust::device_vector<float> d_norms(Nvec); 
    thrust::reduce_by_key(d_indices.begin(), d_indices.end(), d_squared_differences.begin(), d_indices.begin(), d_norms.begin()); 

    printf("\n\ndnorms\n"); 
    for(int i = 0; i < Nvec; i++) { 
      std::cout << d_norms[i] << " "; 
    } 

    return 0; 
}

來源

2015-06-25 17:26:58 JackOLantern

使用CUDA計算矩陣的相應行之間的歐幾里得距離

回答

相關問題