OpenCV for loop with CUDA - parallel processing

I'm stuck on a problem: getting the iterator in this loop to work on CUDA. Can anyone help?

std::vector<cv::DMatch> matches; 
std::vector<cv::KeyPoint> key_pts1, key_pts2; 
std::vector<cv::Point2f> points1, points2; 
for (std::vector<cv::DMatch>::const_iterator itr = matches.begin(); itr != matches.end(); ++itr) 
{ 
    float x = key_pts1[itr->queryIdx].pt.x; 
    float y = key_pts1[itr->queryIdx].pt.y; 
    points1.push_back(cv::Point2f(x, y)); 
    x = key_pts2[itr->trainIdx].pt.x; 
    y = key_pts2[itr->trainIdx].pt.y; 
    points2.push_back(cv::Point2f(x, y)); 
}

I want to convert the loop above to CUDA parallel processing, but it is proving hard for me. This is what I have tried so far:

void dmatchLoopHomography(float *itr, float *match_being, float *match_end, float  *keypoint_1, float *keypoint_2, float *pts1, float *pts2) 
{ 
float x, y; 
// allocate memory in GPU memory 
unsigned char *mtch_begin, *mtch_end, *keypt_1, *keypt_2, *points1, *points2; 
cudaHostGetDevicePointer(&mtch_begin, match_being, 0); 
cudaHostGetDevicePointer(&mtch_end, match_end, 0); 
cudaHostGetDevicePointer(&keypt_1, keypoint_1, 0); 
cudaHostGetDevicePointer(&keypt_2, keypoint_2, 0); 
cudaHostGetDevicePointer(&points1, pts1, 0); 
cudaHostGetDevicePointer(&points2, pts2, 0); 

//dim3 blocks(16, 16); 
dim3 threads(itr, itr); 
//kernel 
dmatchLoopHomography_ker<<<itr,itr>>>(mtch_begin, mtch_end, keypt_1, keypt_2, points1, points2); 
cudaThreadSynchronize();  
}

and

__global__ void dmatchLoopHomography_ker(float *itr, float *match_being, float *match_end, float *keypoint_1, float *keypoint_2, float *pts1, float *pts2) 
{ 
//how do I go about it ?? 
}

Source

2012-11-07 Mahesh

First, I notice that your program consists of moving data from a vector<KeyPoint> into a vector<Point2f>. OpenCV has a very nice one-liner that does this for you:

using namespace cv; 
KeyPoint::convert(key_pts1, points1); //vector<KeyPoint> to vector<Point2f>

Now, let's talk about the GPU side. It turns out that cudaHostGetDevicePointer() does not allocate memory. You need cudaMalloc() to allocate memory on the device. For example:

//compile with nvcc, not gcc 
float* device_matches; 
int match_length = matches.end() - matches.begin(); 
cudaMalloc(&device_matches, match_length*sizeof(float)); 
//do the same as above for key_pts1, key_pts2, points1, and points2

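Now, device_matches is just a plain C array, not an STL vector, so you don't have iterators. Instead, you have to use plain array indexing. If you really need iterators on the GPU, take a look at the Thrust library. Thrust is very handy, but the drawback is that it only provides a specific set of pre-baked functions.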
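To make "plain array indexing" concrete, here is a minimal sketch in ordinary C++ (no CUDA or OpenCV needed to compile it). It assumes the matches and key points have already been copied into flat arrays; the names gather_one, gather_all, query_idx, train_idx, kp1_xy and the interleaved x, y layout are my own illustrative choices, not OpenCV or CUDA API. The body of gather_one is the per-match work; in a kernel, i would typically be computed as blockIdx.x * blockDim.x + threadIdx.x instead of coming from a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work for ONE match i, using only array indexing (no iterators).
// Coordinates are stored interleaved: x0, y0, x1, y1, ...
// All names here are hypothetical, chosen for illustration.
inline void gather_one(std::size_t i,
                       const int* query_idx, const int* train_idx,
                       const float* kp1_xy, const float* kp2_xy,
                       float* pts1_xy, float* pts2_xy)
{
    const int q = query_idx[i];  // index into the first key point array
    const int t = train_idx[i];  // index into the second key point array
    pts1_xy[2 * i]     = kp1_xy[2 * q];
    pts1_xy[2 * i + 1] = kp1_xy[2 * q + 1];
    pts2_xy[2 * i]     = kp2_xy[2 * t];
    pts2_xy[2 * i + 1] = kp2_xy[2 * t + 1];
}

// Sequential stand-in for the thread grid. Outputs are sized up front and
// each i writes only to its own slots, so iterations are independent
// (unlike push_back, which depends on the order of the previous calls).
void gather_all(const std::vector<int>& query_idx,
                const std::vector<int>& train_idx,
                const std::vector<float>& kp1_xy,
                const std::vector<float>& kp2_xy,
                std::vector<float>& pts1_xy,
                std::vector<float>& pts2_xy)
{
    assert(query_idx.size() == train_idx.size());
    const std::size_t n = query_idx.size();
    pts1_xy.resize(2 * n);
    pts2_xy.resize(2 * n);
    for (std::size_t i = 0; i < n; ++i) {
        gather_one(i, query_idx.data(), train_idx.data(),
                   kp1_xy.data(), kp2_xy.data(),
                   pts1_xy.data(), pts2_xy.data());
    }
}
```

The key change from the original loop is replacing push_back with a write to a precomputed position, which is what lets each match be handled by a separate thread.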

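The bigger question is whether you want to run this particular part of your program on the GPU at all. I'd recommend using the GPU for the really compute-intensive work (e.g., the actual feature matching); moving data between formats, as in your example code, is much cheaper than feature matching. Also, keep in mind that you often have to structure your data differently on the GPU than on the CPU. This restructuring isn't necessarily expensive computationally, but you'll want to set aside some time to work things out on a whiteboard, tear your hair out, etc. Finally, if you're serious about GPU programming, it may be worth working through some simple GPU programming examples (I like the Dr. Dobbs Supercomputing for the Masses tutorials), taking a GPU/parallel programming class, or talking with some GPU hacker friends.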
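As a rough illustration of what "structuring data differently" can mean here, the sketch below flattens a list of match structs and a list of key points into plain arrays of ints and floats, which is the kind of layout you could then copy to the device. MatchLite and KeyPointLite are hypothetical stand-ins I made up so the example compiles without OpenCV; with the real types you would read queryIdx and trainIdx from cv::DMatch, and pt.x and pt.y from cv::KeyPoint.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for cv::DMatch and cv::KeyPoint (illustration only).
struct MatchLite    { int queryIdx; int trainIdx; };
struct KeyPointLite { float x; float y; };

// Split the match list into two plain index arrays.
void flatten_matches(const std::vector<MatchLite>& matches,
                     std::vector<int>& query_idx,
                     std::vector<int>& train_idx)
{
    query_idx.clear();
    train_idx.clear();
    query_idx.reserve(matches.size());
    train_idx.reserve(matches.size());
    for (const MatchLite& m : matches) {
        assert(m.queryIdx >= 0 && m.trainIdx >= 0);
        query_idx.push_back(m.queryIdx);
        train_idx.push_back(m.trainIdx);
    }
}

// Pack key point coordinates as x0, y0, x1, y1, ...
std::vector<float> flatten_keypoints(const std::vector<KeyPointLite>& kps)
{
    std::vector<float> xy;
    xy.reserve(2 * kps.size());
    for (const KeyPointLite& k : kps) {
        xy.push_back(k.x);
        xy.push_back(k.y);
    }
    return xy;
}
```

The interleaved x, y layout is only one option; two separate arrays for x and y would work just as well, and which one is faster on a given GPU is something to measure rather than assume.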

Source

2012-11-10 06:03:52 solvingPuzzles

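Thanks @solvingPuzzles for your answer and comments. Yes, I agree with you that feature matching is far more computationally demanding. The motivation for my question was to understand how to solve this simple problem first, so that later I can use what I learn about the GPU along the way to move feature matching onto the GPU. Thanks for the Dr. Dobbs link; I have been working through it as well. By the way, thanks for KeyPoint::convert, I didn't know about it before... – Mahesh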

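@timothy Great! As a toy example, I'd suggest coding matrix multiplication in CUDA. Feel free to "cheat" and look at sample code. The code won't be very long, but you'll learn the CUDA boilerplate (things like threadIdx, blockIdx, cudaMemcpy, grid, block, and so on). – solvingPuzzles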
