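Boost.MPI: What's received isn't what was sent!

I am fairly new to Boost MPI. I have the libraries installed and the code compiles, but I am getting a very odd error: some of the integer data received by the slave nodes is not what the master sent. What is going on?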

I am using Boost version 1.42.0 and compiling with mpic++ (which wraps g++ on one cluster and icpc on the other). A reduced example follows, together with its output.


#include <cstdlib>
#include <iostream>
#include <vector>
#include <boost/mpi.hpp>

using namespace std;
namespace mpi = boost::mpi;

class Solution
{
public:
    // Master node's constructor: takes the next number from the shared counter
    Solution() : solution_num(num_solutions++) {}

    // Slave nodes' constructor
    Solution(int solutionNum) : solution_num(solutionNum) {}

    int solutionNum() const
    {
        return solution_num;
    }

private:
    static int num_solutions;
    int solution_num;
};

int Solution::num_solutions = 0;

int main(int argc, char* argv[])
{
    // Initialization of MPI
    mpi::environment env(argc, argv);
    mpi::communicator world;

    if (world.rank() == 0)
    {
        // Create solutions
        int numSolutions = world.size() - 1; // One solution per slave
        vector<Solution*> solutions(numSolutions);
        for (int sol = 0; sol < numSolutions; ++sol)
            solutions[sol] = new Solution;

        // Send solutions
        for (int sol = 0; sol < numSolutions; ++sol)
        {
            world.isend(sol + 1, 0, false); // Tells the slave to expect work
            cout << "Sending solution no. " << solutions[sol]->solutionNum() << " to node " << sol + 1 << endl;
            world.isend(sol + 1, 1, solutions[sol]->solutionNum());
        }

        // Retrieve values (solution numbers squared)
        vector<double> values(numSolutions, 0);
        for (int i = 0; i < numSolutions; ++i)
        {
            // Get values for each solution
            double value = 0;
            mpi::status status = world.recv(mpi::any_source, 2, value);
            int source = status.source();

            int sol = source - 1;
            values[sol] = value;
        }
        for (int i = 1; i <= numSolutions; ++i)
            world.isend(i, 0, true); // Tells the slave to finish

        // Output the solution numbers and their squares
        for (int i = 0; i < numSolutions; ++i)
        {
            cout << solutions[i]->solutionNum() << ", " << values[i] << endl;
            delete solutions[i];
        }
    }
    else
    {
        // Slave nodes merely square the solution number
        bool finished;
        mpi::status status = world.recv(0, 0, finished);
        while (!finished)
        {
            int solNum;
            world.recv(0, 1, solNum);
            cout << "Node " << world.rank() << " receiving solution no. " << solNum << endl;

            Solution solution(solNum);
            double value = static_cast<double>(solNum * solNum);
            world.send(0, 2, value);

            status = world.recv(0, 0, finished);
        }
    }

    cout << "Node " << world.rank() << " finished." << endl;

    return EXIT_SUCCESS;
}


Sending solution no. 0 to node 1 
Sending solution no. 1 to node 2 
Sending solution no. 2 to node 3 
Sending solution no. 3 to node 4 
Sending solution no. 4 to node 5 
Sending solution no. 5 to node 6 
Sending solution no. 6 to node 7 
Sending solution no. 7 to node 8 
Sending solution no. 8 to node 9 
Sending solution no. 9 to node 10 
Sending solution no. 10 to node 11 
Sending solution no. 11 to node 12 
Sending solution no. 12 to node 13 
Sending solution no. 13 to node 14 
Sending solution no. 14 to node 15 
Sending solution no. 15 to node 16 
Sending solution no. 16 to node 17 
Sending solution no. 17 to node 18 
Sending solution no. 18 to node 19 
Sending solution no. 19 to node 20 
Node 1 receiving solution no. 0 
Node 2 receiving solution no. 1 
Node 12 receiving solution no. 19 
Node 3 receiving solution no. 19 
Node 15 receiving solution no. 19 
Node 13 receiving solution no. 19 
Node 4 receiving solution no. 19 
Node 9 receiving solution no. 19 
Node 10 receiving solution no. 19 
Node 14 receiving solution no. 19 
Node 6 receiving solution no. 19 
Node 5 receiving solution no. 19 
Node 11 receiving solution no. 19 
Node 8 receiving solution no. 19 
Node 16 receiving solution no. 19 
Node 19 receiving solution no. 19 
Node 20 receiving solution no. 19 
Node 1 finished. 
Node 2 finished. 
Node 7 receiving solution no. 19 
0, 0 
1, 1 
2, 361 
3, 361 
4, 361 
5, 361 
6, 361 
7, 361 
8, 361 
9, 361 
10, 361 
11, 361 
12, 361 
13, 361 
14, 361 
15, 361 
16, 361 
17, 361 
18, 361 
19, 361 
Node 6 finished. 
Node 3 finished. 
Node 17 receiving solution no. 19 
Node 17 finished. 
Node 10 finished. 
Node 12 finished. 
Node 8 finished. 
Node 4 finished. 
Node 15 finished. 
Node 18 receiving solution no. 19 
Node 18 finished. 
Node 11 finished. 
Node 13 finished. 
Node 20 finished. 
Node 16 finished. 
Node 9 finished. 
Node 19 finished. 
Node 7 finished. 
Node 5 finished. 
Node 14 finished. 






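Your compiler has optimized the crap out of your "solutions[sol] = new Solution;" loop and concluded that it can jump to the end of all the num_solutions++ increments. It is of course wrong to do so, but that is what has happened.

It is possible, though very unlikely, that an automatically threading or automatically parallelizing compiler has caused the 20 instances of num_solutions++ to occur in semi-random order relative to the 20 instances of solution_num = num_solutions in the constructor initializer list of Solution(). It is far more likely that an optimization went horribly wrong.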


Try replacing

for (int sol = 0; sol < numSolutions; ++sol)
    solutions[sol] = new Solution;

with

for (int sol = 0; sol < numSolutions; ++sol)
    solutions[sol] = new Solution(sol);



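No luck, I'm afraid. The solution numbers are correct on the master node; they just aren't being sent correctly to the slaves. It could still be some optimization gone wrong, though: extracting the solution numbers into a vector before sending them to the slaves seems to work for this simple code, but not for my "real" code. – 2010-10-26 16:24:11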


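OK, I think I have the answer, and it requires some knowledge of the underlying C-style MPI calls. Boost's 'isend' function is essentially a wrapper around 'MPI_Isend', and it does not shield the user from needing to know some of the details of how 'MPI_Isend' works. 'MPI_Isend' takes a pointer to the buffer holding the data to send, and that buffer must not be modified or reused until the send has completed.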


Consider first a version that copies the solution numbers into a vector before sending. This appears to work, because each number has its own location in memory and nothing overwrites it while the sends are in flight:

// Get solution numbers from the solutions and store in a vector
vector<int> solutionNums(numSolutions);
for (int sol = 0; sol < numSolutions; ++sol)
    solutionNums[sol] = solutions[sol]->solutionNum();

// Send solution numbers
for (int sol = 0; sol < numSolutions; ++sol)
{
    world.isend(sol + 1, 0, false); // Indicates that we have not finished, and to expect a solution representation
    cout << "Sending solution no. " << solutionNums[sol] << " to node " << sol + 1 << endl;
    world.isend(sol + 1, 1, solutionNums[sol]);
}


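Now consider a small change. Here the underlying 'MPI_Isend' call is given a pointer to solNum, and that piece of memory is overwritten on every pass through the loop. It may look as if the number 4 is sent to node 5, but by the time the send actually takes place the location can already hold a later value (19, for example), and that is what gets transmitted:

// Create solutionNum array
vector<int> solutionNums(numSolutions);
for (int sol = 0; sol < numSolutions; ++sol)
    solutionNums[sol] = solutions[sol]->solutionNum();

// Send solutions
for (int sol = 0; sol < numSolutions; ++sol)
{
    int solNum = solutionNums[sol];
    world.isend(sol + 1, 0, false); // Indicates that we have not finished, and to expect a solution representation
    cout << "Sending solution no. " << solNum << " to node " << sol + 1 << endl;
    world.isend(sol + 1, 1, solNum);
}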



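The original code has the same problem. It passes a temporary, the return value of solutionNum(), and the memory holding that temporary is likewise reused on each pass through the loop, so the wrong data reaches the slave nodes:

// Send solutions
for (int sol = 0; sol < numSolutions; ++sol)
{
    world.isend(sol + 1, 0, false); // Tells the slave to expect work
    cout << "Sending solution no. " << solutions[sol]->solutionNum() << " to node " << sol + 1 << endl;
    world.isend(sol + 1, 1, solutions[sol]->solutionNum());
}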




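I do wonder whether Boost's 'isend' implementation should protect the user a bit more from these issues, if it is aiming to be a friendlier interface to MPI. I guess it might be the usual balancing act between efficiency and safety? – 2010-10-27 14:35:45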


+1 for documenting the answer for posterity – Gorgen 2010-10-27 14:41:22


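I think I stumbled upon a similar problem today. When serializing a custom data type, I noticed that it was (sometimes) corrupted on the other side. The fix was to store the return value of isend. If you look at communicator::isend_impl(int dest, int tag, const T& value, mpl::false_) in the Boost.MPI sources, you will see that the serialized data is put into the request as a shared pointer. If the request is destroyed again, the data is invalidated and anything might happen.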



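Building on milianw's answer: my impression is that the correct way to use isend is to keep the request object it returns and to check that it has completed, using its test() or wait() method, before making another call to isend. I think it would also work to keep calling isend() and push the request objects onto a vector. You can then test or wait on those requests with {test,wait}_{any,some,all}.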

