2010-10-26 14 views
3

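Boost.MPI: What's received isn't what was sent!

I am relatively new to using Boost MPI. I have the libraries installed and the code compiles, but I am getting a very odd error: some of the integer data received by the slave nodes is not what the master sent. What is going on?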

I am using Boost version 1.42.0 and compiling the code with mpic++ (which wraps g++ on one cluster and icpc on the other). A reduced example follows, including the output.

Code:

#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <vector>
#include <boost/mpi.hpp>

using namespace std; 
namespace mpi = boost::mpi; 

class Solution 
{ 
public: 
    Solution() : 
    solution_num(num_solutions++) 
    { 
    // Master node's constructor 
    } 

    Solution(int solutionNum) : 
    solution_num(solutionNum) 
    { 
    // Slave nodes' constructor. 
    } 

    int solutionNum() const 
    { 
    return solution_num; 
    } 

private: 
    static int num_solutions; 
    int solution_num; 
}; 

int Solution::num_solutions = 0; 

int main(int argc, char* argv[]) 
{ 
    // Initialization of MPI 
    mpi::environment env(argc, argv); 
    mpi::communicator world; 

    if (world.rank() == 0) 
    { 
    // Create solutions 
    int numSolutions = world.size() - 1; // One solution per slave 
    vector<Solution*> solutions(numSolutions); 
    for (int sol = 0; sol < numSolutions; ++sol) 
    { 
     solutions[sol] = new Solution; 
    } 

    // Send solutions 
    for (int sol = 0; sol < numSolutions; ++sol) 
    { 
     world.isend(sol + 1, 0, false); // Tells the slave to expect work 
     cout << "Sending solution no. " << solutions[sol]->solutionNum() << " to node " << sol + 1 << endl; 
     world.isend(sol + 1, 1, solutions[sol]->solutionNum()); 
    } 

    // Retrieve values (solution numbers squared) 
    vector<double> values(numSolutions, 0); 
    for (int i = 0; i < numSolutions; ++i) 
    { 
     // Get values for each solution 
     double value = 0; 
     mpi::status status = world.recv(mpi::any_source, 2, value); 
     int source = status.source(); 

     int sol = source - 1; 
     values[sol] = value; 
    } 
    for (int i = 1; i <= numSolutions; ++i) 
    { 
     world.isend(i, 0, true); // Tells the slave to finish 
    } 

    // Output the solutions numbers and their squares 
    for (int i = 0; i < numSolutions; ++i) 
    { 
     cout << solutions[i]->solutionNum() << ", " << values[i] << endl; 
     delete solutions[i]; 
    } 
    } 
    else 
    { 
    // Slave nodes merely square the solution number 
    bool finished; 
    mpi::status status = world.recv(0, 0, finished); 
    while (!finished) 
    { 
     int solNum; 
     world.recv(0, 1, solNum); 
     cout << "Node " << world.rank() << " receiving solution no. " << solNum << endl; 

     Solution solution(solNum); 
     double value = static_cast<double>(solNum * solNum); 
     world.send(0, 2, value); 

     status = world.recv(0, 0, finished); 
    } 

    cout << "Node " << world.rank() << " finished." << endl; 
    } 

    return EXIT_SUCCESS; 
} 

Running this on 21 nodes (1 master, 20 slaves) produces:

Sending solution no. 0 to node 1 
Sending solution no. 1 to node 2 
Sending solution no. 2 to node 3 
Sending solution no. 3 to node 4 
Sending solution no. 4 to node 5 
Sending solution no. 5 to node 6 
Sending solution no. 6 to node 7 
Sending solution no. 7 to node 8 
Sending solution no. 8 to node 9 
Sending solution no. 9 to node 10 
Sending solution no. 10 to node 11 
Sending solution no. 11 to node 12 
Sending solution no. 12 to node 13 
Sending solution no. 13 to node 14 
Sending solution no. 14 to node 15 
Sending solution no. 15 to node 16 
Sending solution no. 16 to node 17 
Sending solution no. 17 to node 18 
Sending solution no. 18 to node 19 
Sending solution no. 19 to node 20 
Node 1 receiving solution no. 0 
Node 2 receiving solution no. 1 
Node 12 receiving solution no. 19 
Node 3 receiving solution no. 19 
Node 15 receiving solution no. 19 
Node 13 receiving solution no. 19 
Node 4 receiving solution no. 19 
Node 9 receiving solution no. 19 
Node 10 receiving solution no. 19 
Node 14 receiving solution no. 19 
Node 6 receiving solution no. 19 
Node 5 receiving solution no. 19 
Node 11 receiving solution no. 19 
Node 8 receiving solution no. 19 
Node 16 receiving solution no. 19 
Node 19 receiving solution no. 19 
Node 20 receiving solution no. 19 
Node 1 finished. 
Node 2 finished. 
Node 7 receiving solution no. 19 
0, 0 
1, 1 
2, 361 
3, 361 
4, 361 
5, 361 
6, 361 
7, 361 
8, 361 
9, 361 
10, 361 
11, 361 
12, 361 
13, 361 
14, 361 
15, 361 
16, 361 
17, 361 
18, 361 
19, 361 
Node 6 finished. 
Node 3 finished. 
Node 17 receiving solution no. 19 
Node 17 finished. 
Node 10 finished. 
Node 12 finished. 
Node 8 finished. 
Node 4 finished. 
Node 15 finished. 
Node 18 receiving solution no. 19 
Node 18 finished. 
Node 11 finished. 
Node 13 finished. 
Node 20 finished. 
Node 16 finished. 
Node 9 finished. 
Node 19 finished. 
Node 7 finished. 
Node 5 finished. 
Node 14 finished. 

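So although the master sends 0 to node 1, 1 to node 2, 2 to node 3 and so on, most of the slave nodes (for some reason) receive the number 19. Instead of the squares of the numbers from 0 to 19, we get 0 squared, 1 squared, and 19 squared 18 times!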

Thanks in advance to anyone who can explain this.

Alan

Answers

2

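Your compiler has optimized the crap out of your "solutions[sol] = new Solution;" loop and concluded that it can jump to the end of all the num_solutions++ increments. It is of course wrong to do so, but that is what has happened.

It is possible, though unlikely, that an automatically threading or automatically parallelizing compiler has caused the 20 instances of num_solutions++ to occur in semi-random order relative to the 20 instances of solution_num = num_solutions in the initializer list of Solution(). It is much more likely that an optimization went horribly wrong.

Replace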

 
for (int sol = 0; sol < numSolutions; ++sol) 
    { 
     solutions[sol] = new Solution; 
    } 

with

for (int sol = 0; sol < numSolutions; ++sol) 
    { 
     solutions[sol] = new Solution(sol); 
    } 

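and your problem will go away. In particular, every Solution will get its own number, rather than whatever value the shared static happens to hold while the compiler incorrectly reorders the 20 increments.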

+0

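No luck, I'm afraid. The solution numbers are correct on the master node; they are just not being sent correctly to the slaves. It might still be some optimization gone wrong, though: extracting the solution numbers into a vector before sending them to the slaves seems to work for this simple code, but not for my "real" code. – 2010-10-26 16:24:11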

11

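OK, I think I have the answer, and it requires some knowledge of the underlying C-style MPI calls. Boost's 'isend' function is essentially a wrapper around 'MPI_Isend', and it does not shield the user from needing to know some of the details of how 'MPI_Isend' works.

One parameter of 'MPI_Isend' is a pointer to a buffer containing the information you wish to send. Importantly, however, this buffer must NOT be modified or reused until you know the send has completed. So consider the following code: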

// Get solution numbers from the solutions and store in a vector 
vector<int> solutionNums(numSolutions); 
for (int sol = 0; sol < numSolutions; ++sol) 
{ 
    solutionNums[sol] = solutions[sol]->solutionNum(); 
} 

// Send solution numbers 
for (int sol = 0; sol < numSolutions; ++sol) 
{ 
    world.isend(sol + 1, 0, false); // Indicates that we have not finished, and to expect a solution representation 
    cout << "Sending solution no. " << solutionNums[sol] << " to node " << sol + 1 << endl; 
    world.isend(sol + 1, 1, solutionNums[sol]); 
} 

This works fine, because each solution number has its own location in memory. Now consider the following minor adjustment:

// Create solutionNum array 
vector<int> solutionNums(numSolutions); 
for (int sol = 0; sol < numSolutions; ++sol) 
{ 
    solutionNums[sol] = solutions[sol]->solutionNum(); 
} 

// Send solutions 
for (int sol = 0; sol < numSolutions; ++sol) 
{ 
    int solNum = solutionNums[sol]; 
    world.isend(sol + 1, 0, false); // Indicates that we have not finished, and to expect a solution representation 
    cout << "Sending solution no. " << solNum << " to node " << sol + 1 << endl; 
    world.isend(sol + 1, 1, solNum); 
} 

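Now the underlying 'MPI_Isend' call is given a pointer to solNum. Unfortunately, this bit of memory is overwritten each time round the loop, so although it may look as if the number 4 is sent to node 5, by the time the send actually takes place the new contents of that memory location (say, 19) are passed instead.

Now consider the original code: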

// Send solutions 
for (int sol = 0; sol < numSolutions; ++sol) 
{ 
    world.isend(sol + 1, 0, false); // Tells the slave to expect work 
    cout << "Sending solution no. " << solutions[sol]->solutionNum() << " to node " << sol + 1 << endl; 
    world.isend(sol + 1, 1, solutions[sol]->solutionNum()); 
} 

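Here, we are passing a temporary. The temporary is destroyed at the end of the statement, and its memory location is overwritten each time round the loop. Again, the wrong data is sent to the slave nodes.

As it happens, I have been able to restructure my 'real' code to use 'send' instead of 'isend'. However, if I need to use 'isend' in future, I will be a bit more careful!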
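The effect can be reproduced without MPI at all. The following is a minimal sketch, not Boost.MPI code: PendingSend, deliver_all and the two send_with_* functions are hypothetical names invented for this illustration. It models a non-blocking send as something that records only the address of its payload and reads the value later.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a non-blocking send (an assumption for illustration, not
// the MPI API): it records only the ADDRESS of the payload, and the value is
// read later, when deliver_all() runs.
struct PendingSend
{
    const int* payload;
};

// Reads every recorded address, as a deferred transfer eventually would.
std::vector<int> deliver_all(const std::vector<PendingSend>& queue)
{
    std::vector<int> received;
    for (const PendingSend& p : queue)
    {
        received.push_back(*p.payload);
    }
    return received;
}

// One reused variable: every pending send points at the same address, so
// every delivery sees the last value written to it.
std::vector<int> send_with_shared_buffer(const std::vector<int>& numbers)
{
    std::vector<PendingSend> queue;
    int solNum = 0; // kept outside the loop so the recorded pointers stay valid
    for (std::size_t i = 0; i < numbers.size(); ++i)
    {
        solNum = numbers[i];
        queue.push_back(PendingSend{&solNum});
    }
    return deliver_all(queue);
}

// One vector element per send: each pending send has its own address.
std::vector<int> send_with_own_buffers(const std::vector<int>& numbers)
{
    std::vector<PendingSend> queue;
    for (std::size_t i = 0; i < numbers.size(); ++i)
    {
        queue.push_back(PendingSend{&numbers[i]});
    }
    return deliver_all(queue);
}
```

With the input 0, 1, 2, 3 the shared-buffer version delivers 3 four times, while the own-buffers version delivers 0, 1, 2, 3. Real MPI is less predictable than this model, because some sends may already have gone out before the buffer is overwritten, which would explain why nodes 1 and 2 in the output above received the right values.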

+1

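I do wonder whether Boost's 'isend' implementation ought to protect the user a bit more from these problems, if it aims to be a friendlier interface to MPI. I suppose it is the usual balancing act between efficiency and safety? – 2010-10-27 14:35:45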

+1

+1 for documenting the answer for posterity – Gorgen 2010-10-27 14:41:22

4

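I think I stumbled upon a similar problem today. When serializing a custom data type, I noticed that it was (sometimes) corrupted on the other side. The fix was to store the return value of isend. If you look at communicator::isend_impl(int dest, int tag, const T& value, mpl::false_), you will see that the serialized data is put into the request as a shared pointer. If the request is discarded, that data is released while the send may still be in progress, and anything might happen.

So: always keep the return value of isend!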
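The ownership argument can be sketched in plain C++. This is a toy model and not Boost.MPI's actual implementation: ToyRequest and toy_isend are hypothetical names, and the std::weak_ptr exists only so the example can observe when the buffer is freed.

```cpp
#include <memory>
#include <vector>

typedef std::vector<char> Buffer;

// Toy request (hypothetical): it is the only owner of the serialized bytes.
struct ToyRequest
{
    std::shared_ptr<Buffer> buffer;
};

// Toy send (hypothetical): copies the bytes into a shared buffer, hands sole
// ownership to the returned request, and lets `observer` watch its lifetime.
ToyRequest toy_isend(const Buffer& bytes, std::weak_ptr<Buffer>& observer)
{
    std::shared_ptr<Buffer> buffer = std::make_shared<Buffer>(bytes);
    observer = buffer;
    return ToyRequest{buffer};
}

// Return value discarded: the temporary request dies at the end of the
// statement and takes the buffer with it.
bool buffer_alive_when_request_dropped()
{
    std::weak_ptr<Buffer> observer;
    toy_isend(Buffer{'a', 'b'}, observer);
    return !observer.expired();
}

// Return value kept: the buffer lives as long as `req` does.
bool buffer_alive_when_request_kept()
{
    std::weak_ptr<Buffer> observer;
    ToyRequest req = toy_isend(Buffer{'a', 'b'}, observer);
    return !observer.expired();
}
```

In this model the first function returns false and the second returns true, which is the reason for keeping the request until the send is known to have completed.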

1

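Building on milianw's answer: my impression is that the correct way to use isend is to keep the request object it returns and to check that it has completed, using its test() or wait() methods, before making another call to isend. I think it would also work to keep calling isend() and push the request objects onto a vector. You can then test or wait on those requests with {test,wait}_{any,some,all}.

At some point you also need to worry about whether you are posting sends faster than the recipients can receive them, because sooner or later you will run out of MPI buffers. In my experience, this simply shows up as a crash.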
