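Testing MPI on a cluster
I am learning OpenMPI on a cluster. Here is my first example. I expected the output to show responses from different nodes, but every process responds from the same node, node062. Why does this happen, and how can I get reports from different nodes, to show that MPI is actually distributing the processes across nodes? Thanks and regards!
ex1.c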
/* test of MPI */
#include "mpi.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char idstr[2232]; char buff[22128];
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int numprocs; int myid; int i; int namelen;
    MPI_Status stat;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name, &namelen);

    if(myid == 0)
    {
        printf("WE have %d processors\n", numprocs);
        for(i=1;i<numprocs;i++)
        {
            sprintf(buff, "Hello %d", i);
            MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        }
        for(i=1;i<numprocs;i++)
        {
            MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
            printf("%s\n", buff);
        }
    }
    else
    {
        MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        sprintf(idstr, " Processor %d at node %s ", myid, processor_name);
        strcat(buff, idstr);
        strcat(buff, "reporting for duty\n");
        MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}
ex1.pbs
#!/bin/sh
#
#This is an example script example.sh
#
#These commands set up the Grid Environment for your job:
#PBS -N ex1
#PBS -l nodes=10:ppn=1,walltime=1:10:00
#PBS -q dque
# export OMP_NUM_THREADS=4
mpirun -np 10 /home/tim/courses/MPI/examples/ex1
Compile and run:
$ mpicc ./ex1.c -o ex1
$ qsub ex1.pbs
35540.mgt
$ nano ex1.o35540
----------------------------------------
Begin PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883
Job ID: 35540.mgt
Username: tim
Group: Brown
Nodes: node062 node063 node169 node170 node171 node172 node174 node175
node176 node177
End PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883
----------------------------------------
WE have 10 processors
Hello 1 Processor 1 at node node062 reporting for duty
Hello 2 Processor 2 at node node062 reporting for duty
Hello 3 Processor 3 at node node062 reporting for duty
Hello 4 Processor 4 at node node062 reporting for duty
Hello 5 Processor 5 at node node062 reporting for duty
Hello 6 Processor 6 at node node062 reporting for duty
Hello 7 Processor 7 at node node062 reporting for duty
Hello 8 Processor 8 at node node062 reporting for duty
Hello 9 Processor 9 at node node062 reporting for duty
----------------------------------------
Begin PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891
Job ID: 35540.mgt
Username: tim
Group: Brown
Job Name: ex1
Session: 15533
Limits: neednodes=10:ppn=1,nodes=10:ppn=1,walltime=01:10:00
Resources: cput=00:00:00,mem=420kb,vmem=8216kb,walltime=00:00:03
Queue: dque
Account:
Nodes: node062 node063 node169 node170 node171 node172 node174 node175 node176
node177
Killing leftovers...
End PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891
----------------------------------------
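A possible explanation for every rank reporting node062 is that this mpirun was never told which hosts PBS allocated, so it started all 10 processes on the node where the script itself runs. The sketch below only builds and prints the command line that would pass the allocation to mpirun. It assumes that PBS sets PBS_NODEFILE inside the job to a file with one host name per line. build_mpirun_cmd is a hypothetical helper name, not part of PBS or Open MPI.

```shell
#!/bin/sh
# Assumption: inside a PBS job, PBS_NODEFILE points to a file that lists
# one allocated host per line.

# Hypothetical helper: print the mpirun command line for a given node file.
# $1 = node file, $2 = executable to launch
build_mpirun_cmd() {
    nodefile=$1
    exe=$2
    # One process per line in the node file.
    np=$(wc -l < "$nodefile" | tr -d ' ')
    echo "mpirun -np $np --machinefile $nodefile $exe"
}

if [ -n "${PBS_NODEFILE:-}" ]; then
    build_mpirun_cmd "$PBS_NODEFILE" /home/tim/courses/MPI/examples/ex1
else
    echo "PBS_NODEFILE is not set; this is meant to run inside a PBS job" >&2
fi
```

In a real job script the printed command would be run directly instead of echoed. Whether the machine file is needed at all depends on how the local Open MPI was built.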
UPDATE:
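I would like to run several background jobs in one PBS script, so that the jobs can run at the same time. For example, in the script above I added a second call that runs ex1 and changed both runs to background jobs in ex1.pbs: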
#!/bin/sh
#
#This is an example script example.sh
#
#These commands set up the Grid Environment for your job:
#PBS -N ex1
#PBS -l nodes=10:ppn=1,walltime=1:10:00
#PBS -q dque
echo "The first job starts!"
mpirun -np 5 --machinefile /home/tim/courses/MPI/examples/machinefile /home/tim/courses/MPI/examples/ex1 &
echo "The first job ends!"
echo "The second job starts!"
mpirun -np 5 --machinefile /home/tim/courses/MPI/examples/machinefile /home/tim/courses/MPI/examples/ex1 &
echo "The second job ends!"
(1) The result is fine after submitting this script with qsub, using the previously compiled executable ex1.
The first job starts!
The first job ends!
The second job starts!
The second job ends!
WE have 5 processors
WE have 5 processors
Hello 1 Processor 1 at node node063 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 1 Processor 1 at node node063 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
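(2) However, I think ex1 finishes too quickly, so the two background jobs probably do not overlap much in time, which is not the case when I apply the same approach to my real project. So I added sleep(30) to ex1.c to extend the running time of ex1, so that the two jobs running ex1 in the background run simultaneously almost all the time.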
/* test of MPI */
#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char idstr[2232]; char buff[22128];
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int numprocs; int myid; int i; int namelen;
    MPI_Status stat;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name, &namelen);

    if(myid == 0)
    {
        printf("WE have %d processors\n", numprocs);
        for(i=1;i<numprocs;i++)
        {
            sprintf(buff, "Hello %d", i);
            MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        }
        for(i=1;i<numprocs;i++)
        {
            MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
            printf("%s\n", buff);
        }
    }
    else
    {
        MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        sprintf(idstr, " Processor %d at node %s ", myid, processor_name);
        strcat(buff, idstr);
        strcat(buff, "reporting for duty\n");
        MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    sleep(30); // new added to extend the running time
    MPI_Finalize();
}
But after recompiling and submitting with qsub again, the results do not look right: some processes were aborted. In ex1.o35571:
The first job starts!
The first job ends!
The second job starts!
The second job ends!
WE have 5 processors
WE have 5 processors
Hello 1 Processor 1 at node node063 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
Hello 1 Processor 1 at node node063 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
4 additional processes aborted (not shown)
4 additional processes aborted (not shown)
In ex1.e35571:
mpirun: killing job...
mpirun noticed that job rank 0 with PID 25376 on node node062 exited on signal 15 (Terminated).
mpirun: killing job...
mpirun noticed that job rank 0 with PID 25377 on node node062 exited on signal 15 (Terminated).
I do not understand why processes were aborted. How can I correctly run background jobs in a PBS script?
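One plausible reading of the signal 15 messages is that the script reaches its end right after putting both mpirun calls in the background, and the batch system then cleans up whatever is still running (the epilogue above prints "Killing leftovers..."). The usual shell remedy is to call wait before the script exits. The sketch below shows the pattern with a placeholder job; run_job and run_both are hypothetical names, and sleep 1 stands in for a long-running mpirun.

```shell
#!/bin/sh
# Placeholder for a long-running command such as an mpirun call.
run_job() {
    sleep 1
    echo "job $1 finished"
}

# Start two jobs in the background, then block until both have exited.
run_both() {
    run_job 1 &
    pid1=$!
    run_job 2 &
    pid2=$!
    # Without these waits the script would end while the jobs still run.
    wait "$pid1"
    wait "$pid2"
    echo "all jobs done"
}

run_both
```

Applied to ex1.pbs, this would mean adding a wait line after the second backgrounded mpirun, so that the script does not return until both runs have completed.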
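Thanks a lot! That solved my problem. If mpiexec is given $PBS_NODEFILE as the machine file, will it notice when some nodes are reserved by other users or are already running other jobs? Could you also try to answer my question about using PBS on a cluster, posted at http://superuser.com/questions/102812/torch-in-cluster? Thanks in advance! – Tim 2010-01-31 03:46:52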