2013-02-08

I have installed Open MPI, not under /usr/... but under /commun/data/packages/openmpi/. It was compiled with --with-sge so that Open MPI jobs can be submitted to SGE.

I added a new PE to SGE as described in http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html

# /commun/data/packages/openmpi/bin/ompi_info | grep gridengine 
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3) 

# qconf -sq all.q | grep pe_ 
pe_list    make orte 

Without SGE, the program runs on multiple processors without any problem:

/commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args 

Now I want to submit my program to SGE.

In the Open MPI FAQ, I read:

# Allocate a SGE interactive job with 4 slots 
# from a parallel environment (PE) named 'orte' 
shell$ qsh -pe orte 4 

but my output is:

qsh -pe orte 4 
Your job 84550 ("INTERACTIVE") has been submitted 
waiting for interactive job to be scheduled ... 
Could not start interactive job. 

I also tried embedding the mpirun command in a script:

$ cat ompi.sh 
#!/bin/sh 
/commun/data/packages/openmpi/bin/mpirun \ 
    /path/to/a.out args 

but it fails:

$ cat ompi.sh.e84552 
error: executing task of job 84552 failed: execution daemon on host "node02" didn't accept task 
-------------------------------------------------------------------------- 
A daemon (pid 18327) died unexpectedly with status 1 while attempting 
to launch so we are aborting. 

There may be more information reported by the environment (see above). 

This may be because the daemon was unable to find all the needed shared 
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the 
location of the shared libraries on the remote nodes and this will 
automatically be forwarded to the remote nodes. 
-------------------------------------------------------------------------- 
error: executing task of job 84552 failed: execution daemon on host "node01" didn't accept task 
-------------------------------------------------------------------------- 
mpirun noticed that the job aborted, but has no info as to the process 
that caused that situation. 

How can I fix this?


Answered on the Open MPI mailing list: http://www.open-mpi.org/community/lists/users/2013/02/21360.php

Answer

In my case, setting "job_is_first_task FALSE" and "control_slaves TRUE" in the parallel environment solved the problem:

# qconf -mp mpi1 

pe_name   mpi1 
slots    9 
user_lists   NONE 
xuser_lists  NONE 
start_proc_args /bin/true 
stop_proc_args  /bin/true 
allocation_rule $fill_up 
control_slaves  TRUE 
job_is_first_task FALSE 
urgency_slots  min 
accounting_summary FALSE
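
To confirm that a parallel environment carries the two settings above, a small check can be run over the text that `qconf -sp` prints. This is only a sketch: the function name `check_pe` is hypothetical, and it assumes the PE definition arrives on standard input in the same "key value" layout shown in the listing above.

```shell
#!/bin/sh
# check_pe (hypothetical helper): reads a PE definition on stdin, in the
# "key value" layout printed by `qconf -sp <pe_name>`, and reports whether
# control_slaves is TRUE and job_is_first_task is FALSE.
check_pe() {
    awk '
        $1 == "control_slaves"    { cs = $2 }
        $1 == "job_is_first_task" { jf = $2 }
        END {
            if (cs == "TRUE" && jf == "FALSE") {
                print "ok"
                exit 0
            }
            # Show what was found so the mismatch is easy to spot.
            print "control_slaves=" cs " job_is_first_task=" jf
            exit 1
        }'
}

# Usage on a cluster where qconf is available (not executed here):
#   qconf -sp mpi1 | check_pe
```

If the check prints "ok", the PE matches the configuration that worked in this answer. The job then has to request that PE at submission time, for example with `qsub -pe mpi1 4 ompi.sh`, where the slot count 4 is just an illustrative value, and the PE has to appear in the queue's pe_list.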