mpirun: unrecognized argument mca

I have a C++ solver that I need to run in parallel using the following command:

nohup mpirun -np 16 ./my_exec > log.txt & 

This command runs my_exec independently on each of the 16 available processors on my node. It used to work perfectly.

Last week the HPC department performed an OS upgrade, and now, when launching the same command, I get two warning messages (one per processor). The first one is:

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

  http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:          tamnun
Registerable memory: 32768 MiB
Total memory:        98294 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------

Then I get output from my code telling me that it thinks only one instance was launched (Nprocs = 1 instead of 16):

# MPI IS ON; Nprocs = 1
Filename = ../input/odtParam.inp

# MPI IS ON; Nprocs = 1

***** Error, process 0 failed to create ../data/data_0/, or it was already there

Finally, the second warning message is:

--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

Local host:          tamnun (PID 17446)
MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------

After searching online, I tried to follow the warning message's suggestion and set the MCA parameter mpi_warn_on_fork to 0 with the following command:

nohup mpirun --mca mpi_warn_on_fork 0 -np 16 ./my_exec > log.txt & 

which produced this error message:

[mpiexec@tamnun] match_arg (./utils/args/args.c:194): unrecognized argument mca 
[mpiexec@tamnun] HYDU_parse_array (./utils/args/args.c:214): argument matching returned error 
[mpiexec@tamnun] parse_args (./ui/mpich/utils.c:2964): error parsing input array 
[mpiexec@tamnun] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:3238): unable to parse user arguments 

I am using RedHat 6.7 (Santiago). I have contacted the HPC department, but since I am at a university it may take a day or two to get a response. Any help or guidance would be appreciated.

EDIT in response to the answer:

Indeed, I was compiling my code with Open MPI's mpic++ but executing it with Intel's mpirun command, hence the error (after the OS upgrade, Intel's mpirun was set as the default). I had to put the path to Open MPI's mpirun at the beginning of the $PATH environment variable.
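To verify which launcher is being picked up, and to put Open MPI first, something like the sketch below works (the Open MPI install prefix /usr/lib64/openmpi is an assumption; the actual location on the cluster may differ):

# Check which mpirun is currently first in $PATH
which mpirun
mpirun --version    # Open MPI prints "mpirun (Open MPI) x.y.z"

# Prepend the Open MPI bin directory (assumed path; adjust to the cluster)
export PATH=/usr/lib64/openmpi/bin:$PATH

# Confirm the change took effect
which mpirun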

The code now runs as expected, but I still get the first warning message above (it no longer suggests using the MCA parameter mpi_warn_on_fork). I think (but am not sure) this is an issue I need to resolve with the HPC department.
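For reference, the Open MPI FAQ linked in the warning suggests raising the kernel module parameters that cap registerable memory. A minimal sketch of what the HPC admins would check, assuming a Mellanox mlx4 InfiniBand adapter (other hardware uses different modules, and the change requires root):

# Inspect the current limits (assuming the mlx4_core driver)
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

# Registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * page_size,
# so raising log_num_mtt (root only, module reload needed) lifts the 32768 MiB cap:
echo "options mlx4_core log_num_mtt=24" >> /etc/modprobe.d/mlx4_core.conf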


You have a typo in the command: it is mpi_warn_on_fork (you wrote "work"). – Marco


Ha, right. I used the correct command; the typo is only in the posted question. – solalito

Answers

[mpiexec@tamnun] match_arg (./utils/args/args.c:194): unrecognized argument mca 
[mpiexec@tamnun] HYDU_parse_array (./utils/args/args.c:214): argument matching returned error 
[mpiexec@tamnun] parse_args (./ui/mpich/utils.c:2964): error parsing input array 
                                  ^^^^^
[mpiexec@tamnun] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:3238): unable to parse user arguments 
                                                  ^^^^^

In that last case you are using MPICH. MPICH is not Open MPI, and its process launcher does not recognize the --mca argument, which is specific to Open MPI (MCA stands for Modular Component Architecture, the fundamental framework that Open MPI is built on top of). A typical case of mixing multiple MPI implementations.
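One way to tell the implementations apart from the command line is sketched below; the version banners are what the common implementations print, though the exact wording varies by version:

which mpirun       # which launcher is first in $PATH?
mpirun --version   # Open MPI prints "mpirun (Open MPI) x.y.z";
                   # MPICH's Hydra prints "HYDRA build details:";
                   # Intel MPI prints "Intel(R) MPI Library ...".

As an aside, Open MPI MCA parameters can also be set through the environment (e.g. export OMPI_MCA_mpi_warn_on_fork=0), which avoids launcher-specific flags altogether.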


Thanks for your answer! However, I am not sure where to start to fix it. Any suggestions? – solalito


First, learn which MPI implementations are installed on the machine and how to switch between them. Also, make sure that you use the 'mpirun' from the same implementation that was used to compile the program. Compiling with one MPI and running with the runtime of another does not work properly; instead you get a bunch of singleton processes, each with rank 0 in its own 'MPI_COMM_WORLD', exactly as you have observed. –
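One robust way to keep the compiler wrapper and the launcher matched is to call both by full path from the same installation. A sketch under assumed paths (the prefix /usr/lib64/openmpi and the source file name my_solver.cpp are both placeholders):

# Compile and launch with tools from the same (assumed) Open MPI prefix
/usr/lib64/openmpi/bin/mpic++ -O2 -o my_exec my_solver.cpp
nohup /usr/lib64/openmpi/bin/mpirun -np 16 ./my_exec > log.txt &

# Open MPI's wrapper can also reveal the underlying compiler command:
/usr/lib64/openmpi/bin/mpic++ --showme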


See my edit. I will accept your answer since it put me on the right track (once the problem was diagnosed, the solution was easy). – solalito