PBS keeps aborting my job
I am requesting 14 processors from a single node (each node has 32), as shown:
#PBS -l nodes=1:ppn=14
#PBS -l walltime=12:00:00
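With a lower ppn it almost always works, but once the count goes above roughly 14, the job starts executing and terminates immediately. The tracejob output is singularly unhelpful: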
tracejob 14753.hpc2
Job: 14753.hpc2
01/21/2017 11:12:36 L Considering job to run
01/21/2017 11:12:36 L Job run
01/21/2017 11:12:36 M Resource_List.place = scatter
01/21/2017 11:12:36 M make_cpuset, vnode hpc2[0]: hv_ncpus (2) > mvi_acpus (0) (you are not expected to understand this)
01/21/2017 11:12:36 M start_exec, new_cpuset failed
01/21/2017 11:12:36 M kill_job
01/21/2017 11:12:36 M hpc2 cput= 0:00:00 mem=0kb
01/21/2017 11:12:37 M Obit sent
01/21/2017 11:12:37 M copy file request received
01/21/2017 11:12:37 M staged 2 items out over 0:00:00
01/21/2017 11:12:37 M delete job request received
01/21/2017 11:12:37 M delete job request received
01/21/2017 11:12:38 M no active tasks
01/21/2017 11:12:38 M delete job request received
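I have occasionally succeeded in requesting more CPUs, so the failure is not entirely deterministic. Is there a way to debug this?
As a side note, any job that requests multiple nodes stays in the queue forever and never starts. I don't know whether this is related.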
Which resource manager and version are you using? Same question for the scheduler. – clusterdude