2017-09-01

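I am running code on Bluehive. The code takes a parameter N. If N is small, the code runs fine. But for a slightly larger N (e.g. N = 10), the code runs for several hours, and at the end I get the following error message: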

slurmstepd: error: Exceeded step memory limit at some point. 

The batch file I submit contains the following:

#!/bin/bash 
#SBATCH -o log.%a.txt -t 3-01:01:00 
#SBATCH --mem-per-cpu=1gb 
#SBATCH -c 4 
#SBATCH --gres=gpu:1 
#SBATCH -J Ankani 
#SBATCH -a 1-2 

python run.py $SLURM_ARRAY_TASK_ID 

I have allocated enough memory for the code, but I still get the error:

"slurmstepd: error: Exceeded step memory limit at some point." 

Can anyone help?

Answer

However, I will note that the memory limit described by "step memory limit" in this error message is not necessarily related to the RSS of your process. This limit is provided and enforced by the cgroup plugin, and memory cgroups track not only the RSS of the tasks in your job but also file cache, mmap pages, etc. If I had to guess, you are hitting the memory limit because of page cache. In that case, you might be able to just ignore this error, since hitting the limit here probably just triggered memory reclaim, which freed cached pages (this shouldn't be a fatal error).
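To check whether page cache rather than RSS accounts for the usage, one option is to look at the counters in the job's cgroup `memory.stat` file. The sketch below only parses the text of such a file. The helper names are hypothetical, and where the file lives for a given job step depends on how the cluster's cgroups are set up, so the path is left to the caller. It assumes the usual counter names: `cache` and `rss` under cgroup v1, `file` and `anon` under cgroup v2.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" lines.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def split_cache_and_rss(stats):
    """Return (page_cache_bytes, rss_bytes) from parsed memory.stat counters."""
    # Assumption: cgroup v1 uses "cache"/"rss", cgroup v2 uses "file"/"anon".
    cache = stats.get("cache", stats.get("file", 0))
    rss = stats.get("rss", stats.get("anon", 0))
    return cache, rss


def report(path):
    """Print cache and RSS for the memory.stat file at `path` (caller supplies it)."""
    with open(path) as f:
        cache, rss = split_cache_and_rss(parse_memory_stat(f.read()))
    print(f"page cache: {cache} bytes, rss: {rss} bytes")
```

If the cache figure is much larger than the RSS figure while the job is writing output, that would be consistent with the page-cache explanation above.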

If you'd like to avoid the error, and you're only writing out data and don't want it cached, then you could try posix_fadvise(2) with POSIX_FADV_DONTNEED, which hints to the VM that you aren't going to read the pages you're writing out again.
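Since the job here is a Python script, the same hint is available through `os.posix_fadvise` (Unix only, Python 3.3+). The following is a minimal sketch, not the original poster's code: `write_and_drop_cache` is a hypothetical helper name, and the `fsync` call is an added assumption on my part, because the kernel can generally only drop pages that have already been written back to disk.

```python
import os


def write_and_drop_cache(path, data):
    """Write bytes to `path`, then advise the kernel to drop the cached pages."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Flush dirty pages to disk first; dirty pages cannot simply be dropped.
        os.fsync(fd)
        # Offset 0 with length 0 means "the whole file".
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

The call is only advice, so the kernel is free to ignore it. For long-running writers it may be worth calling it periodically on the range already written rather than once at the end.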

Here is the source of this text.