
I am training an LSTM in Python with TensorFlow 1.2 on a GTX 1060 (6 GB), and TensorFlow raises a resource-exhausted error even though none of my resources appear to be exhausted.

At every epoch I save the model with this method:

def save_model(self, session, epoch, save_model_path):
    save_path = self.saver.save(session, save_model_path + "lstm_model_epoch_" + str(epoch) + ".ckpt")
    print("Model saved in file: %s" % save_path)

Everything works fine, but after the ninth epoch I get a ResourceExhaustedError when I try to save the model with this method.

I checked my resources during training and nothing was exhausted.

The error I get is the following:

2017-06-29 12:43:02.865845: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: log/example_0/lstm_models/lstm_model_epoch_9.ckpt.data-00000-of-00001.tempstate10865381291487648358
Traceback (most recent call last):
  File "main.py", line 32, in <module>
  File "/home/alb3rto/Scrivania/Tesi/sentiment_classification/text_lstm/LSTM_sentence.py", line 306, in train_lstm
  File "/home/alb3rto/Scrivania/Tesi/sentiment_classification/text_lstm/LSTM_sentence.py", line 449, in save_model
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1472, in save
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
tensorflow.python.framework.errors_impl.ResourceExhaustedError: log/example_0/lstm_models/lstm_model_epoch_9.ckpt.data-00000-of-00001.tempstate10865381291487648358
	 [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, Variable/_21, Variable/Adam/_23, Variable/Adam_1/_25, Variable_1/_27, Variable_1/Adam/_29, Variable_1/Adam_1/_31, beta1_power/_33, beta2_power/_35, rnn/basic_lstm_cell/bias/_37, rnn/basic_lstm_cell/bias/Adam/_39, rnn/basic_lstm_cell/bias/Adam_1/_41, rnn/basic_lstm_cell/kernel/_43, rnn/basic_lstm_cell/kernel/Adam/_45, rnn/basic_lstm_cell/kernel/Adam_1/_47)]]

Caused by op u'save/SaveV2', defined at:
  File "main.py", line 28, in <module>
    lstm_sentence = lstm()
  File "/home/alb3rto/Scrivania/Tesi/sentiment_classification/text_lstm/LSTM_sentence.py", line 18, in __init__
  File "/home/alb3rto/Scrivania/Tesi/sentiment_classification/text_lstm/LSTM_sentence.py", line 117, in build_lstm
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 689, in build
    save_tensor = self._AddSaveOps(filename_tensor, saveables)
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 276, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 219, in save_op
    tensors)
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 745, in save_v2
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/alb3rto/anaconda2/envs/tesi/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): log/example_0/lstm_models/lstm_model_epoch_9.ckpt.data-00000-of-00001.tempstate10865381291487648358
	 [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, Variable/_21, Variable/Adam/_23, Variable/Adam_1/_25, Variable_1/_27, Variable_1/Adam/_29, Variable_1/Adam_1/_31, beta1_power/_33, beta2_power/_35, rnn/basic_lstm_cell/bias/_37, rnn/basic_lstm_cell/bias/Adam/_39, rnn/basic_lstm_cell/bias/Adam_1/_41, rnn/basic_lstm_cell/kernel/_43, rnn/basic_lstm_cell/kernel/Adam/_45, rnn/basic_lstm_cell/kernel/Adam_1/_47)]]

How can I solve this?

Answer


When you hit an OOM / ResourceExhaustedError on the GPU, I believe changing (reducing) the batch size is the right thing to try first.

Different GPUs may need different batch sizes, depending on how much GPU memory you have.
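As a rough illustration (a sketch only; the placeholder shapes, the batch_size value and the random stand-in data are assumptions, not your model), reducing the batch size simply means feeding fewer examples to each sess.run call:

    import numpy as np
    import tensorflow as tf

    # Hypothetical shapes: 25 time steps, 300-dim embeddings, 2 classes.
    inputs = tf.placeholder(tf.float32, [None, 25, 300])
    labels = tf.placeholder(tf.float32, [None, 2])
    # ... model / loss / train_op built on top of inputs and labels ...

    batch_size = 32  # try 32 instead of, say, 128 when you hit OOM
    train_x = np.random.rand(1000, 25, 300).astype(np.float32)  # stand-in data
    train_y = np.random.rand(1000, 2).astype(np.float32)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for start in range(0, len(train_x), batch_size):
            batch_x = train_x[start:start + batch_size]
            batch_y = train_y[start:start + batch_size]
            # sess.run(train_op, feed_dict={inputs: batch_x, labels: batch_y})

Smaller batches mean smaller activation tensors on the GPU, at the cost of more steps per epoch.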

I recently ran into a similar problem, tweaked a lot of settings and ran different kinds of experiments.

Here is the link to the question (it also includes some tricks).

However, while reducing the batch size you may find that your training gets slower. So if you have multiple GPUs, you can use them. To check your GPUs, you can run the following in a terminal:

nvidia-smi 

It will show you the necessary information about your GPUs.
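If nvidia-smi reports more than one device, the usual TensorFlow 1.x pattern is to pin parts of the graph to specific GPUs with tf.device. A minimal sketch (the two-way tower split and the toy loss are assumptions, not your model):

    import tensorflow as tf

    # Split a toy computation across two GPUs; each tower gets half the batch.
    inputs = tf.placeholder(tf.float32, [None, 300])

    towers = []
    for i, part in enumerate(tf.split(inputs, 2, axis=0)):
        with tf.device("/gpu:%d" % i):
            towers.append(tf.reduce_mean(tf.square(part)))

    loss = tf.add_n(towers) / len(towers)

    # allow_soft_placement falls back to CPU if a listed GPU is not available.
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(loss, feed_dict={inputs: [[1.0] * 300, [2.0] * 300]}))

With the batch split across devices, each GPU only holds its own share of the activations, so you can keep a larger effective batch size without running out of memory on a single card.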